[GitHub] [hudi] vinothsiva1989 opened a new issue #2031: [SUPPORT]

2020-08-24 Thread GitBox


vinothsiva1989 opened a new issue #2031:
URL: https://github.com/apache/hudi/issues/2031


   I am new to Hudi, please help.
   
   Writing a DataFrame (read from a Hive-backed Parquet table) to a Hudi COPY_ON_WRITE table with Hudi 0.6.0 on Spark 3.0.0 fails with a java.lang.NoSuchMethodError (see the stacktrace below).
   
   **To Reproduce**
   
   Steps to reproduce the behavior: 
   step1:
   spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.12:0.6.0,'org.apache.spark:spark-avro_2.12:3.0.0' \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.hive.convertMetastoreParquet=false'
   step2:
   import org.apache.spark.sql.SaveMode
   import org.apache.spark.sql.functions._ 
   import org.apache.hudi.DataSourceWriteOptions 
   import org.apache.hudi.config.HoodieWriteConfig 
   import org.apache.hudi.hive.MultiPartKeysValueExtractor
   import org.apache.spark.sql._
   
   step3: create hive table 
   create external table hudi_parquet (op string,
   pk_id int,
   name string,
   value int,
   updated_at timestamp,
   created_at timestamp)
   stored as parquet
   location '/user/vinoth.siva/hudi_parquet';
   
   step4: insert data into hive table
   insert into hudi_parquet values ('I',5,'htc',50,'2020-02-06 18:00:39','2020-02-06 18:00:39')
   
   step5: loading data into a DataFrame
   val df=spark.read.parquet("/user/vinoth.siva/hudi_parquet/00_0");
   df.show()
   
   Step6: Hudi options
val hudiOptions = Map[String,String](
 HoodieWriteConfig.TABLE_NAME -> "my_hudi_table",
 DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE", 
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "pk_id",
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "created_at",
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "updated_at",
 DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
 DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "my_hudi_table",
 DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "created_at",
 DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
classOf[MultiPartKeysValueExtractor].getName
   )
   
   step7: Writing hudi data.

   df.write.format("org.apache.hudi").
     option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL).
     options(hudiOptions).
     mode(SaveMode.Overwrite).
     save("/user/vinoth.siva/hudi_cow")

   
   
   **Expected behavior**
   
   The DataFrame should be written to the Hudi table at /user/vinoth.siva/hudi_cow without errors.
   
   **Environment Description**
   
   * Hudi version :0.6.0
   
   * Spark version :3
   
   * Hive version :1.2
   
   * Hadoop version :2.7
   
   * Storage (HDFS/S3/GCS..) :hdfs
   
   * Running on Docker? (yes/no)  no
   
   
   
   **Stacktrace**
   
   20/08/25 10:01:21 WARN hudi.HoodieSparkSqlWriter$: hoodie table at 
/user/vinoth.siva/hudi_cow already exists. Deleting existing data & overwriting 
with new data.
   20/08/25 10:01:22 ERROR executor.Executor: Exception in task 0.0 in stage 
3.0 (TID 3)
   java.lang.NoSuchMethodError: 'java.lang.Object 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(org.apache.spark.sql.catalyst.InternalRow)'
   at 
org.apache.hudi.AvroConversionUtils$.$anonfun$createRdd$1(AvroConversionUtils.scala:44)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
   at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271)
   at scala.collection.Iterator.foreach(Iterator.scala:941)
   at scala.collection.Iterator.foreach$(Iterator.scala:941)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
   at 
scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
   at 
scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
   at 
scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
   at 
scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
   at 
scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
   at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1423)
   at 
org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2133)
 

[jira] [Resolved] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide

2020-08-24 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu resolved HUDI-1103.
---
Resolution: Fixed

> Improve the code format of Delete data demo in Quick-Start Guide
> 
>
> Key: HUDI-1103
> URL: https://issues.apache.org/jira/browse/HUDI-1103
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: wangxianghu
>Assignee: Trevorzhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Currently, the delete data demo code is not runnable in spark-shell:
> {code:java}
> scala> val df = spark
> df: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@74e7d97b
>
> scala>   .read
> <console>:1: error: illegal start of definition
>   .read
>   ^
>
> scala>   .json(spark.sparkContext.parallelize(deletes, 2))
> <console>:1: error: illegal start of definition
>   .json(spark.sparkContext.parallelize(deletes, 2))
>   ^
> {code}
> The dot should be placed at the end of the preceding line (or a "\" added at the end of each line), so that spark-shell parses the snippet as a single expression.
>  
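
A paste-friendly form of the same snippet, as a minimal sketch (here `deletes` is assumed to be the collection of delete records built earlier in the Quick-Start demo):

{code:java}
// Keeping the dot at the end of each line lets spark-shell treat this as one expression
val df = spark.
  read.
  json(spark.sparkContext.parallelize(deletes, 2))
{code}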



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide

2020-08-24 Thread wangxianghu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183721#comment-17183721
 ] 

wangxianghu commented on HUDI-1103:
---

fixed via asf-site branch : 1d30581ef8ddfa08a56b2a70f29370b927c649ca

> Improve the code format of Delete data demo in Quick-Start Guide
> 
>
> Key: HUDI-1103
> URL: https://issues.apache.org/jira/browse/HUDI-1103
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: wangxianghu
>Assignee: Trevorzhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Currently, the delete data demo code is not runnable in spark-shell:
> {code:java}
> scala> val df = spark
> df: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@74e7d97b
>
> scala>   .read
> <console>:1: error: illegal start of definition
>   .read
>   ^
>
> scala>   .json(spark.sparkContext.parallelize(deletes, 2))
> <console>:1: error: illegal start of definition
>   .json(spark.sparkContext.parallelize(deletes, 2))
>   ^
> {code}
> The dot should be placed at the end of the preceding line (or a "\" added at the end of each line), so that spark-shell parses the snippet as a single expression.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide

2020-08-24 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1103:
--
Status: Open  (was: New)

> Improve the code format of Delete data demo in Quick-Start Guide
> 
>
> Key: HUDI-1103
> URL: https://issues.apache.org/jira/browse/HUDI-1103
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: wangxianghu
>Assignee: Trevorzhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Currently, the delete data demo code is not runnable in spark-shell:
> {code:java}
> scala> val df = spark
> df: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@74e7d97b
>
> scala>   .read
> <console>:1: error: illegal start of definition
>   .read
>   ^
>
> scala>   .json(spark.sparkContext.parallelize(deletes, 2))
> <console>:1: error: illegal start of definition
>   .json(spark.sparkContext.parallelize(deletes, 2))
>   ^
> {code}
> The dot should be placed at the end of the preceding line (or a "\" added at the end of each line), so that spark-shell parses the snippet as a single expression.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1214) Need ability to set deltastreamer checkpoints when doing Spark datasource writes

2020-08-24 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183706#comment-17183706
 ] 

Balaji Varadarajan commented on HUDI-1214:
--

[~Trevorzhang] : Go for it

> Need ability to set deltastreamer checkpoints when doing Spark datasource 
> writes
> 
>
> Key: HUDI-1214
> URL: https://issues.apache.org/jira/browse/HUDI-1214
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Such support is needed for bootstrapping cases where users use a Spark write to
> do the initial bootstrap and then subsequently use DeltaStreamer.
> DeltaStreamer manages checkpoints inside hoodie commit files and expects
> checkpoints in the previously committed metadata. Users are expected to pass a
> checkpoint or an initial checkpoint provider when performing the bootstrap through
> DeltaStreamer. Such support is not present when bootstrapping using the Spark
> Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1201) HoodieDeltaStreamer: Allow user overrides to read from earliest kafka offset when commit files do not have checkpoint

2020-08-24 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183707#comment-17183707
 ] 

Balaji Varadarajan commented on HUDI-1201:
--

[~Trevorzhang] : Go for it

> HoodieDeltaStreamer: Allow user overrides to read from earliest kafka offset 
> when commit files do not have checkpoint
> -
>
> Key: HUDI-1201
> URL: https://issues.apache.org/jira/browse/HUDI-1201
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> [https://github.com/apache/hudi/issues/1985]
>  
> It would be easier for the user to just tell DeltaStreamer to read from the
> earliest offset instead of implementing -initial-checkpoint-provider or
> passing raw Kafka checkpoints when the table was initially bootstrapped
> through spark.write().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar commented on issue #1979: [SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?

2020-08-24 Thread GitBox


bvaradar commented on issue #1979:
URL: https://github.com/apache/hudi/issues/1979#issuecomment-679522323


   @hughfdjackson : In general, getting an incremental read to discard duplicates 
is not possible for MOR table types, as we defer the merging of records to 
compaction.
   
   I was thinking about an alternate way to achieve your use-case for COW tables 
by using an application-level boolean flag. Let me know if this makes sense:
   
   1. Introduce an additional boolean column "changed". Its default value is false.
   2. Plug in your own implementation of HoodieRecordPayload.
   3(a) In HoodieRecordPayload.getInsertValue(), return an Avro record with 
changed = true. This function is called the first time a new record is inserted.
   3(b) In HoodieRecordPayload.combineAndGetUpdateValue(), if you determine 
there is no material change, set changed = false; otherwise set it to true.
   
   In your incremental query, add the filter changed = true to discard records 
without material changes (see the sketch below).
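
A minimal sketch (Scala, spark-shell) of the incremental query in the last step, assuming a COW table at an illustrative base path and an illustrative begin instant time; "changed" is the application-level flag described above:

```
import org.apache.hudi.DataSourceReadOptions

// "20200820000000" and the base path below are illustrative values
val incremental = spark.read.format("org.apache.hudi").
  option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20200820000000").
  load("/path/to/hudi_cow_table")

// keep only rows where the payload recorded a material change
val materiallyChanged = incremental.filter("changed = true")
```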



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2007: [SUPPORT] Is Timeline metadata queryable ?

2020-08-24 Thread GitBox


bvaradar commented on issue #2007:
URL: https://github.com/apache/hudi/issues/2007#issuecomment-679497286


   @ashishmgofficial : If you use DeltaStreamer, it comes with Kafka integration 
and manages checkpoints internally, so there is no need to query timeline 
metadata separately. Do you have a specific requirement where you need to look 
at the timeline?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Trevor-zhang commented on a change in pull request #2021: [HUDI-1218] Introduce BulkInsertSortMode as Independent class

2020-08-24 Thread GitBox


Trevor-zhang commented on a change in pull request #2021:
URL: https://github.com/apache/hudi/pull/2021#discussion_r476102080



##
File path: 
hudi-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java
##
@@ -39,10 +39,4 @@ public static BulkInsertPartitioner get(BulkInsertSortMode 
sortMode) {
 throw new HoodieException("The bulk insert mode \"" + sortMode.name() 
+ "\" is not supported.");

Review comment:
   > `The bulk insert sort mode `? better?
   I think so. I have changed it.
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2021: [HUDI-1218]Introduce BulkInsertSortMode as Independent class

2020-08-24 Thread GitBox


yanghua commented on a change in pull request #2021:
URL: https://github.com/apache/hudi/pull/2021#discussion_r476098066



##
File path: 
hudi-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java
##
@@ -39,10 +39,4 @@ public static BulkInsertPartitioner get(BulkInsertSortMode 
sortMode) {
 throw new HoodieException("The bulk insert mode \"" + sortMode.name() 
+ "\" is not supported.");

Review comment:
   `The bulk insert sort mode `? better?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] dm-tran commented on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"

2020-08-24 Thread GitBox


dm-tran commented on issue #2020:
URL: https://github.com/apache/hudi/issues/2020#issuecomment-679481632


   Thank you for your answer @bvaradar !
   
   > Can you please add the details of "commit showfiles --commit 
20200821153748"
   
   ```
   
   Partition Path    | File ID                                | Previous Commit | Total Records Updated | Total Records Written | Total Bytes Written | Total Errors | File Size
   daas_date=2020-04 | 63bacea1-d6af-4ce0-8dc8-6ce9db8df332-0 | 20200821152906  | 212 | 534115 | 22998619 | 0 | 22998619
   daas_date=2020-04 | 9c5e022c-feda-4059-84f6-752344cea4a9-0 | 20200821152906  | 89  | 460341 | 18755115 | 0 | 18755115
   daas_date=2020-04 | 80be527b-eda7-42f3-8565-c15e9447d731-0 | 20200821152906  | 39  | 192455 | 9112346  | 0 | 9112346
   daas_date=2020-04 | 569b2555-5cd6-416a-b7d7-11897603a1e3-0 | 20200821152906  | 3   | 483483 | 19114286 | 0 | 19114286
   daas_date=2020-05 | 80be527b-eda7-42f3-8565-c15e9447d731-1 | 20200821152906  | 106 | 302728 | 13385764 | 0 | 13385764
   daas_date=2020-05 | 27da8cb6-e4b7-4c29-904b-25d3ba321d0a-0 | 20200821152906  | 84  | 482538 | 19568311 | 0 | 19568311
   daas_date=2020-05 | 0c376059-0279-4967-8002-70c3cd9c6b8e-0 | 20200821152906  | 84  | 498131 | 21751990 | 0 | 21751990
   daas_date=2020-05 | 9730fe61-5584-4156-b25c-8c8ef41583f4-0 | 20200821152906  | 76  | 500352 | 19812831 | 0 | 19812831
   daas_date=2020-05 | c2c3fb95-3e58-4021-80c4-7e48aace8dda-0 | 20200821152906  | 72  | 484533 | 21001957 | 0 | 21001957
   daas_date=2020-05 | ec2ba5dc-7dd7-4cc7-93cd-1358476a124f-0 | 20200821152906  | 61  | 509569 | 21960018 | 0 | 21960018
   daas_date=2020-05 | bd54d7bb-2fb7-475f-8ca2-47594a1c3206-0 | 20200821152906  | 46  | 342451 | 14678548 | 0 | 14678548
   daas_date=2019-05 | a0d89a7a-0621-469a-8359-c4c4b8948ff5-1 | 20200821152906  | 3   | 445248 | 16992382 | 0 | 16992382
   

[GitHub] [hudi] n3nash commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


n3nash commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r476026846



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * TimelineUtils provides a common way to query incremental meta-data changes 
for a hoodie table.
+ *
+ * This is useful in multiple places including:
+ * 1) HiveSync - this can be used to query partitions that changed since 
previous sync.
+ * 2) Incremental reads - InputFormats can use this API to queryxw
+ */
+public class TimelineUtils {
+
+  /**
+   * Returns partitions that have new data strictly after commitTime.
+   * Does not include internal operations such as clean in the timeline.
+   */
+  public static List<String> getPartitionsWritten(HoodieTimeline timeline) {
+HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline();
+return getPartitionsMutated(timelineToSync);
+  }
+
+  /**
+   * Returns partitions that have been modified including internal operations 
such as clean in the passed timeline.
+   */
+  public static List<String> getPartitionsMutated(HoodieTimeline timeline) {

Review comment:
   Sure, getAffectedPartitions is fine @satishkotha. I think initially it 
was getWrittenPartitions or something similar.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


n3nash commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r476026846



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * TimelineUtils provides a common way to query incremental meta-data changes 
for a hoodie table.
+ *
+ * This is useful in multiple places including:
+ * 1) HiveSync - this can be used to query partitions that changed since 
previous sync.
+ * 2) Incremental reads - InputFormats can use this API to queryxw
+ */
+public class TimelineUtils {
+
+  /**
+   * Returns partitions that have new data strictly after commitTime.
+   * Does not include internal operations such as clean in the timeline.
+   */
+  public static List<String> getPartitionsWritten(HoodieTimeline timeline) {
+HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline();
+return getPartitionsMutated(timelineToSync);
+  }
+
+  /**
+   * Returns partitions that have been modified including internal operations 
such as clean in the passed timeline.
+   */
+  public static List<String> getPartitionsMutated(HoodieTimeline timeline) {

Review comment:
   Sure, getAffectedPartitions is fine.. @satishkotha 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1135] Make timeline server timeout settings configurable.

2020-08-24 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 218d4a6  [HUDI-1135] Make timeline server timeout settings 
configurable.
218d4a6 is described below

commit 218d4a6836ce86216d289fa5fbb389871ad7ba0f
Author: Prashant Wason 
AuthorDate: Mon Aug 24 14:10:48 2020 -0700

[HUDI-1135] Make timeline server timeout settings configurable.
---
 .../hudi/common/table/view/FileSystemViewManager.java |  5 +++--
 .../common/table/view/FileSystemViewStorageConfig.java| 15 ++-
 .../table/view/RemoteHoodieTableFileSystemView.java   |  8 +++-
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
index c4ab712..d310181 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
@@ -169,9 +169,10 @@ public class FileSystemViewManager {
   private static RemoteHoodieTableFileSystemView 
createRemoteFileSystemView(SerializableConfiguration conf,
   FileSystemViewStorageConfig viewConf, HoodieTableMetaClient metaClient) {
 LOG.info("Creating remote view for basePath " + metaClient.getBasePath() + 
". Server="
-+ viewConf.getRemoteViewServerHost() + ":" + 
viewConf.getRemoteViewServerPort());
++ viewConf.getRemoteViewServerHost() + ":" + 
viewConf.getRemoteViewServerPort() + ", Timeout="
++ viewConf.getRemoteTimelineClientTimeoutSecs());
 return new 
RemoteHoodieTableFileSystemView(viewConf.getRemoteViewServerHost(), 
viewConf.getRemoteViewServerPort(),
-metaClient);
+metaClient, viewConf.getRemoteTimelineClientTimeoutSecs());
   }
 
   /**
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java
index 434f873..603f88f 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java
@@ -44,6 +44,8 @@ public class FileSystemViewStorageConfig extends 
DefaultHoodieConfig {
   public static final String FILESYSTEM_VIEW_BOOTSTRAP_BASE_FILE_FRACTION =
   "hoodie.filesystem.view.spillable.bootstrap.base.file.mem.fraction";
   private static final String ROCKSDB_BASE_PATH_PROP = 
"hoodie.filesystem.view.rocksdb.base.path";
+  public static final String FILESTYSTEM_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS =
+  "hoodie.filesystem.view.remote.timeout.secs";
 
   public static final FileSystemViewStorageType DEFAULT_VIEW_STORAGE_TYPE = 
FileSystemViewStorageType.MEMORY;
   public static final FileSystemViewStorageType 
DEFAULT_SECONDARY_VIEW_STORAGE_TYPE = FileSystemViewStorageType.MEMORY;
@@ -52,7 +54,7 @@ public class FileSystemViewStorageConfig extends 
DefaultHoodieConfig {
   public static final String DEFAULT_FILESYSTEM_VIEW_INCREMENTAL_SYNC_MODE = 
"false";
   public static final String DEFUALT_REMOTE_VIEW_SERVER_HOST = "localhost";
   public static final Integer DEFAULT_REMOTE_VIEW_SERVER_PORT = 26754;
-
+  public static final Integer DEFAULT_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS = 5 
* 60; // 5 min
   public static final String DEFAULT_VIEW_SPILLABLE_DIR = "/tmp/view_map/";
   private static final Double DEFAULT_MEM_FRACTION_FOR_PENDING_COMPACTION = 
0.01;
   private static final Double DEFAULT_MEM_FRACTION_FOR_EXTERNAL_DATA_FILE = 
0.05;
@@ -91,6 +93,10 @@ public class FileSystemViewStorageConfig extends 
DefaultHoodieConfig {
 return Integer.parseInt(props.getProperty(FILESYSTEM_VIEW_REMOTE_PORT));
   }
 
+  public Integer getRemoteTimelineClientTimeoutSecs() {
+return 
Integer.parseInt(props.getProperty(FILESTYSTEM_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS));
+  }
+
   public long getMaxMemoryForFileGroupMap() {
 long totalMemory = 
Long.parseLong(props.getProperty(FILESYSTEM_VIEW_SPILLABLE_MEM));
 return totalMemory - getMaxMemoryForPendingCompaction() - 
getMaxMemoryForBootstrapBaseFile();
@@ -175,6 +181,11 @@ public class FileSystemViewStorageConfig extends 
DefaultHoodieConfig {
   return this;
 }
 
+public Builder withRemoteTimelineClientTimeoutSecs(Long 
timelineClientTimeoutSecs) {
+  props.setProperty(FILESTYSTEM_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS, 
timelineClientTimeoutSecs.toString());
+  return this;
+}
+
 public Builder withMemFractionForPendingCompaction(Double 
memFractionForPendingCompaction) {
   props.setProperty(FILESYSTEM_VIEW_PENDING_COMPACTION_MEM_FRACTION, 

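A hedged usage sketch (Scala) of the new setting introduced by the commit above; FileSystemViewStorageConfig.newBuilder() is assumed available and the 600-second value is illustrative:

```
import org.apache.hudi.common.table.view.FileSystemViewStorageConfig

// Raise the remote timeline client timeout from the default 5 minutes to 10 minutes
val viewConf = FileSystemViewStorageConfig.newBuilder().
  withRemoteTimelineClientTimeoutSecs(600L).
  build()
```
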
[GitHub] [hudi] n3nash merged pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.

2020-08-24 Thread GitBox


n3nash merged pull request #2026:
URL: https://github.com/apache/hudi/pull/2026


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash merged pull request #2027: [HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class.

2020-08-24 Thread GitBox


n3nash merged pull request #2027:
URL: https://github.com/apache/hudi/pull/2027


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class.

2020-08-24 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9b1f16b  [HUDI-1136] Add back findInstantsAfterOrEquals to the 
HoodieTimeline class.
9b1f16b is described below

commit 9b1f16b604143f5a6926db57173f921fbb6c
Author: Prashant Wason 
AuthorDate: Mon Aug 24 14:24:50 2020 -0700

[HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class.
---
 .../apache/hudi/common/table/timeline/HoodieDefaultTimeline.java   | 7 +++
 .../java/org/apache/hudi/common/table/timeline/HoodieTimeline.java | 5 +
 2 files changed, 12 insertions(+)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
index c7a6230..678d056 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
@@ -133,6 +133,13 @@ public class HoodieDefaultTimeline implements 
HoodieTimeline {
   }
 
   @Override
+  public HoodieDefaultTimeline findInstantsAfterOrEquals(String commitTime, 
int numCommits) {
+return new HoodieDefaultTimeline(instants.stream()
+.filter(s -> HoodieTimeline.compareTimestamps(s.getTimestamp(), 
GREATER_THAN_OR_EQUALS, commitTime))
+.limit(numCommits), details);
+  }
+
+  @Override
   public HoodieDefaultTimeline findInstantsBefore(String instantTime) {
 return new HoodieDefaultTimeline(instants.stream()
 .filter(s -> HoodieTimeline.compareTimestamps(s.getTimestamp(), 
LESSER_THAN, instantTime)),
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
index 45b9e34..b7c405e 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
@@ -141,6 +141,11 @@ public interface HoodieTimeline extends Serializable {
   HoodieTimeline filterPendingCompactionTimeline();
 
   /**
+   * Create a new Timeline with all the instants after startTs.
+   */
+  HoodieTimeline findInstantsAfterOrEquals(String commitTime, int numCommits);
+
+  /**
* Create a new Timeline with instants after startTs and before or on endTs.
*/
   HoodieTimeline findInstantsInRange(String startTs, String endTs);
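
A hedged usage sketch (Scala, e.g. from spark-shell) of the method added above; the base path and instant time are illustrative:

```
import org.apache.hudi.common.table.HoodieTableMetaClient

val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, "/path/to/hudi_table")

// Up to 10 completed commit instants at or after the given instant time
val recent = metaClient.getActiveTimeline.
  getCommitsTimeline.
  filterCompletedInstants().
  findInstantsAfterOrEquals("20200801000000", 10)
```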



[GitHub] [hudi] n3nash commented on pull request #2030: [HUDI-1130] hudi-test-suite support for schema evolution (can be trig…

2020-08-24 Thread GitBox


n3nash commented on pull request #2030:
URL: https://github.com/apache/hudi/pull/2030#issuecomment-679429416


   @vinothchandar Yes, that PR will follow later today from @modi95. We will 
merge this one after that.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


satishkotha commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475962559



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * TimelineUtils provides a common way to query incremental meta-data changes 
for a hoodie table.
+ *
+ * This is useful in multiple places including:
+ * 1) HiveSync - this can be used to query partitions that changed since 
previous sync.
+ * 2) Incremental reads - InputFormats can use this API to queryxw
+ */
+public class TimelineUtils {
+
+  /**
+   * Returns partitions that have new data strictly after commitTime.
+   * Does not include internal operations such as clean in the timeline.
+   */
+  public static List<String> getPartitionsWritten(HoodieTimeline timeline) {
+HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline();
+return getPartitionsMutated(timelineToSync);
+  }
+
+  /**
+   * Returns partitions that have been modified including internal operations 
such as clean in the passed timeline.
+   */
+  public static List<String> getPartitionsMutated(HoodieTimeline timeline) {
+return timeline.filterCompletedInstants().getInstants().flatMap(s -> {
+  switch (s.getAction()) {
+case HoodieTimeline.COMMIT_ACTION:
+case HoodieTimeline.DELTA_COMMIT_ACTION:
+  try {
+HoodieCommitMetadata commitMetadata = 
HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(s).get(), 
HoodieCommitMetadata.class);
+return commitMetadata.getPartitionToWriteStats().keySet().stream();
+  } catch (IOException e) {
+throw new HoodieIOException("Failed to get partitions written 
between " + timeline.firstInstant() + " " + timeline.lastInstant(), e);
+  }
+case HoodieTimeline.CLEAN_ACTION:
+  try {
+HoodieCleanMetadata cleanMetadata = 
TimelineMetadataUtils.deserializeHoodieCleanMetadata(timeline.getInstantDetails(s).get());
+return cleanMetadata.getPartitionMetadata().keySet().stream();
+  } catch (IOException e) {
+throw new HoodieIOException("Failed to get partitions cleaned 
between " + timeline.firstInstant() + " " + timeline.lastInstant(), e);
+  }
+case HoodieTimeline.COMPACTION_ACTION:
+  // compaction is not a completed instant.  So no need to consider 
this action.
+case HoodieTimeline.SAVEPOINT_ACTION:
+case HoodieTimeline.ROLLBACK_ACTION:
+case HoodieTimeline.RESTORE_ACTION:
+  return Stream.empty();

Review comment:
   Compaction is not treated as a completed instant, so this case can be 
ignored. I implemented all the other actions and added tests.
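
For context, a hedged usage sketch (Scala) of the TimelineUtils API quoted above, as proposed in this PR; the base path and instant time are illustrative:

```
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.hudi.common.table.timeline.TimelineUtils

val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, "/path/to/hudi_table")

// Partitions with new data since the given instant, e.g. for an incremental Hive sync
val changedPartitions = TimelineUtils.getPartitionsWritten(
  metaClient.getActiveTimeline.findInstantsAfter("20200801000000", Integer.MAX_VALUE))
```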





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha merged pull request #2028: Site update and release page for 0.6.0

2020-08-24 Thread GitBox


bhasudha merged pull request #2028:
URL: https://github.com/apache/hudi/pull/2028


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


satishkotha commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475961204



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * TimelineUtils provides a common way to query incremental meta-data changes 
for a hoodie table.
+ *
+ * This is useful in multiple places including:
+ * 1) HiveSync - this can be used to query partitions that changed since 
previous sync.
+ * 2) Incremental reads - InputFormats can use this API to queryxw
+ */
+public class TimelineUtils {
+
+  /**
+   * Returns partitions that have new data strictly after commitTime.
+   * Does not include internal operations such as clean in the timeline.
+   */
+  public static List<String> getPartitionsWritten(HoodieTimeline timeline) {
+HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline();
+return getPartitionsMutated(timelineToSync);
+  }
+
+  /**
+   * Returns partitions that have been modified including internal operations 
such as clean in the passed timeline.
+   */
+  public static List<String> getPartitionsMutated(HoodieTimeline timeline) {

Review comment:
   Yes, this is for internal use. I have no strong opinion on the name. I think 
it was initially getAffectedPartitions and was changed to getPartitionsMutated 
because of @n3nash's suggestion. I'm fine with going back if he agrees.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * TimelineUtils provides a common way to query incremental meta-data changes 
for a hoodie table.
+ *
+ * This is useful in multiple places including:
+ * 1) HiveSync - this can be used to query partitions that changed since 
previous sync.
+ * 2) Incremental reads - InputFormats can use this API to queryxw

Review comment:
   thank you! fixed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2030: [HUDI-1130] hudi-test-suite support for schema evolution (can be trig…

2020-08-24 Thread GitBox


vinothchandar commented on pull request #2030:
URL: https://github.com/apache/hudi/pull/2030#issuecomment-679411848


   can we first make the test-suite tests work on master and run in CI, before 
we merge more features? cc @n3nash 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1130) Allow for schema evolution within DAG for hudi test suite

2020-08-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1130:
-
Labels: pull-request-available  (was: )

> Allow for schema evolution within DAG for hudi test suite
> -
>
> Key: HUDI-1130
> URL: https://issues.apache.org/jira/browse/HUDI-1130
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nbalajee opened a new pull request #2030: [HUDI-1130] hudi-test-suite support for schema evolution (can be trig…

2020-08-24 Thread GitBox


nbalajee opened a new pull request #2030:
URL: https://github.com/apache/hudi/pull/2030


   …gered on any insert/upsert DAG node).
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   Each insert/upsert dag node in the hudi test-suite can specify a schema to 
be used for the node.   DAG execution context
   is reinitialized with the new/evolved schema.  This allows verification of 
schema evolution scenarios.
   
   ## Brief change log
 - Modify the test suite to accept a "reinitialize_context" flag and a new 
schema file.
 - Reinitialize the writer context as part of DAG node execution (the 
remaining nodes will use the updated schema).
   
   ## Verify this pull request
   
   Verified using the hudi-test-suite.
   
   This change added tests and can be verified as follows:
   
   - Launched hudi test suite with the following:
   ```
   insert_1:
 config:
   record_size: 7000
   num_partitions_insert: 1
   repeat_count: 5
   num_records_insert: 10
   reinitialize_context: true
   hoodie.deltastreamer.schemaprovider.source.schema.file: 
"file:///tmp/evolved.avsc"
   ```
   
   ## Committer checklist
   
 - [x] Has a corresponding JIRA in PR title & commit

 - [x] Commit message is descriptive of the change

 - [x] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475940809



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieHFileRealtimeInputFormat.java
##
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.realtime;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.stream.Stream;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hudi.common.table.timeline.HoodieDefaultTimeline;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.hadoop.HoodieHFileInputFormat;
+import org.apache.hudi.hadoop.UseFileSplitsFromInputFormat;
+import org.apache.hudi.hadoop.UseRecordReaderFromInputFormat;
+import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+/**
+ * HoodieRealtimeInputFormat for HUDI datasets which store data in HFile base 
file format.
+ */
+@UseRecordReaderFromInputFormat
+@UseFileSplitsFromInputFormat
+public class HoodieHFileRealtimeInputFormat extends HoodieHFileInputFormat {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieHFileRealtimeInputFormat.class);
+
+  @Override
+  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException 
{
+    Stream<FileSplit> fileSplits = Arrays.stream(super.getSplits(job, numSplits)).map(is -> (FileSplit) is);
+    return HoodieRealtimeInputFormatUtils.getRealtimeSplits(job, fileSplits);
+  }
+
+  @Override
+  public FileStatus[] listStatus(JobConf job) throws IOException {
+// Call the HoodieInputFormat::listStatus to obtain all latest hfiles, 
based on commit timeline.
+return super.listStatus(job);
+  }
+
+  @Override
+  protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline 
timeline) {
+// no specific filtering for Realtime format
+return timeline;
+  }
+
+  @Override
+  public RecordReader<NullWritable, ArrayWritable> getRecordReader(final InputSplit split, final JobConf jobConf,
+      final Reporter reporter) throws IOException {
+// Hive on Spark invokes multiple getRecordReaders from different threads 
in the same spark task (and hence the

Review comment:
   we need to share code somehow. This same large comment need not be in 
multiple places





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-24 Thread GitBox


umehrot2 commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-679408763


   @rubenssoto yes, EMR Presto is currently on 0.232, but upcoming releases will 
ship later Presto versions where you will be able to use this patch.
   
   If you want to give it a shot manually on the current EMR version, you can try 
building Presto 0.233 and, I believe, replacing the presto-hive jar on all nodes 
of the cluster, then restarting presto-server on all nodes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475940393



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieHFileInputFormat.java
##
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop;
+
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.mapred.FileInputFormat;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieDefaultTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.hadoop.utils.HoodieHiveUtils;
+import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * HoodieInputFormat for HUDI datasets which store data in HFile base file 
format.
+ */
+@UseFileSplitsFromInputFormat
+public class HoodieHFileInputFormat extends FileInputFormat<NullWritable, ArrayWritable> implements Configurable {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieHFileInputFormat.class);
+
+  protected Configuration conf;
+
+  protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline 
timeline) {
+return HoodieInputFormatUtils.filterInstantsTimeline(timeline);
+  }
+
+  @Override
+  public FileStatus[] listStatus(JobConf job) throws IOException {

Review comment:
   While this is true, we can use more helpers and avoid copying to a large 
degree? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475939558



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
##
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.log.block;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieHFileReader;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import org.apache.avro.Schema;
+import org.apache.avro.Schema.Field;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.compress.Compression;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.util.Pair;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.TreeMap;
+import java.util.stream.Collectors;
+
+import javax.annotation.Nonnull;
+
+/**
+ * HoodieHFileDataBlock contains a list of records stored inside an HFile 
format. It is used with the HFile
+ * base file format.
+ */
+public class HoodieHFileDataBlock extends HoodieDataBlock {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieHFileDataBlock.class);
+  private static Compression.Algorithm compressionAlgorithm = 
Compression.Algorithm.GZ;
+  private static int blockSize = 1 * 1024 * 1024;
+
+  public HoodieHFileDataBlock(@Nonnull Map<HeaderMetadataType, String> logBlockHeader,
+      @Nonnull Map<HeaderMetadataType, String> logBlockFooter,
+      @Nonnull Option<HoodieLogBlockContentLocation> blockContentLocation, @Nonnull Option<byte[]> content,
+      FSDataInputStream inputStream, boolean readBlockLazily) {
+super(logBlockHeader, logBlockFooter, blockContentLocation, content, 
inputStream, readBlockLazily);
+  }
+
+  public HoodieHFileDataBlock(HoodieLogFile logFile, FSDataInputStream 
inputStream, Option content,
+   boolean readBlockLazily, long position, long blockSize, long 
blockEndpos, Schema readerSchema,
+   Map header, Map 
footer) {
+super(content, inputStream, readBlockLazily,
+  Option.of(new HoodieLogBlockContentLocation(logFile, position, 
blockSize, blockEndpos)), readerSchema, header,
+  footer);
+  }
+
+  public HoodieHFileDataBlock(@Nonnull List records, @Nonnull 
Map header) {
+super(records, header, new HashMap<>());
+  }
+
+  @Override
+  public HoodieLogBlockType getBlockType() {
+return HoodieLogBlockType.HFILE_DATA_BLOCK;
+  }
+
+  @Override
+  protected byte[] serializeRecords() throws IOException {
+HFileContext context = new 
HFileContextBuilder().withBlockSize(blockSize).withCompression(compressionAlgorithm)
+.build();
+Configuration conf = new Configuration();
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Serialize records into bytes
+Map recordMap = new TreeMap<>();
+Iterator itr = records.iterator();
+boolean useIntegerKey = false;
+int key = 0;
+int keySize = 0;
+Field keyField = 
records.get(0).getSchema().getField(HoodieRecord.RECORD_KEY_METADATA_FIELD);
+if (keyField == null) {
+  // Missing key metadata field so we should use an integer sequence key
+  useIntegerKey = true;
+  keySize = (int) Math.ceil(Math.log(records.size())) + 1;
+}
+while 

[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475939229



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
##
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.log.block;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieHFileReader;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import org.apache.avro.Schema;
+import org.apache.avro.Schema.Field;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.compress.Compression;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.util.Pair;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.TreeMap;
+import java.util.stream.Collectors;
+
+import javax.annotation.Nonnull;
+
+/**
+ * HoodieHFileDataBlock contains a list of records stored inside an HFile 
format. It is used with the HFile
+ * base file format.
+ */
+public class HoodieHFileDataBlock extends HoodieDataBlock {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieHFileDataBlock.class);
+  private static Compression.Algorithm compressionAlgorithm = 
Compression.Algorithm.GZ;
+  private static int blockSize = 1 * 1024 * 1024;
+
+  public HoodieHFileDataBlock(@Nonnull Map 
logBlockHeader,
+   @Nonnull Map logBlockFooter,
+   @Nonnull Option blockContentLocation, 
@Nonnull Option content,
+   FSDataInputStream inputStream, boolean readBlockLazily) {
+super(logBlockHeader, logBlockFooter, blockContentLocation, content, 
inputStream, readBlockLazily);
+  }
+
+  public HoodieHFileDataBlock(HoodieLogFile logFile, FSDataInputStream 
inputStream, Option content,
+   boolean readBlockLazily, long position, long blockSize, long 
blockEndpos, Schema readerSchema,
+   Map header, Map 
footer) {
+super(content, inputStream, readBlockLazily,
+  Option.of(new HoodieLogBlockContentLocation(logFile, position, 
blockSize, blockEndpos)), readerSchema, header,
+  footer);
+  }
+
+  public HoodieHFileDataBlock(@Nonnull List records, @Nonnull 
Map header) {
+super(records, header, new HashMap<>());
+  }
+
+  @Override
+  public HoodieLogBlockType getBlockType() {
+return HoodieLogBlockType.HFILE_DATA_BLOCK;
+  }
+
+  @Override
+  protected byte[] serializeRecords() throws IOException {
+HFileContext context = new 
HFileContextBuilder().withBlockSize(blockSize).withCompression(compressionAlgorithm)
+.build();
+Configuration conf = new Configuration();
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Serialize records into bytes
+Map recordMap = new TreeMap<>();
+Iterator itr = records.iterator();
+boolean useIntegerKey = false;
+int key = 0;
+int keySize = 0;
+Field keyField = 
records.get(0).getSchema().getField(HoodieRecord.RECORD_KEY_METADATA_FIELD);
+if (keyField == null) {
+  // Missing key metadata field so we should use an integer sequence key
+  useIntegerKey = true;
+  keySize = (int) Math.ceil(Math.log(records.size())) + 1;
+}
+while 

[GitHub] [hudi] s-sanjay commented on issue #1895: HUDI Dataset backed by Hive Metastore fails on Presto with Unknown converted type TIMESTAMP_MICROS

2020-08-24 Thread GitBox


s-sanjay commented on issue #1895:
URL: https://github.com/apache/hudi/issues/1895#issuecomment-679406205


   @FelixKJose I have raised a 
[PR](https://github.com/prestodb/presto/pull/15074)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475937184



##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/storage/HoodieHFileConfig.java
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hbase.io.compress.Compression;
+import org.apache.hudi.common.bloom.BloomFilter;
+
+public class HoodieHFileConfig {
+
+  private Compression.Algorithm compressionAlgorithm;
+  private int blockSize;
+  private long maxFileSize;
+  private Configuration hadoopConf;
+  private BloomFilter bloomFilter;
+
+  public HoodieHFileConfig(Compression.Algorithm compressionAlgorithm, int 
blockSize, long maxFileSize,

Review comment:
   I think it's here: 
https://github.com/nsivabalan/hudi/commit/70ee9947335c1ed0c92a285225ab728adaa7c5ab ? @nsivabalan, can you confirm?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475936227



##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+/**
+ * Hoodie merge handle which writes records (new inserts or updates) sorted by 
their key.
+ *
+ * The implementation performs a merge-sort by comparing the key of the record 
being written to the list of
+ * keys in newRecordKeys (sorted in-memory).
+ */
+public class HoodieSortedMergeHandle extends 
HoodieMergeHandle {
+
+  private Queue newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,
+   Iterator> recordItr, String partitionPath, String 
fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, 
sparkTaskContextSupplier);
+newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,

Review comment:
   So, no code calls this atm ?
   
   The following code in `HoodieMergeOnReadTableCompactor` 
   
   ```
   if (oldDataFileOpt.isPresent()) {
 result = hoodieCopyOnWriteTable.handleUpdate(instantTime, 
operation.getPartitionPath(),
 operation.getFileId(), scanner.getRecords(),
 oldDataFileOpt.get());
   }
   ```
   ends up calling the following. 
   
   ```
protected HoodieMergeHandle getUpdateHandle(String instantTime, String 
partitionPath, String fileId,
 Map> keyToNewRecords, HoodieBaseFile 
dataFileToBeMerged) {
   return new HoodieMergeHandle<>(config, instantTime, this, 
keyToNewRecords,
   partitionPath, fileId, dataFileToBeMerged, 
sparkTaskContextSupplier);
 }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475936227



##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+/**
+ * Hoodie merge handle which writes records (new inserts or updates) sorted by 
their key.
+ *
+ * The implementation performs a merge-sort by comparing the key of the record 
being written to the list of
+ * keys in newRecordKeys (sorted in-memory).
+ */
+public class HoodieSortedMergeHandle extends 
HoodieMergeHandle {
+
+  private Queue newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,
+   Iterator> recordItr, String partitionPath, String 
fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, 
sparkTaskContextSupplier);
+newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,

Review comment:
   So, no code calls this atm ? @prashantwason 
   
   The following code in `HoodieMergeOnReadTableCompactor` 
   
   ```
   if (oldDataFileOpt.isPresent()) {
 result = hoodieCopyOnWriteTable.handleUpdate(instantTime, 
operation.getPartitionPath(),
 operation.getFileId(), scanner.getRecords(),
 oldDataFileOpt.get());
   }
   ```
   ends up calling the following. 
   
   ```
protected HoodieMergeHandle getUpdateHandle(String instantTime, String 
partitionPath, String fileId,
 Map> keyToNewRecords, HoodieBaseFile 
dataFileToBeMerged) {
   return new HoodieMergeHandle<>(config, instantTime, this, 
keyToNewRecords,
   partitionPath, fileId, dataFileToBeMerged, 
sparkTaskContextSupplier);
 }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475935700



##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+/**
+ * Hoodie merge handle which writes records (new inserts or updates) sorted by 
their key.
+ *
+ * The implementation performs a merge-sort by comparing the key of the record 
being written to the list of
+ * keys in newRecordKeys (sorted in-memory).
+ */
+public class HoodieSortedMergeHandle extends 
HoodieMergeHandle {
+
+  private Queue newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,
+   Iterator> recordItr, String partitionPath, String 
fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, 
sparkTaskContextSupplier);
+newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,
+  Map> keyToNewRecordsOrig, String partitionPath, 
String fileId,
+  HoodieBaseFile dataFileToBeMerged, SparkTaskContextSupplier 
sparkTaskContextSupplier) {
+super(config, instantTime, hoodieTable, keyToNewRecordsOrig, 
partitionPath, fileId, dataFileToBeMerged,
+sparkTaskContextSupplier);
+
+newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Go through an old record. Here if we detect a newer version shows up, we 
write the new one to the file.
+   */
+  @Override
+  public void write(GenericRecord oldRecord) {
+String key = 
oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+
+// To maintain overall sorted order across updates and inserts, write any 
new inserts whose keys are less than
+// the oldRecord's key.
+while (!newRecordKeysSorted.isEmpty() && 
newRecordKeysSorted.peek().compareTo(key) <= 0) {
+  String keyToPreWrite = newRecordKeysSorted.remove();

Review comment:
   cc @prashantwason, can you please chime in? I feel we can avoid the queue 
altogether and just sort-merge.
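
   A minimal sketch (not from this PR) of what the queue-free approach could look 
like, assuming both the records in the base file and the incoming record keys are 
already sorted; the names are illustrative only:

   ```
   // Hedged sketch: merge two already-sorted key streams without a PriorityQueue.
   // `oldKeys` (keys read from the base file) and `newKeys` (incoming insert keys)
   // are hypothetical inputs used only to illustrate the idea.
   def mergeSorted(oldKeys: Iterator[String], newKeys: Iterator[String]): Iterator[String] = {
     val o = oldKeys.buffered
     val n = newKeys.buffered
     new Iterator[String] {
       def hasNext: Boolean = o.hasNext || n.hasNext
       def next(): String =
         if (!n.hasNext) o.next()
         else if (!o.hasNext) n.next()
         else if (n.head.compareTo(o.head) <= 0) n.next() // flush inserts first, as write() does
         else o.next()
     }
   }
   ```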





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1980: [SUPPORT] Small files (423KB) generated after running delete query

2020-08-24 Thread GitBox


bvaradar commented on issue #1980:
URL: https://github.com/apache/hudi/issues/1980#issuecomment-679403428


   Yes, this is expected. We retain the penultimate version of each file so that a 
query already running against it does not fail. In this case you might see only one 
version of some files, i.e. the ones that did not receive any deletes; that is also 
expected.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


bvaradar commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475931147



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * TimelineUtils provides a common way to query incremental meta-data changes 
for a hoodie table.
+ *
+ * This is useful in multiple places including:
+ * 1) HiveSync - this can be used to query partitions that changed since 
previous sync.
+ * 2) Incremental reads - InputFormats can use this API to queryxw

Review comment:
   typo : queryxw -> query

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * TimelineUtils provides a common way to query incremental meta-data changes 
for a hoodie table.
+ *
+ * This is useful in multiple places including:
+ * 1) HiveSync - this can be used to query partitions that changed since 
previous sync.
+ * 2) Incremental reads - InputFormats can use this API to queryxw
+ */
+public class TimelineUtils {
+
+  /**
+   * Returns partitions that have new data strictly after commitTime.
+   * Does not include internal operations such as clean in the timeline.
+   */
+  public static List getPartitionsWritten(HoodieTimeline timeline) {
+HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline();
+return getPartitionsMutated(timelineToSync);
+  }
+
+  /**
+   * Returns partitions that have been modified including internal operations 
such as clean in the passed timeline.
+   */
+  public static List getPartitionsMutated(HoodieTimeline timeline) {

Review comment:
   I am assuming you are planning to use this API internally, right?
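
   For context, a minimal usage sketch of the two helpers quoted above, assuming 
they land as shown; the metaClient value and the sync timestamp are illustrative only:

   ```
   // Hedged sketch: list partitions touched since a (hypothetical) last-synced instant.
   import org.apache.hudi.common.table.timeline.TimelineUtils

   val lastSyncTime = "20200824000000" // illustrative instant time
   val written = TimelineUtils.getPartitionsWritten(
     metaClient.getActiveTimeline.findInstantsAfter(lastSyncTime, Integer.MAX_VALUE))
   val mutated = TimelineUtils.getPartitionsMutated(
     metaClient.getActiveTimeline.findInstantsAfter(lastSyncTime, Integer.MAX_VALUE))
   ```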

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific 

[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475932748



##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java
##
@@ -55,7 +56,7 @@
   private long recordsWritten = 0;
   private long insertRecordsWritten = 0;
   private long recordsDeleted = 0;
-  private Iterator> recordIterator;
+  private Map> recordMap;

Review comment:
   Okay, understood this better. Looks good for now.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.

2020-08-24 Thread GitBox


prashantwason commented on pull request #2026:
URL: https://github.com/apache/hudi/pull/2026#issuecomment-679401505


   @n3nash corrected the errors.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on a change in pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.

2020-08-24 Thread GitBox


prashantwason commented on a change in pull request #2026:
URL: https://github.com/apache/hudi/pull/2026#discussion_r475932300



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java
##
@@ -147,7 +153,7 @@ public RemoteHoodieTableFileSystemView(String server, int 
port, HoodieTableMetaC
 String url = builder.toString();
 LOG.info("Sending request : (" + url + ")");
 Response response;
-int timeout = 1000 * 300; // 5 min timeout
+int timeout = this.timeoutSecs * 1000; // msec

Review comment:
   Yes, msec.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java
##
@@ -52,7 +54,7 @@
   public static final String DEFAULT_FILESYSTEM_VIEW_INCREMENTAL_SYNC_MODE = 
"false";
   public static final String DEFUALT_REMOTE_VIEW_SERVER_HOST = "localhost";
   public static final Integer DEFAULT_REMOTE_VIEW_SERVER_PORT = 26754;
-
+  public static final Integer DEFAULT_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS = 
300 * 60; // 5 min

Review comment:
   Corrected.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] modi95 commented on a change in pull request #1975: [HUDI-1194][WIP] Reorganize HoodieHiveClient based on the way to call Hive API

2020-08-24 Thread GitBox


modi95 commented on a change in pull request #1975:
URL: https://github.com/apache/hudi/pull/1975#discussion_r475932170



##
File path: pom.xml
##
@@ -94,7 +94,7 @@
 2.9.9
 2.7.3
 org.apache.hive
-2.3.1
+2.3.6

Review comment:
   Why are we updating Hive? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] modi95 commented on a change in pull request #1975: [HUDI-1194][WIP] Reorganize HoodieHiveClient based on the way to call Hive API

2020-08-24 Thread GitBox


modi95 commented on a change in pull request #1975:
URL: https://github.com/apache/hudi/pull/1975#discussion_r475930539



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -303,8 +305,28 @@ object DataSourceWriteOptions {
   val DEFAULT_HIVE_PARTITION_EXTRACTOR_CLASS_OPT_VAL = 
classOf[SlashEncodedDayPartitionValueExtractor].getCanonicalName
   val DEFAULT_HIVE_ASSUME_DATE_PARTITION_OPT_VAL = "false"
   val DEFAULT_USE_PRE_APACHE_INPUT_FORMAT_OPT_VAL = "false"
+  val DEFAULT_HIVE_CLIENT_CLASS_OPT_VAL = 
classOf[HoodieHiveJDBCClient].getCanonicalName
+
+  @deprecated
+  val HIVE_USE_JDBC_OPT_KEY = "hoodie.datasource.hive_sync.use_jdbc"
+  @deprecated
   val DEFAULT_HIVE_USE_JDBC_OPT_VAL = "true"
 
+  def translateUseJDBCToHiveClientClass(optParams: Map[String, String]) : 
Map[String, String] = {
+if (optParams.contains(HIVE_USE_JDBC_OPT_KEY) && 
!optParams.contains(HIVE_CLIENT_CLASS_OPT_KEY)) {
+  log.warn(HIVE_USE_JDBC_OPT_KEY + " is deprecated and will be removed in 
a later release; Please use " + HIVE_CLIENT_CLASS_OPT_KEY)
+  if (optParams(HIVE_USE_JDBC_OPT_KEY).equals("true")) {
+optParams ++ Map(HIVE_CLIENT_CLASS_OPT_KEY -> 
DEFAULT_HIVE_CLIENT_CLASS_OPT_VAL)
+  } else if (optParams(HIVE_USE_JDBC_OPT_KEY).equals("false")) {
+optParams ++ Map(HIVE_CLIENT_CLASS_OPT_KEY -> 
classOf[HoodieHiveDriverClient].getCanonicalName)

Review comment:
   Could you add a comment, either here or in `HoodieHiveDriverClient`, explaining 
why this class is used when `HIVE_USE_JDBC_OPT_KEY` is `false`?
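
   For reference, a small sketch of how the translation quoted above should behave 
for the deprecated key; this is an illustration against the code under review, not 
final behavior:

   ```
   // Hedged sketch: legacy use_jdbc=false is expected to resolve to the driver-based client.
   val legacyOpts = Map(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY -> "false")
   val translated = DataSourceWriteOptions.translateUseJDBCToHiveClientClass(legacyOpts)
   // Per the `false` branch above, `translated` should now contain
   // HIVE_CLIENT_CLASS_OPT_KEY -> classOf[HoodieHiveDriverClient].getCanonicalName
   ```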





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] jiegzhan commented on issue #1980: [SUPPORT] Small files (423KB) generated after running delete query

2020-08-24 Thread GitBox


jiegzhan commented on issue #1980:
URL: https://github.com/apache/hudi/issues/1980#issuecomment-679398095


   @bvaradar, until re-clustering is available, I tested 
[hoodie.cleaner.commits.retained](https://hudi.apache.org/docs/configurations.html#retainCommits).
   I set option("hoodie.cleaner.commits.retained", 1) and then issued a few delete 
queries. For each parquet file in S3, the latest version and one older version 
(sometimes, but not always) were kept; all other versions are gone from S3. Is 
this how it works?
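
   For reference, a minimal sketch of the kind of write being described, with 
hypothetical table, key and path names; only hoodie.cleaner.commits.retained is 
taken from the comment itself:

   ```
   // Hedged sketch: upsert with an aggressive cleaner setting so that only the latest
   // and (at most) one previous file version are retained.
   df.write.format("org.apache.hudi").
     option("hoodie.table.name", "my_table").                          // illustrative
     option("hoodie.datasource.write.recordkey.field", "pk_id").       // illustrative
     option("hoodie.datasource.write.precombine.field", "updated_at"). // illustrative
     option("hoodie.cleaner.commits.retained", "1").
     mode("append").
     save("s3://my-bucket/my_table")                                   // illustrative
   ```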



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


n3nash commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475925314



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * TimelineUtils provides a common way to query incremental meta-data changes 
for a hoodie table.
+ *
+ * This is useful in multiple places including:
+ * 1) HiveSync - this can be used to query partitions that changed since 
previous sync.
+ * 2) Incremental reads - InputFormats can use this API to queryxw
+ */
+public class TimelineUtils {
+
+  /**
+   * Returns partitions that have new data strictly after commitTime.
+   * Does not include internal operations such as clean in the timeline.
+   */
+  public static List getPartitionsWritten(HoodieTimeline timeline) {
+HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline();
+return getPartitionsMutated(timelineToSync);
+  }
+
+  /**
+   * Returns partitions that have been modified including internal operations 
such as clean in the passed timeline.
+   */
+  public static List getPartitionsMutated(HoodieTimeline timeline) {
+return timeline.filterCompletedInstants().getInstants().flatMap(s -> {
+  switch (s.getAction()) {
+case HoodieTimeline.COMMIT_ACTION:
+case HoodieTimeline.DELTA_COMMIT_ACTION:
+  try {
+HoodieCommitMetadata commitMetadata = 
HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(s).get(), 
HoodieCommitMetadata.class);
+return commitMetadata.getPartitionToWriteStats().keySet().stream();
+  } catch (IOException e) {
+throw new HoodieIOException("Failed to get partitions written 
between " + timeline.firstInstant() + " " + timeline.lastInstant(), e);
+  }
+case HoodieTimeline.CLEAN_ACTION:
+  try {
+HoodieCleanMetadata cleanMetadata = 
TimelineMetadataUtils.deserializeHoodieCleanMetadata(timeline.getInstantDetails(s).get());
+return cleanMetadata.getPartitionMetadata().keySet().stream();
+  } catch (IOException e) {
+throw new HoodieIOException("Failed to get partitions cleaned 
between " + timeline.firstInstant() + " " + timeline.lastInstant(), e);
+  }
+case HoodieTimeline.COMPACTION_ACTION:
+  // compaction is not a completed instant.  So no need to consider 
this action.
+case HoodieTimeline.SAVEPOINT_ACTION:
+case HoodieTimeline.ROLLBACK_ACTION:
+case HoodieTimeline.RESTORE_ACTION:
+  return Stream.empty();

Review comment:
   Do you want to throw an exception here for now so it's not treated 
incorrectly ?
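
   For illustration, a hedged Scala sketch of that suggestion (not code from the PR): 
compute partitions only for actions the helper understands and fail loudly for 
anything else, rather than silently returning none:

   ```
   import org.apache.hudi.common.table.timeline.HoodieTimeline

   // Hedged sketch: unexpected actions raise an error instead of contributing no partitions.
   def partitionsFor(action: String, partitionsInMetadata: => Seq[String]): Seq[String] = action match {
     case HoodieTimeline.COMMIT_ACTION | HoodieTimeline.DELTA_COMMIT_ACTION | HoodieTimeline.CLEAN_ACTION =>
       partitionsInMetadata
     case other =>
       throw new IllegalArgumentException("Unexpected action " + other + " in timeline")
   }
   ```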





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


n3nash commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475925115



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/TestTimelineUtils.java
##
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanPartitionMetadata;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.timeline.TimelineUtils;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.apache.hudi.common.util.Option;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestTimelineUtils extends HoodieCommonTestHarness {
+
+  @BeforeEach
+  public void setUp() throws Exception {
+initMetaClient();
+  }
+
+  @Test
+  public void testGetPartitions() throws IOException {
+HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();
+HoodieTimeline activeCommitTimeline = activeTimeline.getCommitTimeline();
+assertTrue(activeCommitTimeline.empty());
+
+String olderPartition = "0"; // older partitions that is modified by all 
cleans
+for (int i = 1; i <= 5; i++) {
+  String ts = i + "";
+  HoodieInstant instant = new HoodieInstant(true, 
HoodieTimeline.COMMIT_ACTION, ts);
+  activeTimeline.createNewInstant(instant);
+  activeTimeline.saveAsComplete(instant, Option.of(getCommitMeta(basePath, 
ts, ts, 2)));
+
+  HoodieInstant cleanInstant = new HoodieInstant(true, 
HoodieTimeline.CLEAN_ACTION, ts);
+  activeTimeline.createNewInstant(cleanInstant);
+  activeTimeline.saveAsComplete(cleanInstant, getCleanMeta(olderPartition, 
ts));
+}
+
+metaClient.reloadActiveTimeline();
+
+// verify modified partitions included cleaned data
+List partitions = 
TimelineUtils.getPartitionsMutated(metaClient.getActiveTimeline().findInstantsAfter("1",
 10));
+assertEquals(5, partitions.size());
+assertEquals(partitions, Arrays.asList(new String[]{"0", "2", "3", "4", 
"5"}));
+
+partitions = 
TimelineUtils.getPartitionsMutated(metaClient.getActiveTimeline().findInstantsInRange("1",
 "4"));
+assertEquals(4, partitions.size());
+assertEquals(partitions, Arrays.asList(new String[]{"0", "2", "3", "4"}));
+
+// verify only commit actions
+partitions = 
TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().findInstantsAfter("1",
 10));
+assertEquals(4, partitions.size());
+assertEquals(partitions, Arrays.asList(new String[]{"2", "3", "4", "5"}));
+
+partitions = 
TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().findInstantsInRange("1",
 "4"));
+assertEquals(3, partitions.size());
+assertEquals(partitions, Arrays.asList(new String[]{"2", "3", "4"}));
+  }
+
+  @Test
+  public void testGetPartitionsUnpartitioned() throws IOException {
+HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();
+HoodieTimeline activeCommitTimeline = activeTimeline.getCommitTimeline();
+assertTrue(activeCommitTimeline.empty());
+
+String partitionPath = "";
+for (int i = 1; i <= 5; i++) {
+  String ts = i + "";
+  HoodieInstant instant = new HoodieInstant(true, 
HoodieTimeline.COMMIT_ACTION, ts);
+  activeTimeline.createNewInstant(instant);
+  

[GitHub] [hudi] prashanthvg89 opened a new issue #2029: Records seen with _hoodie_is_deleted set to true on non-existing record

2020-08-24 Thread GitBox


prashanthvg89 opened a new issue #2029:
URL: https://github.com/apache/hudi/issues/2029


   If a Hudi table has, say, zero rows and I issue an upsert with 
_hoodie_is_deleted = true, the record is still visible when I read the table. It 
works if the record already existed, but as a user I expect the same behavior for 
a non-existing record too. So far the only way out seems to be to use a soft 
delete, or to query with _hoodie_is_deleted = false.
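
   For reference, a minimal sketch of the scenario being described; the DataFrame, 
options map and base path are hypothetical, and only the _hoodie_is_deleted flag 
comes from the report itself:

   ```
   // Hedged sketch: upsert records flagged for deletion via _hoodie_is_deleted.
   import org.apache.spark.sql.functions.lit

   val deletes = df.withColumn("_hoodie_is_deleted", lit(true))
   deletes.write.format("org.apache.hudi").
     option("hoodie.datasource.write.operation", "upsert").
     options(hudiOptions). // assumed to carry the usual record key / precombine settings
     mode("append").
     save(basePath)
   // Observed behavior per the report: if the key never existed, the row still shows up on read.
   ```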



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on a change in pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.

2020-08-24 Thread GitBox


n3nash commented on a change in pull request #2026:
URL: https://github.com/apache/hudi/pull/2026#discussion_r475920521



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java
##
@@ -147,7 +153,7 @@ public RemoteHoodieTableFileSystemView(String server, int 
port, HoodieTableMetaC
 String url = builder.toString();
 LOG.info("Sending request : (" + url + ")");
 Response response;
-int timeout = 1000 * 300; // 5 min timeout
+int timeout = this.timeoutSecs * 1000; // msec

Review comment:
   Does Request.connect expect the timeout in milliseconds?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


satishkotha commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475920279



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
##
@@ -296,6 +300,42 @@ public boolean isBeforeTimelineStarts(String instant) {
 return details.apply(instant);
   }
 
+  /**
+   * Returns partitions that have been modified in the timeline. This includes 
internal operations such as clean.
+   * Note that this only returns data for completed instants.
+   */
+  public List getPartitionsMutated() {
+return filterCompletedInstants().getInstants().flatMap(s -> {
+  switch (s.getAction()) {
+case HoodieTimeline.COMMIT_ACTION:
+case HoodieTimeline.DELTA_COMMIT_ACTION:
+  try {
+HoodieCommitMetadata commitMetadata = 
HoodieCommitMetadata.fromBytes(getInstantDetails(s).get(), 
HoodieCommitMetadata.class);
+return commitMetadata.getPartitionToWriteStats().keySet().stream();
+  } catch (IOException e) {
+throw new HoodieIOException("Failed to get partitions written 
between " + firstInstant() + " " + lastInstant(), e);
+  }
+case HoodieTimeline.CLEAN_ACTION:

Review comment:
   The method takes in a timeline. So hive sync only passes "commit" 
timeline to this method and gets only partitions modified by commit instants.  
I added another method just for clarity that only looks at commits. Let me know 
if you have any suggestions.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on a change in pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.

2020-08-24 Thread GitBox


n3nash commented on a change in pull request #2026:
URL: https://github.com/apache/hudi/pull/2026#discussion_r475920074



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java
##
@@ -52,7 +54,7 @@
   public static final String DEFAULT_FILESYSTEM_VIEW_INCREMENTAL_SYNC_MODE = 
"false";
   public static final String DEFUALT_REMOTE_VIEW_SERVER_HOST = "localhost";
   public static final Integer DEFAULT_REMOTE_VIEW_SERVER_PORT = 26754;
-
+  public static final Integer DEFAULT_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS = 
300 * 60; // 5 min

Review comment:
   @prashantwason How is this 5 mins ? If it's secs, it should just be 300 
? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


satishkotha commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475919445



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
##
@@ -296,6 +300,42 @@ public boolean isBeforeTimelineStarts(String instant) {
 return details.apply(instant);
   }
 
+  /**

Review comment:
   Thank you. Moved it





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


satishkotha commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475919254



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
##
@@ -232,6 +233,12 @@
*/
   Option getInstantDetails(HoodieInstant instant);
 
+  /**
+   * Returns partitions that have been modified in the timeline. This includes 
internal operations such as clean.
+   * Note that this only returns data for completed instants.
+   */
+  List getPartitionsMutated();

Review comment:
   created TimelineUtils





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated (ea983ff -> f7e02aa)

2020-08-24 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from ea983ff  [HUDI-1137] Add option to configure different path selector
 add f7e02aa  [MINOR] Update DOAP with 0.6.0 Release (#2024)

No new revisions were added by this update.

Summary of changes:
 doap_HUDI.rdf | 5 +
 1 file changed, 5 insertions(+)



[GitHub] [hudi] bhasudha merged pull request #2024: [MINOR] Update DOAP with 0.6.0 Release

2020-08-24 Thread GitBox


bhasudha merged pull request #2024:
URL: https://github.com/apache/hudi/pull/2024


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha commented on pull request #2016: [WIP] Add release page doc for 0.6.0

2020-08-24 Thread GitBox


bhasudha commented on pull request #2016:
URL: https://github.com/apache/hudi/pull/2016#issuecomment-679383433


   Closing this in favor of https://github.com/apache/hudi/pull/2028; capturing 
the comment there.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha closed pull request #2016: [WIP] Add release page doc for 0.6.0

2020-08-24 Thread GitBox


bhasudha closed pull request #2016:
URL: https://github.com/apache/hudi/pull/2016


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha opened a new pull request #2028: Site update and release page for 0.6.0

2020-08-24 Thread GitBox


bhasudha opened a new pull request #2028:
URL: https://github.com/apache/hudi/pull/2028


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1136) Add back findInstantsAfterOrEquals to the HoodieTimeline class

2020-08-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1136:
-
Labels: pull-request-available  (was: )

> Add back findInstantsAfterOrEquals to the HoodieTimeline class
> --
>
> Key: HUDI-1136
> URL: https://issues.apache.org/jira/browse/HUDI-1136
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475907308



##
File path: 
hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java
##
@@ -45,22 +45,39 @@
 import org.apache.avro.generic.GenericRecord;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 import org.apache.parquet.avro.AvroSchemaConverter;
 import org.apache.parquet.hadoop.ParquetWriter;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
 import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.StructType;
+
+import com.databricks.spark.avro.SchemaConverters;
+
+import scala.Function1;

Review comment:
   Figured out a way to fix it for now. Tests seem happy.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason opened a new pull request #2027: [HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class.

2020-08-24 Thread GitBox


prashantwason opened a new pull request #2027:
URL: https://github.com/apache/hudi/pull/2027


   ## What is the purpose of the pull request
   
   Add an API findInstantsAfterOrEquals  to HoodieTimeline.
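
   A hypothetical usage sketch only; the exact signature is not shown in this 
thread and is assumed here to mirror the existing findInstantsAfter(instantTime, 
numCommits):

   ```
   // Hedged sketch: fetch up to 10 instants at or after a given instant time.
   val instants = metaClient.getActiveTimeline
     .findInstantsAfterOrEquals("20200824000000", 10) // illustrative instant time
     .getInstants
   ```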
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


bvaradar commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475895369



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
##
@@ -296,6 +300,42 @@ public boolean isBeforeTimelineStarts(String instant) {
 return details.apply(instant);
   }
 
+  /**

Review comment:
   Timeline APIs are only about instants in general. I think adding 
partitions here is breaking that abstraction. Can you move this to some helper 
class?

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
##
@@ -232,6 +233,12 @@
*/
   Option<byte[]> getInstantDetails(HoodieInstant instant);
 
+  /**
+   * Returns partitions that have been modified in the timeline. This includes 
internal operations such as clean.
+   * Note that this only returns data for completed instants.
+   */
+  List<String> getPartitionsMutated();

Review comment:
   +1 on abstraction point. I think having a separate helper class would be 
better.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
##
@@ -296,6 +300,42 @@ public boolean isBeforeTimelineStarts(String instant) {
 return details.apply(instant);
   }
 
+  /**
+   * Returns partitions that have been modified in the timeline. This includes 
internal operations such as clean.
+   * Note that this only returns data for completed instants.
+   */
+  public List<String> getPartitionsMutated() {
+return filterCompletedInstants().getInstants().flatMap(s -> {
+  switch (s.getAction()) {
+case HoodieTimeline.COMMIT_ACTION:
+case HoodieTimeline.DELTA_COMMIT_ACTION:
+  try {
+HoodieCommitMetadata commitMetadata = 
HoodieCommitMetadata.fromBytes(getInstantDetails(s).get(), 
HoodieCommitMetadata.class);
+return commitMetadata.getPartitionToWriteStats().keySet().stream();
+  } catch (IOException e) {
+throw new HoodieIOException("Failed to get partitions written 
between " + firstInstant() + " " + lastInstant(), e);
+  }
+case HoodieTimeline.CLEAN_ACTION:

Review comment:
   We don't need to look at clean (and even compaction) for figuring out 
changed partitions for the case of hive syncing. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


satishkotha commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475901534



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
##
@@ -232,6 +233,12 @@
*/
   Option<byte[]> getInstantDetails(HoodieInstant instant);
 
+  /**
+   * Returns partitions that have been modified in the timeline. This includes 
internal operations such as clean.
+   * Note that this only returns data for completed instants.
+   */
+  List<String> getPartitionsMutated();

Review comment:
   @n3nash I want this to be in hudi-common, so this can be reused in 
hadoop-mr and hive-sync.  Do you want me to create TimelineUtils in common?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1135) Make timeline server timeout settings configurable

2020-08-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1135:
-
Labels: pull-request-available  (was: )

> Make timeline server timeout settings configurable
> --
>
> Key: HUDI-1135
> URL: https://issues.apache.org/jira/browse/HUDI-1135
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason opened a new pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.

2020-08-24 Thread GitBox


prashantwason opened a new pull request #2026:
URL: https://github.com/apache/hudi/pull/2026


   ## *Tips*
   ## What is the purpose of the pull request
   
   Make timeline server timeout settings configurable.
   
   ## Brief change log
   
   Add timeout config settings for timeline server.
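   As a rough sketch of how a timeout knob can be plumbed through a builder-style config (the property key and method names below are hypothetical placeholders, not the ones this PR introduces):
   
   ```java
   import java.util.Properties;
   
   // Hypothetical config holder; the real change wires a similar property into Hudi's config classes.
   class ExampleTimelineServerConfig {
     // Hypothetical key and default, for illustration only.
     static final String TIMEOUT_SECS_PROP = "hoodie.example.timeline.server.timeout.secs";
     static final int DEFAULT_TIMEOUT_SECS = 300;
   
     private final Properties props = new Properties();
   
     ExampleTimelineServerConfig timelineServerTimeoutSecs(int secs) {
       props.setProperty(TIMEOUT_SECS_PROP, String.valueOf(secs));
       return this;
     }
   
     int getTimelineServerTimeoutSecs() {
       return Integer.parseInt(
           props.getProperty(TIMEOUT_SECS_PROP, String.valueOf(DEFAULT_TIMEOUT_SECS)));
     }
   }
   ```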
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   This pull request is already covered by existing tests.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475893386



##
File path: 
hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java
##
@@ -45,22 +45,39 @@
 import org.apache.avro.generic.GenericRecord;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 import org.apache.parquet.avro.AvroSchemaConverter;
 import org.apache.parquet.hadoop.ParquetWriter;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
 import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.StructType;
+
+import com.databricks.spark.avro.SchemaConverters;
+
+import scala.Function1;

Review comment:
   @prashantwason it's best we don't import scala methods into java. 
com.databricks/avro is simply deprecated. We have equivalent functionality in 
spark-avro itself. In general we have to fix this test code that reads HFile as 
a DataSet in a different way.
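   For reference, a hedged sketch of the kind of replacement meant here, assuming the test only needs to turn an Avro schema into a Spark StructType and that the spark-avro module's SchemaConverters is on the classpath; the schema literal is just an example:
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.spark.sql.avro.SchemaConverters;
   import org.apache.spark.sql.types.StructType;
   
   class AvroToSparkSchemaExample {
     public static void main(String[] args) {
       // Example Avro schema; in the actual test this would come from the records being read.
       Schema avroSchema = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"ExampleRecord\",\"fields\":["
               + "{\"name\":\"key\",\"type\":\"string\"},"
               + "{\"name\":\"value\",\"type\":\"long\"}]}");
   
       // spark-avro's converter replaces the deprecated com.databricks.spark.avro.SchemaConverters.
       StructType sparkSchema = (StructType) SchemaConverters.toSqlType(avroSchema).dataType();
       System.out.println(sparkSchema.treeString());
     }
   }
   ```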





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


n3nash commented on a change in pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#discussion_r475892688



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
##
@@ -232,6 +233,12 @@
*/
   Option<byte[]> getInstantDetails(HoodieInstant instant);
 
+  /**
+   * Returns partitions that have been modified in the timeline. This includes 
internal operations such as clean.
+   * Note that this only returns data for completed instants.
+   */
+  List<String> getPartitionsMutated();

Review comment:
   I don't think this makes sense in the timeline as of now. If you take a 
look at the timeline APIs, they only talk about the metadata that has 
changed. `getPartitionsMutated` conceptually describes what has changed in 
the underlying data, as opposed to what has changed in the timeline per se. 
Generally, all of this information should come from the timeline, but that 
requires a full redesign of the timeline. Should we add this API here -> 
https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L195
 ? And you can wrap this functionality in a TimelineUtils? 
   When we have a clearer design on the timeline, we can merge the TimelineUtils 
back into the real timeline...
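   For illustration, a minimal sketch of such a helper, reusing the timeline-walking logic quoted above; the class and method names mirror the TimelineUtils suggestion, but the body is illustrative, not the eventual implementation:
   
   ```java
   import java.io.IOException;
   import java.util.List;
   import java.util.stream.Collectors;
   
   import org.apache.hudi.common.model.HoodieCommitMetadata;
   import org.apache.hudi.common.table.timeline.HoodieTimeline;
   import org.apache.hudi.exception.HoodieIOException;
   
   // Illustrative helper that keeps partition-level logic out of the timeline abstraction.
   public class TimelineUtils {
   
     public static List<String> getPartitionsWritten(HoodieTimeline timeline) {
       HoodieTimeline completed = timeline.filterCompletedInstants();
       return completed.getInstants().flatMap(instant -> {
         try {
           HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(
               completed.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
           // Partition paths touched by this commit / delta commit.
           return metadata.getPartitionToWriteStats().keySet().stream();
         } catch (IOException e) {
           throw new HoodieIOException("Failed to read metadata for instant " + instant, e);
         }
       }).distinct().collect(Collectors.toList());
     }
   }
   ```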





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 edited a comment on issue #1955: [SUPPORT] DMS partition treated as part of pk

2020-08-24 Thread GitBox


tooptoop4 edited a comment on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-679356547


   @nsivabalan  perfect, that is how I expect. perhaps the default should be 
global index? or documentation should be updated?
   
   Coming from an RDBMS background, the PK is unique at the table level, not at 
the partition level, but reading the configs below it is not clear that the hudi 
default is different, and I'm sure this will trip up many newcomers to hudi:
   
   "RECORDKEY_FIELD_OPT_KEY (Required): **Primary key** field(s). Nested fields 
can be specified using the dot notation eg: a.b.c. When using multiple columns 
as primary key use comma separated notation, eg: "col1,col2,col3,etc". Single 
or multiple columns as primary key specified by KEYGENERATOR_CLASS_OPT_KEY 
property.
   Default value: "uuid"
   
   PARTITIONPATH_FIELD_OPT_KEY (Required): Columns to be used for 
**partitioning** the table. To prevent partitioning, provide empty string as 
value eg: "". Specify partitioning/no partitioning using 
KEYGENERATOR_CLASS_OPT_KEY. If synchronizing to hive, also specify using 
HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY.
   Default value: "partitionpath""
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

2020-08-24 Thread GitBox


tooptoop4 commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-679356547


   perfect, that is how I expect. perhaps the default should be global index? 
or documentation should be updated?
   
   Coming from an RDBMS background, the PK is unique at the table level, not at 
the partition level, but reading the configs below it is not clear that the hudi 
default is different, and I'm sure this will trip up many newcomers to hudi:
   
   "RECORDKEY_FIELD_OPT_KEY (Required): **Primary key** field(s). Nested fields 
can be specified using the dot notation eg: a.b.c. When using multiple columns 
as primary key use comma separated notation, eg: "col1,col2,col3,etc". Single 
or multiple columns as primary key specified by KEYGENERATOR_CLASS_OPT_KEY 
property.
   Default value: "uuid"
   
   PARTITIONPATH_FIELD_OPT_KEY (Required): Columns to be used for 
**partitioning** the table. To prevent partitioning, provide empty string as 
value eg: "". Specify partitioning/no partitioning using 
KEYGENERATOR_CLASS_OPT_KEY. If synchronizing to hive, also specify using 
HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY.
   Default value: "partitionpath""
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 commented on issue #1954: [SUPPORT] DMS Caused by: java.lang.IllegalArgumentException: Partition key parts [] does not match with partition values

2020-08-24 Thread GitBox


tooptoop4 commented on issue #1954:
URL: https://github.com/apache/hudi/issues/1954#issuecomment-679353671


   @bvaradar in each comment I am trying brand new tables with different 
spark-submits, so I am not changing an existing table.
   
   try to reproduce with
   
   /home/ec2-user/spark_home/bin/spark-submit --conf 
"spark.hadoop.fs.s3a.proxy.host=redact" --conf 
"spark.hadoop.fs.s3a.proxy.port=redact" --conf 
"spark.driver.extraClassPath=/home/ec2-user/json-20090211.jar" --conf 
"spark.executor.extraClassPath=/home/ec2-user/json-20090211.jar" --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --jars 
"/home/ec2-user/spark-avro_2.11-2.4.6.jar" --master spark://redact:7077 
--deploy-mode client /home/ec2-user/hudi-utilities-bundle_2.11-0.5.3-1.jar 
--table-type COPY_ON_WRITE --source-ordering-field TimeCreated --source-class 
org.apache.hudi.utilities.sources.ParquetDFSSource --enable-hive-sync 
--hoodie-conf hoodie.datasource.hive_sync.database=redact --hoodie-conf 
hoodie.datasource.hive_sync.table=dmstest_multpk7 --hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
 --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false --target-base-path 
s3a://redact/my2/multpk7 --target-table dmstest_multpk7 --transformer-class 
org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class 
org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf 
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
 --hoodie-conf hoodie.datasource.write.recordkey.field=version_no,group_company 
--hoodie-conf "hoodie.datasource.write.partitionpath.field=" --hoodie-conf 
hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/tbl
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

2020-08-24 Thread GitBox


nsivabalan commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-679352339


   @tooptoop4 : can you clarify what you mean by this.
   ```
   ie for each version_no,group_company combo, i want to get the latest row by 
TimeCreated (ie the source-ordering-field) and then partition on whatever 
sys_user that latest row has.
   ```
   But in general, yes, if you use global index with the update partition path 
set, you should not see any duplicates in your entire hoodie dataset. 
   
   I can try to illustrate with an example. Let's say each row consists of only 4 
values: v_no (version no), cmp (group_company), time_cr, sys_user.
   In case of a regular index, the combination of record key and partition path 
forms the unique key. 
   
   If you are using regular index and ingest 
   v_1, c_1, t_1, u_1
   v_2, c_1, t_1, u_1
   v_1, c_1, t_1, u_2
   v_1, c_1, t_1, u_3
   
   This will result in 2 rows going to partition u_1, 1 row to partition u_2, 
and one row to u_3. 
   
   In a 2nd batch of updates, let's say you ingest a few more rows. 
   v_1, c_1, t_2, u_1
   v_3, c_1, t_2, u_1
   v_1, c_2, t_2, u_2
   v_1, c_3, t_2, u_3
   
   Here is the result
   u_1:
   v_1, c_1, t_2, u_1 (updated with latest value)
   v_2, c_1, t_1, u_1
   v_3, c_1, t_2, u_1 (insert from 2nd batch)
   u_2:
   v_1, c_2, t_2, u_2 (updated with latest value)
   u_3:
   v_1, c_1, t_1, u_3
   v_1, c_3, t_2, u_3(insert from 2nd batch)
   
   In case of a global index, only record keys are unique. 
   Let's see an example with global bloom, but with the update partition path 
config not set.
   
   If 1st batch of ingest contains
   v_1, c_1, t_1, u_1
   v_1, c_2, t_1, u_1
   v_2, c_1, t_1, u_2
   v_3, c_1, t_1, u_3
   
   result will be. 
   
   v_1, c_1, t_1, u_1 
   v_1, c_2, t_1, u_1
   v_2, c_1, t_1, u_2 
   v_3, c_1, t_1, u_3
   
   And 2nd batch of ingest contains 
   v_1, c_1, t_2, u_1 (updating with latest time)
   v_1, c_2, t_2, u_2 (moving v1,c2 from u_1 to u_2). The expectation is that this 
will update u_1 only, since the config is not set, and hence the new partition path, 
i.e. u_2, will be ignored. 
   v_2, c_2, t_2, u_2 (new insert)
   v_1, c_3, t_2, u_3 (new insert)
   
   So, the result will be
   v_1, c_1, t_2, u_1 (updated with latest time)
   v_1, c_2, t_2, u_1 (updated with latest time even though incoming record was 
sent to u_2)
   v_2, c_1, t_1, u_2 
   v_2, c_2, t_2, u_2 (new insert)
   v_3, c_1, t_1, u_3
   v_1, c_3, t_2, u_3 (new insert)
   
   We can do the same with the config value set. 
   
   result from first batch:
   v_1, c_1, t_1, u_1 
   v_1, c_2, t_1, u_1
   v_2, c_1, t_1, u_2 
   v_3, c_1, t_1, u_3
   
   And 2nd batch of ingest contains 
   v_1, c_1, t_2, u_1 (updating with latest time)
   v_1, c_2, t_2, u_2 (moving v1,c2 from u_1 to u_2). The expectation is that this 
will insert a new record into u_2 and delete the corresponding record from u_1, since 
the config is set.
   v_2, c_2, t_2, u_2 (new insert)
   v_1, c_3, t_2, u_3 (new insert)
   
   So, the result will be
   v_1, c_1, t_2, u_1 (updated with latest time)
   v_1, c_2, t_2, u_2 (updated with latest time and old record is deleted)
   v_2, c_1, t_1, u_2 
   v_2, c_2, t_2, u_2 (new insert)
   v_3, c_1, t_1, u_3
   v_1, c_3, t_2, u_3 (new insert)
   
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash merged pull request #2023: [HUDI-1137] Add option to configure different path selector

2020-08-24 Thread GitBox


n3nash merged pull request #2023:
URL: https://github.com/apache/hudi/pull/2023


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1137] Add option to configure different path selector

2020-08-24 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new ea983ff  [HUDI-1137] Add option to configure different path selector
ea983ff is described below

commit ea983ff912dab8604eec3085bc1c041cb6e60bc8
Author: Satish Kotha 
AuthorDate: Mon Aug 24 11:11:10 2020 -0700

[HUDI-1137] Add option to configure different path selector
---
 .../integ/testsuite/helpers/DFSTestSuitePathSelector.java  | 14 --
 .../main/java/org/apache/hudi/utilities/UtilHelpers.java   |  9 +++--
 .../org/apache/hudi/utilities/sources/AvroDFSSource.java   |  3 +--
 .../hudi/utilities/sources/helpers/DFSPathSelector.java|  1 +
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/helpers/DFSTestSuitePathSelector.java
 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/helpers/DFSTestSuitePathSelector.java
index b67e21f..bfc8368 100644
--- 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/helpers/DFSTestSuitePathSelector.java
+++ 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/helpers/DFSTestSuitePathSelector.java
@@ -32,12 +32,17 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.ImmutablePair;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.integ.testsuite.HoodieTestSuiteJob;
 import org.apache.hudi.utilities.sources.helpers.DFSPathSelector;
 
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
 /**
  * A custom dfs path selector used only for the hudi test suite. To be used 
only if workload is not run inline.
  */
 public class DFSTestSuitePathSelector extends DFSPathSelector {
+  private static volatile Logger log = 
LoggerFactory.getLogger(HoodieTestSuiteJob.class);
 
   public DFSTestSuitePathSelector(TypedProperties props, Configuration 
hadoopConf) {
 super(props, hadoopConf);
@@ -54,9 +59,12 @@ public class DFSTestSuitePathSelector extends 
DFSPathSelector {
 lastBatchId = Integer.parseInt(lastCheckpointStr.get());
 nextBatchId = lastBatchId + 1;
   } else {
-lastBatchId = -1;
-nextBatchId = 0;
+lastBatchId = 0;
+nextBatchId = 1;
   }
+
+  log.info("Using DFSTestSuitePathSelector, checkpoint: " + 
lastCheckpointStr + " sourceLimit: " + sourceLimit
+  + " lastBatchId: " + lastBatchId + " nextBatchId: " + nextBatchId);
   // obtain all eligible files for the batch
   List eligibleFiles = new ArrayList<>();
   FileStatus[] fileStatuses = fs.globStatus(
@@ -73,6 +81,8 @@ public class DFSTestSuitePathSelector extends DFSPathSelector 
{
   }
 }
   }
+
+  log.info("Reading " + eligibleFiles.size() + " files. ");
   // no data to readAvro
   if (eligibleFiles.size() == 0) {
 return new ImmutablePair<>(Option.empty(),
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
index 0531196..14e16ab 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
@@ -352,12 +352,17 @@ public class UtilHelpers {
 }
   }
 
-  public static DFSPathSelector createSourceSelector(String 
sourceSelectorClass, TypedProperties props,
+  public static DFSPathSelector createSourceSelector(TypedProperties props,
   Configuration conf) throws IOException {
+String sourceSelectorClass =
+props.getString(DFSPathSelector.Config.SOURCE_INPUT_SELECTOR, 
DFSPathSelector.class.getName());
 try {
-  return (DFSPathSelector) ReflectionUtils.loadClass(sourceSelectorClass,
+  DFSPathSelector selector = (DFSPathSelector) 
ReflectionUtils.loadClass(sourceSelectorClass,
   new Class[]{TypedProperties.class, Configuration.class},
   props, conf);
+
+  LOG.info("Using path selector " + selector.getClass().getName());
+  return selector;
 } catch (Throwable e) {
   throw new IOException("Could not load source selector class " + 
sourceSelectorClass, e);
 }
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java
index b5ce96f..b8f29e8 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java
@@ -47,8 +47,7 @@ public class AvroDFSSource extends AvroSource {
   SchemaProvider schemaProvider) throws IOException {
 super(props, sparkContext, sparkSession, 

[GitHub] [hudi] nsivabalan commented on a change in pull request #2016: [WIP] Add release page doc for 0.6.0

2020-08-24 Thread GitBox


nsivabalan commented on a change in pull request #2016:
URL: https://github.com/apache/hudi/pull/2016#discussion_r475863568



##
File path: docs/_pages/releases.md
##
@@ -5,6 +5,72 @@ layout: releases
 toc: true
 last_modified_at: 2020-05-28T08:40:00-07:00
 ---
+## [Release 0.6.0](https://github.com/apache/hudi/releases/tag/release-0.6.0) 
([docs](/docs/0.6.0-quick-start-guide.html))
+
+### Download Information
+ * Source Release : [Apache Hudi 0.6.0 Source 
Release](https://downloads.apache.org/hudi/0.6.0/hudi-0.6.0.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.6.0/hudi-0.6.0.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.6.0/hudi-0.6.0.src.tgz.sha512))
+ * Apache Hudi jars corresponding to this release is available 
[here](https://repository.apache.org/#nexus-search;quick~hudi)
+
+### Migration Guide for this release
+ - With 0.6.0 Hudi is moving from list based rollback to marker based 
rollbacks. To smoothly aid this transition a 
+ new property called `hoodie.table.version` is added to hoodie.properties 
file. Whenever hoodie is launched with 
+ newer table version i.e 1 (or moving from pre 0.6.0 to 0.6.0), an upgrade 
step will be executed automatically 
+ to adhere to marker based rollback. This automatic upgrade step will happen 
just once per dataset as the 
+ `hoodie.table.version` will be updated in property file after upgrade is 
completed.
+ - Similarly, a command line tool for downgrading is added in case some 
users want to downgrade hoodie from 
+ table version 1 to 0 or move from hoodie 0.6.0 to pre 0.6.0
+ 
+### Release Highlights
+
+ Ingestion side improvements:
+  - Hudi now supports `Azure Data Lake Storage V2` , `Alluxio` and `Tencent 
Cloud Object Storage` storages.
+  - Add support for "bulk_insert" without converting to RDD. This has better 
performance compared to existing "bulk_insert".
+This implementation uses Datasource for writing to storage with support 
for key generators to operate on Row 
+(rather than HoodieRecords as per previous "bulk_insert") is added.
+  - # TODO Add more about bulk insert modes. 
+  - # TODO Add more on bootstrap. 
+  - In previous versions, auto clean runs synchronously after ingestion. 
Starting 0.6.0, Hudi does cleaning and ingestion in parallel.
+  - Support async compaction for spark streaming writes to hudi table. 
Previous versions supported only inline compactions.
+  - Implemented rollbacks using marker files instead of relying on commit 
metadata. Please check the migration guide for more details on this.
+  - A new InlineFileSystem has been added to support embedding any file format 
as an inline format within a regular file.
+
+ Query side improvements:
+  - Starting 0.6.0, snapshot queries are feasible via spark datasource. 
+  - In prior versions we only supported HoodieCombineHiveInputFormat for 
CopyOnWrite tables to ensure that there is a limit on the number of mappers 
spawned for
+any query. Hudi now supports Merge on Read tables also using 
HoodieCombineInputFormat.
+  - Speedup spark read queries by caching metaclient in HoodieROPathFilter. 
This helps reduce listing related overheads in S3 when filtering files for 
read-optimized queries. 
+
+ DeltaStreamer improvements:
+  - HoodieMultiDeltaStreamer: adds support for ingesting multiple kafka 
streams in a single DeltaStreamer deployment
+  - Added a new tool - InitialCheckPointProvider, to set checkpoints when 
migrating to DeltaStreamer after an initial load of the table is complete.
+  - Add CSV source support.
+  - Added a chained transformer that can chain multiple transformers.
+
+ Indexing improvements:
+  - Added a new index `HoodieSimpleIndex` which joins incoming records with 
base files to index records.
+  - Added ability to configure user defined indexes.
+
+ Key generation improvements:
+  - Introduced `CustomTimestampBasedKeyGenerator` to support complex keys as 
record key and custom partition paths.
+  - Support more time units and date/time formats in 
`TimestampBasedKeyGenerator`
+
+ Developer productivity and monitoring improvements:
+  - Spark DAGs are named to aid better debuggability
+  - Console, JMX, Prometheus and DataDog metric reporters have been added.
+  - Support pluggable metrics reporting by introducing proper abstraction for 
user defined metrics.
+
+ CLI related features:
+  - Added support for deleting savepoints via CLI
+  - Added a new command - `export instants`, to export metadata of instants

Review comment:
   sure. you can add something like this. Feel free to edit as per 
convenience.
   ```
   A command line tool is added to hudi-cli, to assist in upgrading or 
downgrading the hoodie dataset. "UPGRADE" or "DOWNGRADE" is the command to use. 
DOWNGRADE has to be done using hudi-cli if someone prefers to downgrade their 
hoodie dataset from 0.6.0 to any pre 0.6.0 versions. 
   ```





[jira] [Updated] (HUDI-1056) Ensure validate_staged_release.sh also runs against released version in release repo

2020-08-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1056:
-
Labels: pull-request-available  (was: )

> Ensure validate_staged_release.sh also runs against released version in 
> release repo
> 
>
> Key: HUDI-1056
> URL: https://issues.apache.org/jira/browse/HUDI-1056
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Release  Administrative
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Currently, validate_staged_release.sh expects rc_num to be set. Once we 
> release the version, we need this script to download released source tarball 
> and run checks. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan opened a new pull request #2025: [HUDI-1056] Fix release validate script for rc_num and release_type

2020-08-24 Thread GitBox


nsivabalan opened a new pull request #2025:
URL: https://github.com/apache/hudi/pull/2025


   ## What is the purpose of the pull request
   
   Fixing release validate script for making rc_num optional and to introduce 
release_type(dev/release)
   
   ## Brief change log
   
   Fixing release validate script for making rc_num optional and to introduce 
release_type(dev/release)
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ x] Has a corresponding JIRA in PR title & commit

- [ x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1962: [SUPPORT] Unable to filter hudi table in hive on partition column

2020-08-24 Thread GitBox


bvaradar commented on issue #1962:
URL: https://github.com/apache/hudi/issues/1962#issuecomment-679316776


   For the second case, Hive Metastore would be filtering out partitions and 
only return specific paths. I think there is some inconsistency between the 
path used in the filesystem and the one that is present in meta-store. 
   
   @sassai : Sorry for the delay. Can you recursively list your hoodie data set 
and attach the output? Also please add the file contents of the latest .commit or 
.deltacommit file.
   
   Also, add the output for one of the partition with location: 
   describe formatted table_name partition 
(year=xxx,month=xxx,day=xxx,hour=xxx,minute=xxx);
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-24 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r475815871



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
##
@@ -81,6 +87,7 @@ public Builder fromProperties(Properties props) {
 
 public Builder limitFileSize(long maxFileSize) {
   props.setProperty(PARQUET_FILE_MAX_BYTES, String.valueOf(maxFileSize));
+  props.setProperty(HFILE_FILE_MAX_BYTES, String.valueOf(maxFileSize));

Review comment:
   Not sure if it's a good idea to overload two configs like this. We may 
need to break this builder method up separately.
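   To make that concrete, a sketch of splitting the builder method; the literal key strings are my assumption of what the PARQUET/HFILE constants resolve to and should be treated as placeholders:
   
   ```java
   import java.util.Properties;
   
   // Sketch of splitting the overloaded builder method so the two sizes can be tuned independently.
   class StorageConfigBuilderSketch {
     static final String PARQUET_FILE_MAX_BYTES = "hoodie.parquet.max.file.size";
     static final String HFILE_FILE_MAX_BYTES = "hoodie.hfile.max.file.size";
   
     private final Properties props = new Properties();
   
     StorageConfigBuilderSketch parquetMaxFileSize(long maxFileSize) {
       props.setProperty(PARQUET_FILE_MAX_BYTES, String.valueOf(maxFileSize));
       return this;
     }
   
     StorageConfigBuilderSketch hfileMaxFileSize(long maxFileSize) {
       props.setProperty(HFILE_FILE_MAX_BYTES, String.valueOf(maxFileSize));
       return this;
     }
   }
   ```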

##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java
##
@@ -55,7 +56,7 @@
   private long recordsWritten = 0;
   private long insertRecordsWritten = 0;
   private long recordsDeleted = 0;
-  private Iterator<HoodieRecord<T>> recordIterator;
+  private Map<String, HoodieRecord<T>> recordMap;

Review comment:
   Need to ensure that having this be a map won't affect normal inserts.

##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+/**
+ * Hoodie merge handle which writes records (new inserts or updates) sorted by 
their key.
+ *
+ * The implementation performs a merge-sort by comparing the key of the record 
being written to the list of
+ * keys in newRecordKeys (sorted in-memory).
+ */
+public class HoodieSortedMergeHandle<T extends HoodieRecordPayload> extends 
HoodieMergeHandle<T> {
+
+  private Queue<String> newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,
+   Iterator<HoodieRecord<T>> recordItr, String partitionPath, String 
fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, 
sparkTaskContextSupplier);
+newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,
+  Map<String, HoodieRecord<T>> keyToNewRecordsOrig, String partitionPath, 
String fileId,
+  HoodieBaseFile dataFileToBeMerged, SparkTaskContextSupplier 
sparkTaskContextSupplier) {
+super(config, instantTime, hoodieTable, keyToNewRecordsOrig, 
partitionPath, fileId, dataFileToBeMerged,
+sparkTaskContextSupplier);
+
+newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Go through an old record. Here if we detect a newer version shows up, we 
write the new one to the file.
+   */
+  @Override
+  public void write(GenericRecord oldRecord) {
+String key = 
oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+
+// To maintain overall sorted order across updates and inserts, write any 
new inserts whose keys are less than
+// the oldRecord's key.
+while (!newRecordKeysSorted.isEmpty() && 
newRecordKeysSorted.peek().compareTo(key) <= 0) {
+  String keyToPreWrite = newRecordKeysSorted.remove();

Review comment:
   Instead, we can just do a streaming sort-merge?
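   A generic sketch of the streaming sort-merge idea (two already-sorted iterators merged lazily instead of buffering all new keys up front); plain Java for illustration, not the handle's actual API:
   
   ```java
   import java.util.Arrays;
   import java.util.Iterator;
   import java.util.NoSuchElementException;
   
   // Merges two iterators that are each already sorted by key, emitting keys in global sorted order.
   class SortedMergeIterator implements Iterator<String> {
     private final Iterator<String> left;
     private final Iterator<String> right;
     private String nextLeft;
     private String nextRight;
   
     SortedMergeIterator(Iterator<String> left, Iterator<String> right) {
       this.left = left;
       this.right = right;
       nextLeft = left.hasNext() ? left.next() : null;
       nextRight = right.hasNext() ? right.next() : null;
     }
   
     @Override
     public boolean hasNext() {
       return nextLeft != null || nextRight != null;
     }
   
     @Override
     public String next() {
       if (!hasNext()) {
         throw new NoSuchElementException();
       }
       // Pick the smaller head; ties go to the left side, mirroring the "<= 0" check in the handle above.
       if (nextRight == null || (nextLeft != null && nextLeft.compareTo(nextRight) <= 0)) {
         String out = nextLeft;
         nextLeft = left.hasNext() ? left.next() : null;
         return out;
       } else {
         String out = nextRight;
         nextRight = right.hasNext() ? right.next() : null;
         return out;
       }
     }
   
     public static void main(String[] args) {
       Iterator<String> merged = new SortedMergeIterator(
           Arrays.asList("a", "c", "e").iterator(),
           Arrays.asList("b", "c", "d").iterator());
       merged.forEachRemaining(System.out::println); // prints a b c c d e
     }
   }
   ```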

##
File path: 
hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestHoodieDemo.java
##
@@ -115,6 +115,35 @@ public void testParquetDemo() throws Exception {
 testIncrementalHiveQueryAfterCompaction();
   }
 
+  @Test
+  public void testHFileDemo() throws Exception {

Review comment:
   This is good, but will add to our integration test.

[hudi] branch asf-site updated: Travis CI build asf-site

2020-08-24 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new e995bd2  Travis CI build asf-site
e995bd2 is described below

commit e995bd2819c7df640826bada28f9dd87c0110a62
Author: CI 
AuthorDate: Mon Aug 24 19:11:45 2020 +

Travis CI build asf-site
---
 content/assets/js/lunr/lunr-store.js |  2 +-
 content/cn/docs/querying_data.html   | 12 ++---
 content/docs/comparison.html |  4 +-
 content/docs/configurations.html | 88 
 content/docs/deployment.html |  6 +--
 content/docs/powered_by.html |  4 ++
 content/docs/querying_data.html  | 49 
 content/docs/structure.html  |  2 +-
 8 files changed, 135 insertions(+), 32 deletions(-)

diff --git a/content/assets/js/lunr/lunr-store.js 
b/content/assets/js/lunr/lunr-store.js
index 5e4b619..c07019e 100644
--- a/content/assets/js/lunr/lunr-store.js
+++ b/content/assets/js/lunr/lunr-store.js
@@ -825,7 +825,7 @@ var store = [{
 "url": "https://hudi.apache.org/docs/writing_data.html;,
 "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
 "title": "查询 Hudi 数据集",
-"excerpt":"从概念上讲,Hudi物理存储一次数据到DFS上,同时在其上提供三个逻辑视图,如之前所述。 数据集同步到Hive 
Metastore后,它将提供由Hudi的自定义输入格式支持的Hive外部表。一旦提供了适当的Hudi捆绑包, 
就可以通过Hive、Spark和Presto之类的常用查询引擎来查询数据集。 具体来说,在写入过程中传递了两个由table name命名的Hive表。 
例如,如果table name = hudi_tbl,我们得到 hudi_tbl 实现了由 HoodieParquetInputFormat 
支持的数据集的读优化视图,从而提供了纯列式数据。 hudi_tbl_rt 实现了由 HoodieParquetRealtimeInputFormat 
支持的数据集的实时视图,从而提供了基础数据和日志数据的合并视图。 如概念部分所述,增量处理所需要的 
一个关键原语是增量拉取(以从数据集中获取更改流/日志)。您可以增量提取Hudi数据集,这意味着自指定的即时时间起, 您可�
 �只获得全部更新和新行。 这与插入更新一起使用,对于构建某 [...]
+"excerpt":"从概念上讲,Hudi物理存储一次数据到DFS上,同时在其上提供三个逻辑视图,如之前所述。 数据集同步到Hive 
Metastore后,它将提供由Hudi的自定义输入格式支持的Hive外部表。一旦提供了适当的Hudi捆绑包, 
就可以通过Hive、Spark和Presto之类的常用查询引擎来查询数据集。 具体来说,在写入过程中传递了两个由table name命名的Hive表。 
例如,如果table name = hudi_tbl,我们得到 hudi_tbl 实现了由 HoodieParquetInputFormat 
支持的数据集的读优化视图,从而提供了纯列式数据。 hudi_tbl_rt 实现了由 HoodieParquetRealtimeInputFormat 
支持的数据集的实时视图,从而提供了基础数据和日志数据的合并视图。 如概念部分所述,增量处理所需要的 
一个关键原语是增量拉取(以从数据集中获取更改流/日志)。您可以增量提取Hudi数据集,这意味着自指定的即时时间起, 您可�
 �只获得全部更新和新行。 这与插入更新一起使用,对于构建某 [...]
 "tags": [],
 "url": "https://hudi.apache.org/cn/docs/querying_data.html;,
 "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
diff --git a/content/cn/docs/querying_data.html 
b/content/cn/docs/querying_data.html
index e808682..3129f6d 100644
--- a/content/cn/docs/querying_data.html
+++ b/content/cn/docs/querying_data.html
@@ -375,7 +375,7 @@
   增量拉取
 
   
-  Presto
+  PrestoDB
   Impala (3.4 or later)
 
   读优化表
@@ -434,7 +434,7 @@
   Y
 
 
-  Presto
+  PrestoDB
   Y
   N
 
@@ -477,8 +477,8 @@
   Y
 
 
-  Presto
-  N
+  PrestoDB
+  Y
   N
   Y
 
@@ -703,9 +703,9 @@ Upsert实用程序(HoodieDeltaStreamer
   
 
 
-Presto
+PrestoDB
 
-Presto是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在Presto中无缝查询。
+PrestoDB是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在Presto中无缝查询。
 这需要在整个安装过程中将hudi-presto-bundle 
jar放入presto_install/plugin/hive-hadoop2/中。
 
 Impala (3.4 or later)
diff --git a/content/docs/comparison.html b/content/docs/comparison.html
index 3fd1343..2a50a60 100644
--- a/content/docs/comparison.html
+++ b/content/docs/comparison.html
@@ -392,7 +392,7 @@ we expect Hudi to positioned at something that ingests 
parquet with superior per
 Hive transactions does not offer the read-optimized storage option or the 
incremental pulling, that Hudi does. In terms of implementation choices, Hudi 
leverages
 the full power of a processing framework like Spark, while Hive transactions 
feature is implemented underneath by Hive tasks/queries kicked off by user or 
the Hive metastore.
 Based on our production experience, embedding Hudi as a library into existing 
Spark pipelines was much easier and less operationally heavy, compared with the 
other approach.
-Hudi is also designed to work with non-hive enginers like Presto/Spark and 
will incorporate file formats other than parquet over time.
+Hudi is also designed to work with non-hive engines like PrestoDB/Spark and 
will incorporate file formats other than parquet over time.
 
 HBase
 
@@ -410,7 +410,7 @@ integration of Hudi library with Spark/Spark streaming 
DAGs. In case of Non-Spar
 and later sent into a Hudi table via a Kafka topic/DFS intermediate file. In 
more conceptual level, data processing
 pipelines just consist of three components : source, processing, sink, with users ultimately running queries 
against the sink to use the results of the pipeline.
 Hudi can act as either a source or sink, that stores data on DFS. 
Applicability of Hudi to a given stream processing pipeline ultimately boils 
down to suitability
-of Presto/SparkSQL/Hive for your 

[GitHub] [hudi] bvaradar commented on issue #1954: [SUPPORT] DMS Caused by: java.lang.IllegalArgumentException: Partition key parts [] does not match with partition values

2020-08-24 Thread GitBox


bvaradar commented on issue #1954:
URL: https://github.com/apache/hudi/issues/1954#issuecomment-679305798


   @tooptoop4 : IIUC, are you effectively changing a table from non-partitioned 
to partitioned? The exception you added in the last comment was about a 
missing file, which does not tie up with your comments. Can you elaborate on the 
steps to reproduce this?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] jpugliesi edited a comment on issue #2002: [SUPPORT] Inconsistent Commits between CLI and Incremental Query

2020-08-24 Thread GitBox


jpugliesi edited a comment on issue #2002:
URL: https://github.com/apache/hudi/issues/2002#issuecomment-679298548


   @bvaradar I suspected this may have been the case, but I was not able to 
find any documentation anywhere that states that a commit tracks the timestamp 
of when a _specific subset of records is changed_, as opposed to the timestamp 
of when a write operation was executed. Does such documentation exist, and if 
so, can you please point me to it?
   
   Since an incremental query does not necessarily contain the full set of table 
commits, is there an alternative way to get the full table commit history via 
Spark or some other API _besides the CLI_?
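   Not documentation, but a hedged sketch of one way to list completed commits programmatically with the table meta client instead of the CLI; the constructor and method names are as I understand the API for this version, so please verify against your release:
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hudi.common.table.HoodieTableMetaClient;
   import org.apache.hudi.common.table.timeline.HoodieInstant;
   import org.apache.hudi.common.table.timeline.HoodieTimeline;
   
   public class ListCommitsExample {
     public static void main(String[] args) {
       String basePath = "hdfs:///path/to/hudi/table"; // example path, replace with your table's base path
       HoodieTableMetaClient metaClient = new HoodieTableMetaClient(new Configuration(), basePath);
   
       // Commits and delta commits that have completed on the active timeline.
       HoodieTimeline commits = metaClient.getCommitsTimeline().filterCompletedInstants();
   
       commits.getInstants()
           .map(HoodieInstant::getTimestamp)
           .forEach(System.out::println);
     }
   }
   ```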



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha opened a new pull request #2024: [MINOR] Update DOAP with 0.6.0 Release

2020-08-24 Thread GitBox


bhasudha opened a new pull request #2024:
URL: https://github.com/apache/hudi/pull/2024


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] jpugliesi commented on issue #2002: [SUPPORT] Inconsistent Commits between CLI and Incremental Query

2020-08-24 Thread GitBox


jpugliesi commented on issue #2002:
URL: https://github.com/apache/hudi/issues/2002#issuecomment-679298548


   @bvaradar I suspected this may have been the case, but I was not able to 
find any documentation anywhere that states that a commit tracks the timestamp 
of when a _specific subset of records is changed_, as opposed to the timestamp 
of when a write operation was executed. Does such documentation exist, and if 
so, can you please point me to it?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-24 Thread GitBox


n3nash commented on pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#issuecomment-679298263


   Could we structure this as a "BaseTableMetaClient", "HoodieTableMetaClient" 
and "HoodieTableIncrementalMetaClient" ?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-424) Implement Hive Query Side Integration for querying tables containing bootstrap file slices

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-424:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Implement Hive Query Side Integration for querying tables containing 
> bootstrap file slices
> --
>
> Key: HUDI-424
> URL: https://issues.apache.org/jira/browse/HUDI-424
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.1
>
>  Time Spent: 336h
>  Remaining Estimate: 0h
>
> Support for Hive read-optimized and realtime queries 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-769) Write blog about HoodieMultiTableDeltaStreamer in cwiki

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-769:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Write blog about HoodieMultiTableDeltaStreamer in cwiki
> ---
>
> Key: HUDI-769
> URL: https://issues.apache.org/jira/browse/HUDI-769
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs, docs-chinese
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-575:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Currenlty, only inline compaction is supported for Structured streaming 
> writes. 
>  
> We need to 
>  * Enable configuring async compaction for streaming writes 
>  * Implement a parallel compaction process like we did for delta streamer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-806) Implement support for bootstrapping via Spark datasource API

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-806:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Implement support for bootstrapping via Spark datasource API
> 
>
> Key: HUDI-806
> URL: https://issues.apache.org/jira/browse/HUDI-806
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.1
>
>  Time Spent: 336h
>  Remaining Estimate: 0h
>
> This Jira tracks the work required to perform bootstrapping through Spark 
> data source API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-421) Cleanup bootstrap code and create PR for FileStystemView changes

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-421:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Cleanup bootstrap code and create PR for  FileStystemView changes
> -
>
> Key: HUDI-421
> URL: https://issues.apache.org/jira/browse/HUDI-421
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.1
>
>  Time Spent: 240h
>  Remaining Estimate: 0h
>
> FileSystemView needs changes to identify and handle bootstrap file slices. 
> Code changes are present in 
> [https://github.com/bvaradar/hudi/tree/vb_bootstrap] Needs cleanup before 
> they are ready to become PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-425) Implement support for bootstrapping in HoodieDeltaStreamer

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-425:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Implement support for bootstrapping in HoodieDeltaStreamer
> --
>
> Key: HUDI-425
> URL: https://issues.apache.org/jira/browse/HUDI-425
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: help-wanted
> Fix For: 0.6.1
>
>  Time Spent: 168h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1031) Document how to set job scheduling configs for Async compaction

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1031:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Document how to set job scheduling configs for Async compaction 
> 
>
> Key: HUDI-1031
> URL: https://issues.apache.org/jira/browse/HUDI-1031
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
>  
> In case of deltastreamer, Spark job scheduling configs are automatically set. 
> As the configs needs to be set before spark context is initiated, it is not 
> fully automated for Structured Streaming 
> [https://spark.apache.org/docs/latest/job-scheduling.html]
> We need to document how to set job scheduling configs for Spark Structured 
> streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-422) Cleanup bootstrap code and create write APIs for supporting bootstrap

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-422:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Cleanup bootstrap code and create write APIs for supporting bootstrap 
> --
>
> Key: HUDI-422
> URL: https://issues.apache.org/jira/browse/HUDI-422
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.1
>
>  Time Spent: 96h
>  Remaining Estimate: 0h
>
> Once refactor for HoodieWriteClient is done, we can cleanup and introduce 
> HoodieBootstrapClient as a separate PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-428) Web documentation for explaining how to bootstrap

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-428:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Web documentation for explaining how to bootstrap 
> --
>
> Key: HUDI-428
> URL: https://issues.apache.org/jira/browse/HUDI-428
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Need to provide examples (demo) to document bootstrapping



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-807) Spark DS Support for incremental queries for bootstrapped tables

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-807:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Spark DS Support for incremental queries for bootstrapped tables
> 
>
> Key: HUDI-807
> URL: https://issues.apache.org/jira/browse/HUDI-807
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.1
>
>  Time Spent: 120h
>  Remaining Estimate: 0h
>
> Investigate and figure out the changes required in Spark integration code to 
> make incremental queries work seamlessly for bootstrapped tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-423) Implement upsert functionality for handling updates to these bootstrap file slices

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-423:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Implement upsert functionality for handling updates to these bootstrap file 
> slices
> --
>
> Key: HUDI-423
> URL: https://issues.apache.org/jira/browse/HUDI-423
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core, Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.1
>
>  Time Spent: 168h
>  Remaining Estimate: 0h
>
> Needs support to handle upsert of these file-slices. For MOR tables, also 
> need compaction support. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-420) Automated end to end Integration Test

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-420:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Automated end to end Integration Test
> -
>
> Key: HUDI-420
> URL: https://issues.apache.org/jira/browse/HUDI-420
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.1
>
>  Time Spent: 72h
>  Remaining Estimate: 0h
>
> We need the end-to-end test in ITTestHoodieDemo to also cover bootstrapped 
> table cases.
> We can bootstrap a new table from the Hoodie table built in the demo and 
> ensure that queries work and return the same responses.
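
A rough sketch of the kind of parity check the test could run, assuming hypothetical paths for the demo table and its bootstrapped copy:

   // The same query against the original table and the bootstrapped copy should
   // return the same answer; a fuller check would also compare the data columns
   // while ignoring the _hoodie_* metadata columns.
   val original = spark.read.format("hudi").load("/tmp/hudi_demo_table")
   val bootstrapped = spark.read.format("hudi").load("/tmp/hudi_bootstrapped_table")
   assert(original.count() == bootstrapped.count())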



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-418) Bootstrap Index - Implementation

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-418:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Bootstrap Index - Implementation
> 
>
> Key: HUDI-418
> URL: https://issues.apache.org/jira/browse/HUDI-418
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> An implementation for 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+:+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi#RFC-12:EfficientMigrationofLargeParquetTablestoApacheHudi-BootstrapIndex:]
>  is present in 
> [https://github.com/bvaradar/hudi/blob/vb_bootstrap/hudi-common/src/main/java/org/apache/hudi/common/consolidated/CompositeMapFile.java]
>  
> We need to harden it with unit tests and clean it up. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-114.
--

> Allow for clients to overwrite the payload implementation in hoodie.properties
> --
>
> Key: HUDI-114
> URL: https://issues.apache.org/jira/browse/HUDI-114
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: newbie, Writer Core
>Reporter: Nishith Agarwal
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, once the payload class is set in hoodie.properties, it cannot be 
> changed. In some cases, if a code refactor is done and the jar is updated, 
> one may need to pass the new payload class name.
> Also, fix how the payload class name is picked up for the datasource API: by 
> default HoodieAvroPayload is written, whereas the datasource API defaults to 
> OverwriteWithLatestAvroPayload.
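
As a stop-gap at the datasource level, the payload class can be pinned explicitly on every write so the value recorded in hoodie.properties matches what the writer actually uses. A sketch, assuming the 0.6.x option key and a hypothetical inputDF and path:

   import org.apache.spark.sql.SaveMode

   // Pin the payload implementation explicitly (inputDF and the path are placeholders).
   inputDF.write.format("hudi").
     option("hoodie.table.name", "my_table").
     option("hoodie.datasource.write.payload.class",
       "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload").
     option("hoodie.datasource.write.operation", "upsert").
     mode(SaveMode.Append).
     save("/tmp/my_table")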



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-114:


> Allow for clients to overwrite the payload implementation in hoodie.properties
> --
>
> Key: HUDI-114
> URL: https://issues.apache.org/jira/browse/HUDI-114
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: newbie, Writer Core
>Reporter: Nishith Agarwal
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, once the payload class is set in hoodie.properties, it cannot be 
> changed. In some cases, if a code refactor is done and the jar is updated, 
> one may need to pass the new payload class name.
> Also, fix how the payload class name is picked up for the datasource API: by 
> default HoodieAvroPayload is written, whereas the datasource API defaults to 
> OverwriteWithLatestAvroPayload.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-114.

Resolution: Fixed

> Allow for clients to overwrite the payload implementation in hoodie.properties
> --
>
> Key: HUDI-114
> URL: https://issues.apache.org/jira/browse/HUDI-114
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: newbie, Writer Core
>Reporter: Nishith Agarwal
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, once the payload class is set in hoodie.properties, it cannot be 
> changed. In some cases, if a code refactor is done and the jar is updated, 
> one may need to pass the new payload class name.
> Also, fix how the payload class name is picked up for the datasource API: by 
> default HoodieAvroPayload is written, whereas the datasource API defaults to 
> OverwriteWithLatestAvroPayload.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-590) Cut a new Doc version 0.5.1 explicitly

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-590:


> Cut a new Doc version 0.5.1 explicitly
> --
>
> Key: HUDI-590
> URL: https://issues.apache.org/jira/browse/HUDI-590
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs, Release & Administrative
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The latest version of the docs needs to be tagged as 0.5.1 explicitly on the 
> site. Follow the instructions in 
> [https://github.com/apache/incubator-hudi/blob/asf-site/README.md#updating-site]
>  to create a new dir 0.5.1 under docs/_docs/.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-590) Cut a new Doc version 0.5.1 explicitly

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-590.

Resolution: Fixed

> Cut a new Doc version 0.5.1 explicitly
> --
>
> Key: HUDI-590
> URL: https://issues.apache.org/jira/browse/HUDI-590
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs, Release & Administrative
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The latest version of the docs needs to be tagged as 0.5.1 explicitly on the 
> site. Follow the instructions in 
> [https://github.com/apache/incubator-hudi/blob/asf-site/README.md#updating-site]
>  to create a new dir 0.5.1 under docs/_docs/.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-624) Split some of the code from PR for HUDI-479

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-624.
--

> Split some of the code from PR for HUDI-479 
> 
>
> Key: HUDI-624
> URL: https://issues.apache.org/jira/browse/HUDI-624
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: patch, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This Jira is to reduce the size of the changes in PR #1159 for HUDI-479, 
> making it easier to review.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   3   >