[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer

2020-03-02 Thread GitBox
pratyakshsharma commented on a change in pull request #1362: HUDI-644 Enable 
user to get checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#discussion_r386845759
 
 

 ##
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
 ##
 @@ -529,6 +514,47 @@ private void registerAvroSchemas(SchemaProvider schemaProvider) {
     }
   }
 
+  /**
+   * Search delta streamer checkpoint from the previous commits.
+   *
+   * @param commitTimelineOpt HoodieTimeline object
+   * @return checkpoint metadata as String
+   */
+  private Option<String> retrieveCheckpointFromCommits(Option<HoodieTimeline> commitTimelineOpt) throws Exception {
+    Option<HoodieInstant> lastCommit = commitTimelineOpt.get().lastInstant();
+    if (lastCommit.isPresent()) {
+      HoodieCommitMetadata commitMetadata = HoodieCommitMetadata
+          .fromBytes(commitTimelineOpt.get().getInstantDetails(lastCommit.get()).get(), HoodieCommitMetadata.class);
+      // a user-defined checkpoint appeared and is not equal to the user-defined checkpoint of the last commit
+      if (cfg.checkpoint != null && !cfg.checkpoint.equals(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))) {
+        return Option.of(cfg.checkpoint);
+      }
+      int commitsToCheckCnt;
+      // search the checkpoint from previous commits, going backward
+      if (cfg.searchCheckpoint) {
+        commitsToCheckCnt = commitTimelineOpt.get().countInstants();
+      } else {
+        commitsToCheckCnt = 1; // only check the last commit
+      }
+      Option<HoodieInstant> curCommit;
+      for (int i = 0; i < commitsToCheckCnt; ++i) {
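
For orientation while reading the truncated hunk above, here is a hypothetical sketch (not the PR's code) of how the backward search over `commitsToCheckCnt` commits could continue; it assumes `HoodieTimeline#nthFromLastInstant(int)` is available and takes the metadata key as a parameter instead of relying on DeltaSync's constants:

```java
// Hypothetical sketch only: walk the timeline backward and return the first
// (i.e. most recent) commit that carries a checkpoint under the given key.
import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.table.HoodieTimeline;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.util.Option;

class CheckpointSearchSketch {
  static Option<String> searchBackward(HoodieTimeline timeline, int commitsToCheckCnt,
      String checkpointKey) throws java.io.IOException {
    for (int i = 0; i < commitsToCheckCnt; ++i) {
      Option<HoodieInstant> curCommit = timeline.nthFromLastInstant(i);
      if (!curCommit.isPresent()) {
        break; // fewer commits on the timeline than requested
      }
      HoodieCommitMetadata metadata = HoodieCommitMetadata
          .fromBytes(timeline.getInstantDetails(curCommit.get()).get(), HoodieCommitMetadata.class);
      String checkpoint = metadata.getMetadata(checkpointKey);
      if (checkpoint != null && !checkpoint.isEmpty()) {
        return Option.of(checkpoint);
      }
    }
    return Option.empty();
  }
}
```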
 
 Review comment:
   My bad. :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer

2020-03-02 Thread GitBox
pratyakshsharma commented on issue #1362: HUDI-644 Enable user to get 
checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-593813910
 
 
   @garyli1019 I still feel all these challenges are arising because you are 
trying to ingest data into the same dataset using 2 different Spark jobs. A few 
questions - 
   
   1. If the Kafka cluster retention time is too long, have you tried using 
Hudi's BULK_INSERT mode? If not, you can tune the Spark and Hudi parameters 
to increase the source limit and then ingest the data. Alternatively, you can 
try running DeltaStreamer in continuous mode. 
   2. Also, I would like to know the reason behind switching every time from 
homebrew Spark to Hudi. Are you doing some POC on Hudi? Why don't you simply 
use DeltaStreamer and never switch to the other data source? Data loss will 
not happen if you simply rely on one of the data sources :) 
   
   I am a bit skeptical of trying to use 2 pipelines to write to the same 
destination path. Additionally, we have options available for taking a backup of 
your Hudi dataset or for migrating an existing dataset to Hudi. Anyway, if you 
strongly feel the need to write this checkPointGenerator, let us hear the 
opinion of @leesf and @vinothchandar as well on this before proceeding. 




[jira] [Updated] (HUDI-553) Building/Running Hudi on higher java versions

2020-03-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-553:

Labels: pull-request-available  (was: )

> Building/Running Hudi on higher java versions
> -
>
> Key: HUDI-553
> URL: https://issues.apache.org/jira/browse/HUDI-553
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/apache/incubator-hudi/issues/1235] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1369: [HUDI-553] Building/Running Hudi on higher java versions

2020-03-02 Thread GitBox
lamber-ken opened a new pull request #1369: [HUDI-553] Building/Running Hudi on 
higher java versions
URL: https://github.com/apache/incubator-hudi/pull/1369
 
 
   ## What is the purpose of the pull request
   
   Support building/running Hudi on higher java versions (8 & 11).
   
   ## Brief change log
   1. `package javax.xml.bind does not exist` (see the Base64 sketch after this list)
      - sun.misc.BASE64Encoder/Decoder (JDK 6, 7, 8)
      - javax.xml.bind.DatatypeConverter (JDK 6, 7, 8)
      - java.util.Base64.{Encoder, Decoder} (JDK 8, 9, 11)

   2. Need to upgrade to `scala-2.11.12`: https://github.com/sbt/zinc/issues/641
   ```
   [ERROR] error: scala.reflect.internal.MissingRequirementError: object java.lang.Object in compiler mirror not found.
   [ERROR]   at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:17)
   [ERROR]   at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:18)
   [ERROR]   at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:53)
   ```
   
   3. `java.lang.NoClassDefFoundError: javax/tools/ToolProvider` 
   https://github.com/davidB/scala-maven-plugin/issues/217
   ```
   [ERROR] error: java.lang.NoClassDefFoundError: javax/tools/ToolProvider
   [INFO]  at scala.reflect.io.JavaToolsPlatformArchive.iterator(ZipArchive.scala:301)
   [INFO]  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
   [INFO]  at scala.reflect.io.AbstractFile.foreach(AbstractFile.scala:92)
   [INFO]  at scala.tools.nsc.util.DirectoryClassPath.traverse(ClassPath.scala:277)
   [INFO]  at scala.tools.nsc.util.DirectoryClassPath.x$15$lzycompute(ClassPath.scala:299)
   [INFO]  at scala.tools.nsc.util.DirectoryClassPath.x$15(ClassPath.scala:299)
   ```
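   
   For reference, a minimal standalone sketch (not part of this PR) of the JDK 8+ replacement that item 1 refers to - `java.util.Base64` instead of `sun.misc.BASE64Encoder/Decoder` and `javax.xml.bind.DatatypeConverter`, which are unavailable on newer JDKs:
   
   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.Base64;
   
   public class Base64Migration {
     public static void main(String[] args) {
       byte[] payload = "hoodie".getBytes(StandardCharsets.UTF_8);
   
       // Encoding: replaces sun.misc.BASE64Encoder#encode and
       // DatatypeConverter.printBase64Binary
       String encoded = Base64.getEncoder().encodeToString(payload);
   
       // Decoding: replaces sun.misc.BASE64Decoder#decodeBuffer and
       // DatatypeConverter.parseBase64Binary
       byte[] decoded = Base64.getDecoder().decode(encoded);
   
       System.out.println(encoded + " -> " + new String(decoded, StandardCharsets.UTF_8));
     }
   }
   ```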
   
   ## Verify this pull request
   
   `Spark-2.4.4` doesn't support JDK 9: https://issues.apache.org/jira/browse/SPARK-24417
   
   Using `spark-3.0.0-preview2-bin-hadoop2.7` and the Hudi Scala 2.12 build (`-Dscala-2.12`):
   ```
   mvn clean package -DskipTests -Dscala-2.12
   ```
   ```
   /bin/spark-shell \
 --packages org.apache.spark:spark-avro_2.12:2.4.4 \
 --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [X] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Resolved] (HUDI-630) Add Impala support in querying page

2020-03-02 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-630.

Resolution: Fixed

> Add Impala support in querying page 
> 
>
> Key: HUDI-630
> URL: https://issues.apache.org/jira/browse/HUDI-630
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, docs-chinese
>Reporter: Bhavani Sudha
>Assignee: Yanjia Gary Li
>Priority: Minor
>
> After next release of Impala (that supports Hudi) Hudi docs querying data 
> page needs to be updated. 





[jira] [Updated] (HUDI-630) Add Impala support in querying page

2020-03-02 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-630:
---
Status: In Progress  (was: Open)

> Add Impala support in querying page 
> 
>
> Key: HUDI-630
> URL: https://issues.apache.org/jira/browse/HUDI-630
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, docs-chinese
>Reporter: Bhavani Sudha
>Assignee: Yanjia Gary Li
>Priority: Minor
>
> After next release of Impala (that supports Hudi) Hudi docs querying data 
> page needs to be updated. 





[jira] [Commented] (HUDI-630) Add Impala support in querying page

2020-03-02 Thread Bhavani Sudha (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049913#comment-17049913
 ] 

Bhavani Sudha commented on HUDI-630:


This is taken care of by the PRs 
[https://github.com/apache/incubator-hudi/pull/1349] and 
https://github.com/apache/incubator-hudi/pull/1367

> Add Impala support in querying page 
> 
>
> Key: HUDI-630
> URL: https://issues.apache.org/jira/browse/HUDI-630
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, docs-chinese
>Reporter: Bhavani Sudha
>Assignee: Yanjia Gary Li
>Priority: Minor
>
> After next release of Impala (that supports Hudi) Hudi docs querying data 
> page needs to be updated. 





[jira] [Assigned] (HUDI-630) Add Impala support in querying page

2020-03-02 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-630:
--

Assignee: Yanjia Gary Li  (was: Bhavani Sudha)

> Add Impala support in querying page 
> 
>
> Key: HUDI-630
> URL: https://issues.apache.org/jira/browse/HUDI-630
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, docs-chinese
>Reporter: Bhavani Sudha
>Assignee: Yanjia Gary Li
>Priority: Minor
>
> After next release of Impala (that supports Hudi) Hudi docs querying data 
> page needs to be updated. 





[jira] [Resolved] (HUDI-589) Fix references to Views in some of the pages. Replace with Query instead

2020-03-02 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-589.

Fix Version/s: (was: 0.5.2)
   Resolution: Fixed

> Fix references to Views in some of the pages. Replace with Query instead
> 
>
> Key: HUDI-589
> URL: https://issues.apache.org/jira/browse/HUDI-589
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> querying_data.html still has some references to 'views'. This needs to be 
> replaced with 'queries'/'query types' appropriately.
>  
>  





[jira] [Updated] (HUDI-589) Fix references to Views in some of the pages. Replace with Query instead

2020-03-02 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-589:
---
Status: In Progress  (was: Open)

> Fix references to Views in some of the pages. Replace with Query instead
> 
>
> Key: HUDI-589
> URL: https://issues.apache.org/jira/browse/HUDI-589
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> querying_data.html still has some references to 'views'. This needs to be 
> replaced with 'queries'/'query types' appropriately.
>  
>  





[GitHub] [incubator-hudi] garyli1019 commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer

2020-03-02 Thread GitBox
garyli1019 commented on issue #1362: HUDI-644 Enable user to get checkpoint 
from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-593770649
 
 
   I think running the parallel jobs once sounds a little bit hacky. The best 
way would be to generate the checkpoint string and pass it to the delta 
streamer in the first run. In this approach, I will need to write a checkpoint 
generator to scan all the files generated by Kafka Connect. This is definitely 
doable but needs some effort. 
   So I think we can do the following to help users migrate to the delta streamer:
   - checkPointGenerator helper functions that help users generate the checkpoint 
from popular sink connectors (Kafka Connect, Spark Streaming, etc.) - a rough sketch follows below
   - Allow the user to commit without using the delta streamer, to fix the gap if 
the checkpoint is difficult to generate.
   
   Any thoughts? 
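   
   For illustration only, a rough sketch of the checkPointGenerator idea, assuming the delta streamer's Kafka checkpoint uses a `topic,partition:offset,partition:offset,...` string layout (please verify the exact format expected by your Hudi version); the offsets themselves would come from scanning the files written by the sink connector:
   
   ```java
   import java.util.Map;
   import java.util.TreeMap;
   import java.util.stream.Collectors;
   
   public class KafkaCheckpointStringSketch {
   
     // Assemble a checkpoint string such as "impressions,0:1234,1:987".
     static String toCheckpoint(String topic, Map<Integer, Long> lastWrittenOffsets) {
       return topic + "," + new TreeMap<>(lastWrittenOffsets).entrySet().stream()
           .map(e -> e.getKey() + ":" + e.getValue())
           .collect(Collectors.joining(","));
     }
   
     public static void main(String[] args) {
       Map<Integer, Long> offsets = new TreeMap<>();
       offsets.put(0, 1234L); // hypothetical last written offset of partition 0
       offsets.put(1, 987L);  // hypothetical last written offset of partition 1
       System.out.println(toCheckpoint("impressions", offsets));
     }
   }
   ```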




[incubator-hudi] branch asf-site updated: MINOR add impala to matrix

2020-03-02 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 8ecbcd3  MINOR add impala to matrix
8ecbcd3 is described below

commit 8ecbcd3d55acc4953ead2729b98d032d40b5292d
Author: garyli1019 
AuthorDate: Mon Mar 2 20:48:16 2020 -0800

MINOR add impala to matrix
---
 docs/_docs/2_3_querying_data.md | 4 
 1 file changed, 4 insertions(+)

diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index c4ab865..875b7f0 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -41,6 +41,8 @@ Following tables show whether a given query is supported on specific query engin
 |**Spark SQL**|Y|Y|
 |**Spark Datasource**|Y|Y|
 |**Presto**|Y|N|
+|**Impala**|Y|N|
+
 
 Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
 
@@ -52,6 +54,8 @@ Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
 |**Spark SQL**|Y|Y|Y|
 |**Spark Datasource**|N|N|Y|
 |**Presto**|N|N|Y|
+|**Impala**|N|N|N|
+
 
 In sections, below we will discuss specific setup to access different query types from different query engines. 
 



[GitHub] [incubator-hudi] bhasudha merged pull request #1367: MINOR add impala to matrix

2020-03-02 Thread GitBox
bhasudha merged pull request #1367: MINOR add impala to matrix
URL: https://github.com/apache/incubator-hudi/pull/1367
 
 
   




[jira] [Commented] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file

2020-03-02 Thread Bhavani Sudha (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049910#comment-17049910
 ] 

Bhavani Sudha commented on HUDI-305:


@[~309637554] thanks for the interest in this thread. 

[~bdscheller] is working on a fix. 

Regarding the path filter support, it's already added in the latest Presto master 
and is available in [https://github.com/prestodb/presto/pull/13818]. Please 
check and let me know if that helps.

> Presto MOR "_rt" queries only reads base parquet file 
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Presto Integration
> Environment: On AWS EMR
>Reporter: Brandon Scheller
>Assignee: Bhavani Sudha Saktheeswaran
>Priority: Major
>
> Code example to reproduce.
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_events_mor_1"
> var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update a record with event_name "event_name_123" => "event_name_changed"
> val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
> val df2 = df1.filter($"event_id" === "104")
> val df3 = df2.withColumn("event_name", lit("event_name_changed"))
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option("hoodie.compact.inline", "false")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Now when querying the real-time table from Hive, we have no issue seeing the 
> updated value:
> {code:java}
> hive> select event_name from hudi_events_mor_1_rt;
> OK
> event_name_900
> event_name_changed
> event_name_546
> event_name_678
> Time taken: 0.103 seconds, Fetched: 4 row(s)
> {code}
> But when querying the real-time table from Presto, we only read the base 
> parquet file and do not see the update that should be merged in from the log 
> file.
> {code:java}
> presto:default> select event_name from hudi_events_mor_1_rt;
>event_name
> 
>  event_name_900
>  event_name_123
>  event_name_546
>  event_name_678
> (4 rows)
> {code}
> Our current understanding of this issue is that while the 
> HoodieParquetRealtimeInputFormat correctly generates the splits, the 
> RealtimeCompactedRecordReader record reader is not used, so it is not reading 
> the log file and is only reading the base parquet file.
>  


[GitHub] [incubator-hudi] satishkotha commented on issue #1368: [HUDI-650] Modify handleUpdate path to validate partitionPath

2020-03-02 Thread GitBox
satishkotha commented on issue #1368: [HUDI-650] Modify handleUpdate path to 
validate partitionPath
URL: https://github.com/apache/incubator-hudi/pull/1368#issuecomment-593767854
 
 
   @bvaradar I briefly talked to you about this earlier today. Please take a 
look and let me know what you think. 




[jira] [Updated] (HUDI-650) HoodieTable handleUpdate to include partition paths

2020-03-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-650:

Labels: pull-request-available  (was: )

> HoodieTable handleUpdate to include partition paths
> ---
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
>
> HoodieTable handleUpdate takes in a fileId and a list of records. It does not 
> validate that all records belong to the same partitionPath. This is error-prone - 
> there is already at least one test that passes records belonging to 
> several partitions to this method. Fix it to take partitionPath and also 
> validate that all records belong to the same partition path.
> This will be helpful for adding 'insert overwrite' functionality as well.





[GitHub] [incubator-hudi] satishkotha opened a new pull request #1368: [HUDI-650] Modify handleUpdate path to validate partitionPath

2020-03-02 Thread GitBox
satishkotha opened a new pull request #1368: [HUDI-650] Modify handleUpdate 
path to validate partitionPath
URL: https://github.com/apache/incubator-hudi/pull/1368
 
 
   ## What is the purpose of the pull request
   
   HoodieTable handleUpdate takes in a fileId and a list of records. It does not 
validate that all records belong to the same partitionPath. This is error-prone - there 
is already at least one test that passes records belonging to several 
partitions to this method. The fix is to add partitionPath and also validate that all 
records belong to the same partition path. 
I'm not entirely sure this change is needed, though. I think it's cleaner to 
include the partition path everywhere, and it is a good safeguard, but maybe it's 
unlikely this can happen in production. Sending it out since I already spent 
some time on it. I can discard it if others think this is unnecessary.
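
   A minimal illustrative sketch of the kind of guard described above (not the PR's actual diff), using a hypothetical helper name `PartitionPathGuard` and Hudi's `HoodieRecord#getPartitionPath` and `HoodieUpsertException`:

   ```java
   import java.util.Iterator;
   
   import org.apache.hudi.common.model.HoodieRecord;
   import org.apache.hudi.common.model.HoodieRecordPayload;
   import org.apache.hudi.exception.HoodieUpsertException;
   
   final class PartitionPathGuard {
   
     // Fail fast if any record routed to this file group lives in a different partition.
     static <T extends HoodieRecordPayload> void validate(
         String expectedPartitionPath, String fileId, Iterator<HoodieRecord<T>> records) {
       while (records.hasNext()) {
         HoodieRecord<T> record = records.next();
         if (!expectedPartitionPath.equals(record.getPartitionPath())) {
           throw new HoodieUpsertException("Record for partition " + record.getPartitionPath()
               + " was routed to fileId " + fileId + " in partition " + expectedPartitionPath);
         }
       }
     }
   }
   ```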
   
   ## Brief change log
   - Do not assume first record partition can be used for all remaining records
   - Track partition path in Bucket and in Partitioner
   - Fail updates if they somehow end up in wrong partition
   - Fix broken test to validate requests fail
   - There is one other test commented out (looks like by mistake). Bring it 
back
   
   ## Verify this pull request
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   TestCopyOnWriteTable (especially testInsertUpsertWithHoodieAvroPayload)
   TestMergeOnReadTable
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] garyli1019 commented on issue #1367: MINOR add impala to matrix

2020-03-02 Thread GitBox
garyli1019 commented on issue #1367: MINOR add impala to matrix
URL: https://github.com/apache/incubator-hudi/pull/1367#issuecomment-593767316
 
 
   @bhasudha Please review. Thanks!




[GitHub] [incubator-hudi] garyli1019 opened a new pull request #1367: MINOR add impala to matrix

2020-03-02 Thread GitBox
garyli1019 opened a new pull request #1367: MINOR add impala to matrix
URL: https://github.com/apache/incubator-hudi/pull/1367
 
 
   I am not sure if Read Optimized queries will work if the table type is MOR. 
Will double-check once I get the chance. 
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] xushiyan edited a comment on issue #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
xushiyan edited a comment on issue #1360: [HUDI-344][RFC-09] Hudi Dataset 
Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#issuecomment-593753301
 
 
   @OpenOpened Thanks for the changes. I saw there are issues with the existing 
Copier tests which need some further refactoring and fixing. 
   
   I'd like to set a realistic criterion for merging this PR: we don't have 
to make the utility perfect in the first round, so I'd give +1 as long as the 
functionality works in a basic manner (output can be generated) and we put an 
`@experimental` note in the class docs without `@deprecate`-ing the original Copier. 
   Please kindly share your input. @leesf @vinothchandar thanks




[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386788061
 
 

 ##
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.TableFileSystemView;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameWriter;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.execution.datasources.DataSource;
+
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Export the latest records of Hudi dataset to a set of external files (e.g., plain parquet files).
+ */
+
+public class HoodieSnapshotExporter {
+  private static final Logger LOG = LogManager.getLogger(HoodieSnapshotExporter.class);
+
+  public static class Config implements Serializable {
+    @Parameter(names = {"--source-base-path"}, description = "Base path for the source Hudi dataset to be snapshotted", required = true)
+    String sourceBasePath = null;
+
+    @Parameter(names = {"--target-base-path"}, description = "Base path for the target output files (snapshots)", required = true)
+    String targetOutputPath = null;
+
+    @Parameter(names = {"--snapshot-prefix"}, description = "Snapshot prefix or directory under the target base path in order to segregate different snapshots")
+    String snapshotPrefix;
+
+    @Parameter(names = {"--output-format"}, description = "e.g. Hudi or Parquet", required = true)
+    String outputFormat;
+
+    @Parameter(names = {"--output-partition-field"}, description = "A field to be used by Spark repartitioning")
+    String outputPartitionField;
+  }
+
+  public int export(SparkSession spark, Config cfg) throws IOException {
+    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+    FileSystem fs = FSUtils.getFs(cfg.sourceBasePath, jsc.hadoopConfiguration());
+
+    final SerializableConfiguration serConf = new SerializableConfiguration(jsc.hadoopConfiguration());
+    final HoodieTableMetaClient tableMetadata = new HoodieTableMetaClient(fs.getConf(), cfg.sourceBasePath);
+    final TableFileSystemView.BaseFileOnlyView fsView = new HoodieTableFileSystemView(tableMetadata,
+        tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
+    // Get the latest commit
+    Option<HoodieInstant> latestCommit =
+        tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+    if (!latestCommit.isPresent()) {
+      LOG.error("No commits present. Nothing to snapshot");
+      return -1;
+    }
+    final String latestCommitTimestamp = latestCommit.get().getTimestamp();
+    LOG.info(String.format("Starting to snapshot latest version files which are also no-late-than %s.",
+        latestCommitTimestamp));
+
+

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386786787
 
 

 ##
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.TableFileSystemView;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameWriter;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.execution.datasources.DataSource;
+
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Export the latest records of Hudi dataset to a set of external files (e.g., plain parquet files).
+ */
+
+public class HoodieSnapshotExporter {
+  private static final Logger LOG = LogManager.getLogger(HoodieSnapshotExporter.class);
+
+  public static class Config implements Serializable {
+    @Parameter(names = {"--source-base-path"}, description = "Base path for the source Hudi dataset to be snapshotted", required = true)
+    String sourceBasePath = null;
+
+    @Parameter(names = {"--target-base-path"}, description = "Base path for the target output files (snapshots)", required = true)
+    String targetOutputPath = null;
+
+    @Parameter(names = {"--snapshot-prefix"}, description = "Snapshot prefix or directory under the target base path in order to segregate different snapshots")
+    String snapshotPrefix;
+
+    @Parameter(names = {"--output-format"}, description = "e.g. Hudi or Parquet", required = true)
+    String outputFormat;
+
+    @Parameter(names = {"--output-partition-field"}, description = "A field to be used by Spark repartitioning")
+    String outputPartitionField;
+  }
+
+  public int export(SparkSession spark, Config cfg) throws IOException {
+    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+    FileSystem fs = FSUtils.getFs(cfg.sourceBasePath, jsc.hadoopConfiguration());
+
+    final SerializableConfiguration serConf = new SerializableConfiguration(jsc.hadoopConfiguration());
+    final HoodieTableMetaClient tableMetadata = new HoodieTableMetaClient(fs.getConf(), cfg.sourceBasePath);
+    final TableFileSystemView.BaseFileOnlyView fsView = new HoodieTableFileSystemView(tableMetadata,
+        tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
+    // Get the latest commit
+    Option<HoodieInstant> latestCommit =
+        tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+    if (!latestCommit.isPresent()) {
+      LOG.error("No commits present. Nothing to snapshot");
+      return -1;
+    }
+    final String latestCommitTimestamp = latestCommit.get().getTimestamp();
+    LOG.info(String.format("Starting to snapshot latest version files which are also no-late-than %s.",
+        latestCommitTimestamp));
+
+

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386786763
 
 

 ##
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.TableFileSystemView;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameWriter;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.execution.datasources.DataSource;
+
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Export the latest records of Hudi dataset to a set of external files (e.g., plain parquet files).
+ */
+
+public class HoodieSnapshotExporter {
+  private static final Logger LOG = LogManager.getLogger(HoodieSnapshotExporter.class);
+
+  public static class Config implements Serializable {
+    @Parameter(names = {"--source-base-path"}, description = "Base path for the source Hudi dataset to be snapshotted", required = true)
+    String sourceBasePath = null;
+
+    @Parameter(names = {"--target-base-path"}, description = "Base path for the target output files (snapshots)", required = true)
+    String targetOutputPath = null;
+
+    @Parameter(names = {"--snapshot-prefix"}, description = "Snapshot prefix or directory under the target base path in order to segregate different snapshots")
+    String snapshotPrefix;
+
+    @Parameter(names = {"--output-format"}, description = "e.g. Hudi or Parquet", required = true)
+    String outputFormat;
+
+    @Parameter(names = {"--output-partition-field"}, description = "A field to be used by Spark repartitioning")
+    String outputPartitionField;
+  }
+
+  public int export(SparkSession spark, Config cfg) throws IOException {
+    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+    FileSystem fs = FSUtils.getFs(cfg.sourceBasePath, jsc.hadoopConfiguration());
+
+    final SerializableConfiguration serConf = new SerializableConfiguration(jsc.hadoopConfiguration());
+    final HoodieTableMetaClient tableMetadata = new HoodieTableMetaClient(fs.getConf(), cfg.sourceBasePath);
+    final TableFileSystemView.BaseFileOnlyView fsView = new HoodieTableFileSystemView(tableMetadata,
+        tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
+    // Get the latest commit
+    Option<HoodieInstant> latestCommit =
+        tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+    if (!latestCommit.isPresent()) {
+      LOG.error("No commits present. Nothing to snapshot");
+      return -1;
+    }
+    final String latestCommitTimestamp = latestCommit.get().getTimestamp();
+    LOG.info(String.format("Starting to snapshot latest version files which are also no-late-than %s.",
+        latestCommitTimestamp));
+
+

[GitHub] [incubator-hudi] xushiyan commented on issue #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
xushiyan commented on issue #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot 
Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#issuecomment-593753301
 
 
   @OpenOpened Thanks for the changes. I saw there are issues with the existing 
Copier tests which need some further refactoring and fixing. 
   
   I'd like to set a realistic criterion for merging this PR: we don't have 
to make the utility perfect in the first round, so I'd give +1 as long as the 
functionality works in a basic manner: output can be generated, no worries about 
perf, and fully covered by unit tests. In addition, put an `@experimental` note 
in the class docs and do not `@deprecate` the original Copier. Please kindly 
share your input. @leesf @vinothchandar thanks




[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386783660
 
 

 ##
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.TableFileSystemView;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Export the latest records of Hudi dataset to a set of external files (e.g., plain parquet files).
+ */
+
+public class HoodieSnapshotExporter {
+  private static final Logger LOG = LogManager.getLogger(HoodieSnapshotExporter.class);
+
+  public static class Config implements Serializable {
+    @Parameter(names = {"--source-base-path", "-sbp"}, description = "Base path for the source Hudi dataset to be snapshotted", required = true)
+    String basePath = null;
+
+    @Parameter(names = {"--target-base-path", "-tbp"}, description = "Base path for the target output files (snapshots)", required = true)
+    String outputPath = null;
+
+    @Parameter(names = {"--snapshot-prefix", "-sp"}, description = "Snapshot prefix or directory under the target base path in order to segregate different snapshots")
+    String snapshotPrefix;
+
+    @Parameter(names = {"--output-format", "-of"}, description = "e.g. Hudi or Parquet", required = true)
+    String outputFormat;
+
+    @Parameter(names = {"--output-partition-field", "-opf"}, description = "A field to be used by Spark repartitioning")
+    String outputPartitionField;
+  }
+
+  public void export(SparkSession spark, Config cfg) throws IOException {
+    String sourceBasePath = cfg.basePath;
+    String targetBasePath = cfg.outputPath;
+    String snapshotPrefix = cfg.snapshotPrefix;
+    String outputFormat = cfg.outputFormat;
+    String outputPartitionField = cfg.outputPartitionField;
+    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+    FileSystem fs = FSUtils.getFs(sourceBasePath, jsc.hadoopConfiguration());
+
+    final SerializableConfiguration serConf = new SerializableConfiguration(jsc.hadoopConfiguration());
+    final HoodieTableMetaClient tableMetadata = new HoodieTableMetaClient(fs.getConf(), sourceBasePath);
+    final TableFileSystemView.BaseFileOnlyView fsView = new HoodieTableFileSystemView(tableMetadata,
+        tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
+    // Get the latest commit
+    Option<HoodieInstant> latestCommit =
+        tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+    if (!latestCommit.isPresent()) {
+      LOG.warn("No commits present. Nothing to snapshot");
+      return;
+    }
+    final String latestCommitTimestamp = latestCommit.get().getTimestamp();
+    LOG.info(String.format("Starting to snapshot latest version files which are also 
Build failed in Jenkins: hudi-snapshot-deployment-0.5 #205

2020-03-02 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.36 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
xushiyan commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386779367
 
 

 ##
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.TableFileSystemView;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameWriter;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Export the latest records of Hudi dataset to a set of external files (e.g., 
plain parquet files).
+ */
+
+public class HoodieSnapshotExporter {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSnapshotExporter.class);
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--source-base-path"}, description = "Base path for 
the source Hudi dataset to be snapshotted", required = true)
+String sourceBasePath = null;
+
+@Parameter(names = {"--target-base-path"}, description = "Base path for 
the target output files (snapshots)", required = true)
+String targetOutputPath = null;
+
+@Parameter(names = {"--snapshot-prefix"}, description = "Snapshot prefix 
or directory under the target base path in order to segregate different 
snapshots")
+String snapshotPrefix;
+
+@Parameter(names = {"--output-format"}, description = "e.g. Hudi or 
Parquet", required = true)
+String outputFormat;
+
+@Parameter(names = {"--output-partition-field"}, description = "A field to 
be used by Spark repartitioning")
+String outputPartitionField;
+  }
+
+  public void export(SparkSession spark, Config cfg) throws IOException {
+JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+FileSystem fs = FSUtils.getFs(cfg.sourceBasePath, 
jsc.hadoopConfiguration());
+
+final SerializableConfiguration serConf = new 
SerializableConfiguration(jsc.hadoopConfiguration());
+final HoodieTableMetaClient tableMetadata = new 
HoodieTableMetaClient(fs.getConf(), cfg.sourceBasePath);
+final TableFileSystemView.BaseFileOnlyView fsView = new 
HoodieTableFileSystemView(tableMetadata,
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
+// Get the latest commit
+Option<HoodieInstant> latestCommit =
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+if (!latestCommit.isPresent()) {
+  LOG.warn("No commits present. Nothing to snapshot");
+  return;
+}
+final String latestCommitTimestamp = latestCommit.get().getTimestamp();
+LOG.info(String.format("Starting to snapshot latest version files which 
are also no-late-than %s.",
+latestCommitTimestamp));
+
+List<String> partitions = FSUtils.getAllPartitionPaths(fs, 
cfg.sourceBasePath, false);
+if (partitions.size() > 0) {
+  List dataFiles = new ArrayList<>();
+
+  if 

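A hedged sketch of how the `Config` shown above might be driven programmatically; the driver class, the local Spark master and the paths are illustrative assumptions, not part of the PR:

```
// Illustrative driver only. Assumes it lives in the org.apache.hudi.utilities
// package so the package-private Config fields shown above can be set directly;
// all paths are placeholders.
package org.apache.hudi.utilities;

import org.apache.spark.sql.SparkSession;

public class SnapshotExportDriver {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-snapshot-exporter")
        .master("local[2]")                                           // assumption: local run
        .getOrCreate();
    HoodieSnapshotExporter.Config cfg = new HoodieSnapshotExporter.Config();
    cfg.sourceBasePath = "hdfs://namenode:8020/tables/hudi_trips";    // placeholder
    cfg.targetOutputPath = "hdfs://namenode:8020/exports/hudi_trips"; // placeholder
    cfg.outputFormat = "parquet";
    new HoodieSnapshotExporter().export(spark, cfg);
    spark.stop();
  }
}
```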
[jira] [Commented] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file

2020-03-02 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049846#comment-17049846
 ] 

liwei commented on HUDI-305:


Hello [~bhasudha],

We also encountered this problem. Is there any progress on this issue?

Also, bhasudha has done "Support for PathFilter in DirectoryLister" in Presto; can we fix this issue in Presto?

[https://github.com/prestodb/presto/issues/13511]

 

> Presto MOR "_rt" queries only reads base parquet file 
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Presto Integration
> Environment: On AWS EMR
>Reporter: Brandon Scheller
>Assignee: Bhavani Sudha Saktheeswaran
>Priority: Major
>
> Code example to reproduce.
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_events_mor_1"
> var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update a record with event_name "event_name_123" => "event_name_changed"
> val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
> val df2 = df1.filter($"event_id" === "104")
> val df3 = df2.withColumn("event_name", lit("event_name_changed"))
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option("hoodie.compact.inline", "false")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Now when querying the real-time table from Hive, we have no issue seeing the 
> updated value:
> {code:java}
> hive> select event_name from hudi_events_mor_1_rt;
> OK
> event_name_900
> event_name_changed
> event_name_546
> event_name_678
> Time taken: 0.103 seconds, Fetched: 4 row(s)
> {code}
> But when querying the real-time table from Presto, we only read the base 
> parquet file and do not see the update that should be merged in from the log 
> file.
> {code:java}
> presto:default> select event_name from hudi_events_mor_1_rt;
>event_name
> 
>  event_name_900
>  event_name_123
>  event_name_546
>  event_name_678
> (4 rows)
> {code}
> Our current understanding of this issue is that while the 
> HoodieParquetRealtimeInputFormat correctly generates the splits, the 
> RealtimeCompactedRecordReader record reader is not used, so the log file is not 
> read and only the base parquet file is read.
>  



--
This message was sent by Atlassian 

[GitHub] [incubator-hudi] bwu2 edited a comment on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-03-02 Thread GitBox
bwu2 edited a comment on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-593578500
 
 
   I'm not sure if this is a separate issue or not (it seems very similar but not identical), so I am leaving a comment here. Somehow, we have an empty `.clean` file (the commit itself was successful). This has caused:
   
   `20/03/02 18:17:56 ERROR HoodieCommitArchiveLog: Failed to archive commits, .commit file: 20200221211054.clean
   java.io.IOException: Not an Avro data file`
   
   Note this file was created by a production release (0.5.0) of Hudi. We updated our Hudi version to 0.5.1 (and 0.5.2) and it is still not able to handle it. I guess we can delete the offending 0-byte `.clean` file and everything will work.
   
   Yes, deleting that file worked. I'm not sure whether the `.clean` files are supposed to be archived along with the `.commit` files (they are not, at the moment)?
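   For anyone hitting the same archive failure, a minimal sketch (plain Hadoop FileSystem API, not an official Hudi utility) that lists zero-byte `.clean` files under the table's `.hoodie` folder so they can be inspected before deletion; the base-path argument is a placeholder:
   
   ```
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   public class FindEmptyCleanFiles {
     public static void main(String[] args) throws Exception {
       // args[0] is the Hudi table base path, e.g. hdfs://namenode:8020/tables/my_table
       Path metaDir = new Path(args[0], ".hoodie");
       FileSystem fs = metaDir.getFileSystem(new Configuration());
       for (FileStatus status : fs.listStatus(metaDir)) {
         // A 0-byte .clean file is what triggered "Not an Avro data file" above
         if (status.getPath().getName().endsWith(".clean") && status.getLen() == 0) {
           System.out.println("Empty clean file: " + status.getPath());
         }
       }
     }
   }
   ```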
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-03-02 Thread GitBox
bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation 
type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#discussion_r386701101
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/io/TestHoodieCommitArchiveLog.java
 ##
 @@ -413,4 +417,32 @@ private void verifyInflightInstants(HoodieTableMetaClient 
metaClient, int expect
 assertEquals("Loaded inflight clean actions and the count should match", 
expectedTotalInstants,
 timeline.countInstants());
   }
+
+  @Test
+  public void testCommitMetadataConverter() {
+HoodieCommitMetadata hoodieCommitMetadata = new HoodieCommitMetadata();
+hoodieCommitMetadata.setOperationType(WriteOperationType.INSERT);
+
+HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder().withPath(basePath)
+
.withSchema(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA).withParallelism(2, 
2).forTable("test-commitMetadata-converter")
+
.withCompactionConfig(HoodieCompactionConfig.newBuilder().retainCommits(1).archiveCommitsWith(2,
 5).build())
+.build();
+metaClient = HoodieTableMetaClient.reload(metaClient);
+HoodieCommitArchiveLog archiveLog = new HoodieCommitArchiveLog(cfg, 
metaClient);
+
+Class<HoodieCommitArchiveLog> clazz = HoodieCommitArchiveLog.class;
+try {
+  Method commitMetadataConverter = 
clazz.getDeclaredMethod("commitMetadataConverter", HoodieCommitMetadata.class);
 
 Review comment:
   @hddong One final comment: can you give commitMetadataConverter() in 
HoodieCommitArchiveLog default (package-private) access instead of private? That 
way, you don't need to deal with reflection; a minimal sketch follows below.
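   A minimal sketch of the suggested pattern, using hypothetical stand-in classes and signatures rather than the real Hudi ones:
   
   ```
   // Production-side class: no access modifier on the method gives it default
   // (package-private) visibility, so a test in the same package can call it.
   class ArchiveLogExample {
     // hypothetical converter; the real method and return type live in HoodieCommitArchiveLog
     String commitMetadataConverter(String commitMetadata) {
       return "converted:" + commitMetadata;
     }
   }
   
   // Test-side class in the same package: a direct call, no getDeclaredMethod/reflection.
   class ArchiveLogExampleTest {
     void testConverter() {
       ArchiveLogExample archiveLog = new ArchiveLogExample();
       String converted = archiveLog.commitMetadataConverter("metadata");
       assert converted.startsWith("converted:");
     }
   }
   ```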


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha merged pull request #1366: [HUDI-589] Follow on fixes to querying_data page

2020-03-02 Thread GitBox
bhasudha merged pull request #1366: [HUDI-589] Follow on fixes to querying_data 
page
URL: https://github.com/apache/incubator-hudi/pull/1366
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch asf-site updated: [HUDI-589] Follow on fixes to querying_data page

2020-03-02 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 7e82fe2  [HUDI-589] Follow on fixes to querying_data page
7e82fe2 is described below

commit 7e82fe2f1f1137ae946039b80ad3246abc3af7a3
Author: Vinoth Chandar 
AuthorDate: Mon Mar 2 13:43:39 2020 -0800

[HUDI-589] Follow on fixes to querying_data page
---
 docs/_docs/2_3_querying_data.md | 90 -
 1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 1a2ae08..c4ab865 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -9,10 +9,10 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark SQL, Spark datasource and Presto.
+bundle has been installed, the table can be queried by popular query engines 
like Hive, Spark SQL, Spark Datasource API and Presto.
 
 Specifically, following Hive tables are registered based off [table 
name](/docs/configurations.html#TABLE_NAME_OPT_KEY) 
-and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during 
write.   
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) configs passed 
during write.   
 
 If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
  - `hudi_trips` supports snapshot query and incremental query on the table 
backed by `HoodieParquetInputFormat`, exposing purely columnar data.
@@ -20,37 +20,39 @@ If `table name = hudi_trips` and `table type = 
COPY_ON_WRITE`, then we get:
 
 If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
  - `hudi_trips_rt` supports snapshot query and incremental query (providing 
near-real time data) on the table  backed by 
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
- - `hudi_trips_ro` supports read optimized query on the table backed by 
`HoodieParquetInputFormat`, exposing purely columnar data.
- 
+ - `hudi_trips_ro` supports read optimized query on the table backed by 
`HoodieParquetInputFormat`, exposing purely columnar data stored in base files.
 
-As discussed in the concepts section, the one key primitive needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
+As discussed in the concepts section, the one key capability needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
 is obtaining a change stream/log from a table. Hudi tables can be queried 
incrementally, which means you can get ALL and ONLY the updated & new rows 
 since a specified instant time. This, together with upserts, is particularly 
useful for building data pipelines where 1 or more source Hudi tables are 
incrementally queried (streams/facts),
 joined with other tables (tables/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi table. Incremental queries 
are realized by querying one of the tables above, 
 with special configurations that indicates to query planning that only 
incremental data needs to be fetched out of the table. 
 
 
-## SUPPORT MATRIX
+## Support Matrix
+
+Following tables show whether a given query is supported on specific query 
engine.
 
-### COPY_ON_WRITE tables
+### Copy-On-Write tables
   
-||Snapshot|Incremental|Read Optimized|
-|||---|--|
-|**Hive**|Y|Y|N/A|
-|**Spark SQL**|Y|Y|N/A|
-|**Spark datasource**|Y|Y|N/A|
-|**Presto**|Y|N|N/A|
+|Query Engine|Snapshot Queries|Incremental Queries|
+|||---|
+|**Hive**|Y|Y|
+|**Spark SQL**|Y|Y|
+|**Spark Datasource**|Y|Y|
+|**Presto**|Y|N|
+
+Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
 
-### MERGE_ON_READ tables
+### Merge-On-Read tables
 
-||Snapshot|Incremental|Read Optimized|
-|||---|--|
+|Query Engine|Snapshot Queries|Incremental Queries|Read Optimized Queries|
+|||---|--|
 |**Hive**|Y|Y|Y|
 |**Spark SQL**|Y|Y|Y|
-|**Spark datasource**|N|N|Y|
+|**Spark Datasource**|N|N|Y|
 |**Presto**|N|N|Y|
 
-
 In sections, below we will discuss specific setup to access different query 
types from different query engines. 
 
 ## Hive
@@ -103,13 +105,11 @@ would ensure Map Reduce execution is chosen for a Hive 
query, which combines par
 separated) and calls InputFormat.listStatus() 

[GitHub] [incubator-hudi] vinothchandar opened a new pull request #1366: [HUDI-589] Follow on fixes to querying_data page

2020-03-02 Thread GitBox
vinothchandar opened a new pull request #1366: [HUDI-589] Follow on fixes to 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1366
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-03-02 Thread Bhavani Sudha (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049672#comment-17049672
 ] 

Bhavani Sudha commented on HUDI-651:


[~vinoth] I can take this one.

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them the same way as with Hive QL. Something is amiss.
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> 

[jira] [Assigned] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-03-02 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-651:
--

Assignee: Bhavani Sudha  (was: Vinoth Chandar)

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them the same way as with Hive QL. Something is amiss.
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], 

[jira] [Updated] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-03-02 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-651:
---
Status: Open  (was: New)

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them the same way as with Hive QL. Something is amiss.
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> 

[jira] [Commented] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-03-02 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049656#comment-17049656
 ] 

Vinoth Chandar commented on HUDI-651:
-

cc [~vbalaji] [~bhasudha] FYI.

Can one of you take this one? It's kind of important.

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them the same way as with Hive QL. Something is amiss.
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> 

[jira] [Created] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-03-02 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-651:
---

 Summary: Incremental Query on Hive via Spark SQL does not return 
expected results
 Key: HUDI-651
 URL: https://issues.apache.org/jira/browse/HUDI-651
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Spark Integration
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar
 Fix For: 0.6.0


Using the docker demo, I added two delta commits to a MOR table and was 
hoping to incrementally consume them the same way as with Hive QL. Something is amiss.

{code}
scala> 
spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")

scala> 
spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")

scala> spark.sql("select distinct `_hoodie_commit_time` from 
stock_ticks_mor_rt").show(100, false)
+---+
|_hoodie_commit_time|
+---+
|20200302210010 |
|20200302210147 |
+---+


scala> sc.setLogLevel("INFO")

scala> spark.sql("select distinct `_hoodie_commit_time` from 
stock_ticks_mor_rt").show(100, false)
20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
version of codegened fast hashmap does not support this aggregate.
20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
version of codegened fast hashmap does not support this aggregate.
20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as values 
in memory (estimated size 292.3 KB, free 365.3 MB)
20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored as 
bytes in memory (estimated size 25.4 KB, free 365.3 MB)
20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie metadata 
from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
HoodieTableMetaClient from 
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, 
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 
hdfs-default.xml, hdfs-site.xml, 
org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
file:/etc/hadoop/hive-site.xml], FileSystem: 
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
(auth:SIMPLE)]]]
20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ(version=1) from 
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 1
20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
groups
20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
[[20200302210010__clean__COMPLETED], [20200302210010__deltacommit__COMPLETED], 
[20200302210147__clean__COMPLETED], [20200302210147__deltacommit__COMPLETED]]
20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
partition :2018/08/31, #FileGroups=1
20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to process 
after hoodie filter 1
20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie metadata 
from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
HoodieTableMetaClient from 
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, 
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 
hdfs-default.xml, hdfs-site.xml, 
org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
file:/etc/hadoop/hive-site.xml], FileSystem: 
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
(auth:SIMPLE)]]]
20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ(version=1) from 

[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer

2020-03-02 Thread GitBox
garyli1019 commented on a change in pull request #1362: HUDI-644 Enable user to 
get checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#discussion_r386608439
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
 ##
 @@ -529,6 +514,47 @@ private void registerAvroSchemas(SchemaProvider 
schemaProvider) {
 }
   }
 
+  /**
+   * Search delta streamer checkpoint from the previous commits.
+   *
+   * @param commitTimelineOpt HoodieTimeline object
+   * @return checkpoint metadata as String
+   */
+  private Option<String> retrieveCheckpointFromCommits(Option<HoodieTimeline> commitTimelineOpt) throws Exception {
+    Option<HoodieInstant> lastCommit = commitTimelineOpt.get().lastInstant();
+if (lastCommit.isPresent()) {
+  HoodieCommitMetadata commitMetadata = HoodieCommitMetadata
+  
.fromBytes(commitTimelineOpt.get().getInstantDetails(lastCommit.get()).get(), 
HoodieCommitMetadata.class);
+  // user-defined checkpoint appeared and not equal to the user-defined 
checkpoint of the last commit
+  if (cfg.checkpoint != null && 
!cfg.checkpoint.equals(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))) {
+return Option.of(cfg.checkpoint);
+  }
+  int commitsToCheckCnt;
+  // search check point from previous commits in backward
+  if (cfg.searchCheckpoint) {
+commitsToCheckCnt = commitTimelineOpt.get().countInstants();
+  } else {
+commitsToCheckCnt = 1; // only check the last commit
+  }
+  Option<HoodieInstant> curCommit;
+  for (int i = 0; i < commitsToCheckCnt; ++i) {
 
 Review comment:
   I am using `nthFromLastInstant(i)`, so I think this is starting from the last 
commit and walking backward; see the sketch below.
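   For context, a hedged sketch of how such a backward walk might look. This is a method-level fragment assuming the imports already present in DeltaSync; the helper name, the `CHECKPOINT_KEY` constant and the empty-check are assumptions based on the surrounding code, not the exact PR diff:
   
   ```
   // Walks the timeline starting from the most recent instant (i = 0 is the last
   // commit) and returns the first checkpoint value found in the commit metadata.
   private Option<String> searchCheckpointBackwards(HoodieTimeline timeline, int commitsToCheckCnt)
       throws IOException {
     for (int i = 0; i < commitsToCheckCnt; ++i) {
       Option<HoodieInstant> curCommit = timeline.nthFromLastInstant(i);
       if (!curCommit.isPresent()) {
         break;
       }
       HoodieCommitMetadata metadata = HoodieCommitMetadata
           .fromBytes(timeline.getInstantDetails(curCommit.get()).get(), HoodieCommitMetadata.class);
       String checkpoint = metadata.getMetadata(CHECKPOINT_KEY);
       if (checkpoint != null && !checkpoint.isEmpty()) {
         return Option.of(checkpoint);
       }
     }
     return Option.empty();
   }
   ```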


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer

2020-03-02 Thread GitBox
garyli1019 commented on issue #1362: HUDI-644 Enable user to get checkpoint 
from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-593579929
 
 
   @pratyakshsharma Thanks for reviewing this PR.
   I can say more about my use case:
   
   - I am using Kafka Connect to sink Kafka to HDFS every 30 minutes, partitioned by the arrival time (year=xx/month=xx/day=xx/hour=xx).
   - The homebrew Spark datasource I built uses (current time - last Hudi commit timestamp) to compute a `timeWindow`, and uses that window to load the data generated by Kafka Connect (see the sketch at the end of this comment).
   - It would be easy to switch to the delta streamer with a DFS source, but a little tricky to switch to Kafka, because of the delay introduced by Kafka Connect.
   
   So right now, if I switch to the delta streamer ingesting directly from Kafka, I would have to start from the `LATEST` checkpoint; `EARLIEST` is not possible because my Kafka cluster's retention time is pretty long.
   The data loss I mentioned is the few-hours gap around the commit where I switch from my homebrew Spark datasource reader to the delta streamer. All I need to do at that commit is run the delta streamer first to store the `LATEST` checkpoint, then run my datasource reader to read the data in that few-hours gap. I only need one parallel run there, and then I am good to go with the delta streamer.
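   For illustration, a rough sketch of the time-window computation described above; the commit timestamp format (yyyyMMddHHmmss), the hourly partition layout and the class itself are assumptions drawn from this comment, not the actual homebrew reader:
   
   ```
   import java.time.LocalDateTime;
   import java.time.format.DateTimeFormatter;
   import java.util.ArrayList;
   import java.util.List;
   
   public class TimeWindowPaths {
     private static final DateTimeFormatter COMMIT_FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
   
     // Enumerates the hourly year=/month=/day=/hour= partitions written by Kafka
     // Connect between the last Hudi commit time and "now".
     public static List<String> hourlyPaths(String basePath, String lastCommitTime, LocalDateTime now) {
       LocalDateTime cursor = LocalDateTime.parse(lastCommitTime, COMMIT_FMT)
           .withMinute(0).withSecond(0).withNano(0);
       List<String> paths = new ArrayList<>();
       while (!cursor.isAfter(now)) {
         paths.add(String.format("%s/year=%d/month=%02d/day=%02d/hour=%02d",
             basePath, cursor.getYear(), cursor.getMonthValue(), cursor.getDayOfMonth(), cursor.getHour()));
         cursor = cursor.plusHours(1);
       }
       return paths;
     }
   }
   ```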


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bwu2 commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-03-02 Thread GitBox
bwu2 commented on issue #1128: [HUDI-453] Fix throw failed to archive commits 
error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-593578500
 
 
   I'm not sure if this is a separate issue or not (it seems very similar but not identical), so I am leaving a comment here. Somehow, we have an empty `.clean` file (the commit itself was successful). This has caused:
   
   `20/03/02 18:17:56 ERROR HoodieCommitArchiveLog: Failed to archive commits, .commit file: 20200221211054.clean
   java.io.IOException: Not an Avro data file`
   
   Note this file was created by a production release (0.5.0) of Hudi. We updated our Hudi version to 0.5.1 (and 0.5.2) and it is still not able to handle it. I guess we can delete the offending 0-byte `.clean` file and everything will work.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

2020-03-02 Thread GitBox
bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress 
bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-593571838
 
 
   Marking as WIP for now. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch asf-site updated: [HUDI-589][DOCS] Fix querying_data page (#1333)

2020-03-02 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 840b07e  [HUDI-589][DOCS] Fix querying_data page (#1333)
840b07e is described below

commit 840b07ee4452e1f0654a32cb32cd7bed3279edcf
Author: Bhavani Sudha Saktheeswaran 
AuthorDate: Mon Mar 2 11:09:41 2020 -0800

[HUDI-589][DOCS] Fix querying_data page (#1333)

- Added support matrix for COW and MOR tables
- Change reference from (`views`|`pulls`) to `queries`
- And minor restructuring
---
 docs/_docs/2_3_querying_data.md | 131 ++--
 1 file changed, 73 insertions(+), 58 deletions(-)

diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 0ee5e17..1a2ae08 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -9,7 +9,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark and Presto.
+bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark SQL, Spark datasource and Presto.
 
 Specifically, following Hive tables are registered based off [table 
name](/docs/configurations.html#TABLE_NAME_OPT_KEY) 
 and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during 
write.   
@@ -24,31 +24,49 @@ If `table name = hudi_trips` and `table type = 
MERGE_ON_READ`, then we get:
  
 
 As discussed in the concepts section, the one key primitive needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a table). Hudi 
tables can be pulled incrementally, which means you can get ALL and ONLY the 
updated & new rows 
-since a specified instant time. This, together with upserts, are particularly 
useful for building data pipelines where 1 or more source Hudi tables are 
incrementally pulled (streams/facts),
-joined with other tables (tables/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi table. Incremental view is 
realized by querying one of the tables above, 
+is obtaining a change stream/log from a table. Hudi tables can be queried 
incrementally, which means you can get ALL and ONLY the updated & new rows 
+since a specified instant time. This, together with upserts, is particularly 
useful for building data pipelines where 1 or more source Hudi tables are 
incrementally queried (streams/facts),
+joined with other tables (tables/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi table. Incremental queries 
are realized by querying one of the tables above, 
 with special configurations that indicates to query planning that only 
incremental data needs to be fetched out of the table. 
 
-In sections, below we will discuss how to access these query types from 
different query engines.
+
+## SUPPORT MATRIX
+
+### COPY_ON_WRITE tables
+  
+||Snapshot|Incremental|Read Optimized|
+|||---|--|
+|**Hive**|Y|Y|N/A|
+|**Spark SQL**|Y|Y|N/A|
+|**Spark datasource**|Y|Y|N/A|
+|**Presto**|Y|N|N/A|
+
+### MERGE_ON_READ tables
+
+||Snapshot|Incremental|Read Optimized|
+|||---|--|
+|**Hive**|Y|Y|Y|
+|**Spark SQL**|Y|Y|Y|
+|**Spark datasource**|N|N|Y|
+|**Presto**|N|N|Y|
+
+
+In sections, below we will discuss specific setup to access different query 
types from different query engines. 
 
 ## Hive
 
-In order for Hive to recognize Hudi tables and query correctly, the 
HiveServer2 needs to be provided with the 
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
-in its [aux jars 
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr).
 This will ensure the input format 
+In order for Hive to recognize Hudi tables and query correctly, 
+ - the HiveServer2 needs to be provided with the 
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` in its [aux jars 
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr).
 This will ensure the input format 
 classes with its dependencies are available for query planning & execution. 
+ - For MERGE_ON_READ tables, additionally the bundle needs to be put on the 
hadoop/hive installation across the cluster, so that queries can pick up the 
custom RecordReader as well.
 
-### Read optimized query
-In addition to setup above, for beeline cli access, the `hive.input.format` 

[GitHub] [incubator-hudi] vinothchandar merged pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-03-02 Thread GitBox
vinothchandar merged pull request #1333: [HUDI-589][DOCS] Fix querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1333: [HUDI-589][DOCS] Fix querying_data page

2020-03-02 Thread GitBox
vinothchandar commented on issue #1333: [HUDI-589][DOCS] Fix querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#issuecomment-593566035
 
 
   Let's open a JIRA for this.
   
   @bhasudha if you don't mind, I'll land this and open a follow-up with minor fixes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on issue #1333: [HUDI-589][DOCS] Fix querying_data page

2020-03-02 Thread GitBox
bhasudha commented on issue #1333: [HUDI-589][DOCS] Fix querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#issuecomment-593562940
 
 
   @garyli1019 Thanks for the Impala-related docs. When time permits, please add Impala support to the Support Matrix I have added on this page for each query engine and Hudi table type.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

2020-03-02 Thread GitBox
bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress 
bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-593560005
 
 
   @lamber-ken @leesf @nsivabalan: Yes, the additional string conversion is not needed, so I refactored a little to use the correct bloom-filter serialization method (based on whether compression is enabled or not).
   
   @lamber-ken: I am observing the same behavior when comparing the compression vs. non-compression case. Compression performs increasingly poorly as bloom-filter utilization (the number of keys stored in the bloom filter) grows. Snappy behaves the same way (although worse than gzip). I would need to investigate this further.
   
   Result 
   
   ```
   test random keys
   original size: 4792548
   compress size (utilization=10%) : 2150956, CompressToOriginal=44
   compress size (utilization=20%) : 3078736, CompressToOriginal=64
   compress size (utilization=30%) : 3638548, CompressToOriginal=75
   compress size (utilization=40%) : 3977508, CompressToOriginal=82
   compress size (utilization=50%) : 4258972, CompressToOriginal=88
   compress size (utilization=60%) : 4490484, CompressToOriginal=93
   compress size (utilization=70%) : 4647776, CompressToOriginal=96
   compress size (utilization=80%) : 4750028, CompressToOriginal=99
   compress size (utilization=90%) : 4794040, CompressToOriginal=100
   
   test sequential keys
   original size: 4792548
   Using Byte[] - compress size (utilization=10%) : 2150852, CompressToOriginal=44
   Using Byte[] - compress size (utilization=20%) : 3078332, CompressToOriginal=64
   Using Byte[] - compress size (utilization=30%) : 3639000, CompressToOriginal=75
   Using Byte[] - compress size (utilization=40%) : 3977764, CompressToOriginal=82
   Using Byte[] - compress size (utilization=50%) : 4258544, CompressToOriginal=88
   Using Byte[] - compress size (utilization=60%) : 4490372, CompressToOriginal=93
   Using Byte[] - compress size (utilization=70%) : 4647832, CompressToOriginal=96
   Using Byte[] - compress size (utilization=80%) : 4749928, CompressToOriginal=99
   Using Byte[] - compress size (utilization=90%) : 4794040, CompressToOriginal=100
   
   Process finished with exit code 0
   
   ```
   
   Test code:
   ```
   @Test
   public void testit() {
     int[] utilization = new int[] {10, 20, 30, 40, 50, 60, 70, 80, 90};
   
     System.out.println("test random keys");
     int originalSize = 0;
     for (int i = 0; i < utilization.length; i++) {
       SimpleBloomFilter filter = new SimpleBloomFilter(100, 0.01, Hash.MURMUR_HASH);
       int numKeys = 1 * utilization[i];
       for (int j = 0; j < numKeys; j++) {
         String key = UUID.randomUUID().toString();
         filter.add(key);
       }
   
       if (i == 0) {
         originalSize = filter.serializeToString().length();
         System.out.println("original size: " + filter.serializeToString().length());
       }
       int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
       System.out.println("compress size (utilization=" + utilization[i] + "%) : "
           + compressedSize + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
     }
   
     System.out.println("\ntest sequential keys");
   
     for (int i = 0; i < utilization.length; i++) {
       SimpleBloomFilter filter = new SimpleBloomFilter(100, 0.01, Hash.MURMUR_HASH);
       int numKeys = 1 * utilization[i];
       for (int j = 0; j < numKeys; j++) {
         String key = "key-" + j;
         filter.add(key);
       }
       if (i == 0) {
         originalSize = filter.serializeToString().length();
         System.out.println("original size: " + filter.serializeToString().length());
       }
       int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
       System.out.println("Using Byte[] - compress size (utilization=" + utilization[i] + "%) : "
           + compressedSize + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
     }
   }
   ```
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-650) HoodieTable handleUpdate to include partition paths

2020-03-02 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-650:

Fix Version/s: (was: 0.5.2)

> HoodieTable handleUpdate to include partition paths
> ---
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: satish
>Assignee: satish
>Priority: Minor
>
> HoodieTable handleUpdate takes in fileId and list of records.  It does not 
> validate all records belong to same partitionPath. This is error prone - 
> there is already at least one test that is passing in records that belong to 
> several partitions to this method. Fix to take partitionPath and also 
> validate all records belong to same partition path.
> This will be helpful for  adding 'insert overwrite' functionality as well.
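A hedged illustration of the kind of validation being proposed above; the class and method names below are hypothetical placeholders, not the actual HoodieTable/HoodieMergeHandle API:

{code:java}
import java.util.List;

import org.apache.hudi.common.model.HoodieRecord;

/** Hypothetical helper sketching the partition-path check described above. */
final class PartitionPathValidation {

  static void validateSamePartition(String partitionPath, List<HoodieRecord> records) {
    for (HoodieRecord record : records) {
      if (!partitionPath.equals(record.getPartitionPath())) {
        throw new IllegalArgumentException("Record " + record.getRecordKey()
            + " belongs to partition " + record.getPartitionPath()
            + ", but handleUpdate was invoked for partition " + partitionPath);
      }
    }
  }
}
{code}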



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-650) HoodieTable handleUpdate to include partition paths

2020-03-02 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-650:

Component/s: (was: CLI)
 Code Cleanup

> HoodieTable handleUpdate to include partition paths
> ---
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: satish
>Assignee: satish
>Priority: Minor
> Fix For: 0.5.2
>
>
> HoodieTable handleUpdate takes in fileId and list of records.  It does not 
> validate all records belong to same partitionPath. This is error prone - 
> there is already at least one test that is passing in records that belong to 
> several partitions to this method. Fix to take partitionPath and also 
> validate all records belong to same partition path.
> This will be helpful for  adding 'insert overwrite' functionality as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-650) HoodieTable handleUpdate to include partition paths

2020-03-02 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-650:

Status: Open  (was: New)

> HoodieTable handleUpdate to include partition paths
> ---
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: satish
>Assignee: satish
>Priority: Minor
>
> HoodieTable handleUpdate takes in fileId and list of records.  It does not 
> validate all records belong to same partitionPath. This is error prone - 
> there is already at least one test that is passing in records that belong to 
> several partitions to this method. Fix to take partitionPath and also 
> validate all records belong to same partition path.
> This will be helpful for  adding 'insert overwrite' functionality as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-650) HoodieTable handleUpdate to include partition paths

2020-03-02 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-650:

Description: 
HoodieTable handleUpdate takes in fileId and list of records.  It does not 
validate all records belong to same partitionPath. This is error prone - there 
is already at least one test that is passing in records that belong to several 
partitions to this method. Fix to take partitionPath and also validate all 
records belong to same partition path.

This will be helpful for  adding 'insert overwrite' functionality as well.

  was:HoodieMergeHandle has 


> HoodieTable handleUpdate to include partition paths
> ---
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> HoodieTable handleUpdate takes in fileId and list of records.  It does not 
> validate all records belong to same partitionPath. This is error prone - 
> there is already at least one test that is passing in records that belong to 
> several partitions to this method. Fix to take partitionPath and also 
> validate all records belong to same partition path.
> This will be helpful for  adding 'insert overwrite' functionality as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-650) HoodieTable handleUpdate to include partition paths

2020-03-02 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-650:

Labels:   (was: pull-request-available)

> HoodieTable handleUpdate to include partition paths
> ---
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: satish
>Assignee: satish
>Priority: Minor
> Fix For: 0.5.2
>
>
> HoodieTable handleUpdate takes in fileId and list of records.  It does not 
> validate all records belong to same partitionPath. This is error prone - 
> there is already at least one test that is passing in records that belong to 
> several partitions to this method. Fix to take partitionPath and also 
> validate all records belong to same partition path.
> This will be helpful for  adding 'insert overwrite' functionality as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-650) HoodieTable handleUpdate to include partition paths

2020-03-02 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-650:

Summary: HoodieTable handleUpdate to include partition paths  (was: Update 
HoodieMergeHandle to include partition paths)

> HoodieTable handleUpdate to include partition paths
> ---
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> HoodieMergeHandle has 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-650) Update HoodieMergeHandle to include partition paths

2020-03-02 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-650:

Summary: Update HoodieMergeHandle to include partition paths  (was: Update 
handles to include partition paths)

> Update HoodieMergeHandle to include partition paths
> ---
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-650) Update HoodieMergeHandle to include partition paths

2020-03-02 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-650:

Description: HoodieMergeHandle has 

> Update HoodieMergeHandle to include partition paths
> ---
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> HoodieMergeHandle has 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-650) Update handles to include partition paths

2020-03-02 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-650:

Description: (was: Hudi CLI has 'show archived commits' command which 
is not very helpful

 
{code:java}
->show archived commits
===> Showing only 10 archived commits <===
    
    | CommitTime    | CommitType|
    |===|
    | 2019033304| commit    |
    | 20190323220154| commit    |
    | 20190323220154| commit    |
    | 20190323224004| commit    |
    | 20190323224013| commit    |
    | 20190323224229| commit    |
    | 20190323224229| commit    |
    | 20190323232849| commit    |
    | 20190323233109| commit    |
    | 20190323233109| commit    |
 {code}
Modify or introduce new command to make it easy to debug

 )

> Update handles to include partition paths
> -
>
> Key: HUDI-650
> URL: https://issues.apache.org/jira/browse/HUDI-650
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-650) Update handles to include partition paths

2020-03-02 Thread satish (Jira)
satish created HUDI-650:
---

 Summary: Update handles to include partition paths
 Key: HUDI-650
 URL: https://issues.apache.org/jira/browse/HUDI-650
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: CLI
Reporter: satish
Assignee: satish
 Fix For: 0.5.2


Hudi CLI has 'show archived commits' command which is not very helpful

 
{code:java}
->show archived commits
===> Showing only 10 archived commits <===
    
    | CommitTime    | CommitType|
    |===|
    | 2019033304| commit    |
    | 20190323220154| commit    |
    | 20190323220154| commit    |
    | 20190323224004| commit    |
    | 20190323224013| commit    |
    | 20190323224229| commit    |
    | 20190323224229| commit    |
    | 20190323232849| commit    |
    | 20190323233109| commit    |
    | 20190323233109| commit    |
 {code}
Modify or introduce new command to make it easy to debug

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] hddong commented on issue #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-03-02 Thread GitBox
hddong commented on issue #1157: [HUDI-332]Add operation type 
(insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#issuecomment-593433072
 
 
   @bvaradar As you said, I kept operationType as a string here, and I have resolved all conflicts.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong edited a comment on issue #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-03-02 Thread GitBox
hddong edited a comment on issue #1157: [HUDI-332]Add operation type 
(insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#issuecomment-593433072
 
 
   @bvaradar As you said, I kept operationType as a string here, and I have resolved all conflicts.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1354: [WIP][HUDI-581] NOTICE need more work as it missing content form included 3rd party ALv2 licensed NOTICE files

2020-03-02 Thread GitBox
codecov-io edited a comment on issue #1354: [WIP][HUDI-581] NOTICE need more 
work as it missing content form included 3rd party ALv2 licensed NOTICE files
URL: https://github.com/apache/incubator-hudi/pull/1354#issuecomment-593259405
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=h1) 
Report
   > :exclamation: No coverage uploaded for pull request base 
(`master@4e7fcde`). [Click here to learn what that 
means](https://docs.codecov.io/docs/error-reference#section-missing-base-commit).
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1354/graphs/tree.svg?width=650=VTTXabwbs2=150=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##             master    #1354   +/-   ##
   ==========================================
     Coverage          ?   67.09%
     Complexity        ?      221
   ==========================================
     Files             ?      333
     Lines             ?    16216
     Branches          ?     1659
   ==========================================
     Hits              ?    10880
     Misses            ?     4598
     Partials          ?      738
   ```
   
   
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=footer).
 Last update 
[4e7fcde...cea03e1](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
codecov-io edited a comment on issue #1360: [HUDI-344][RFC-09] Hudi Dataset 
Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#issuecomment-593050937
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1360?src=pr=h1) 
Report
   > :exclamation: No coverage uploaded for pull request base 
(`master@b7f35be`). [Click here to learn what that 
means](https://docs.codecov.io/docs/error-reference#section-missing-base-commit).
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1360/graphs/tree.svg?width=650=VTTXabwbs2=150=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1360?src=pr=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##             master    #1360   +/-   ##
   ==========================================
     Coverage          ?    0.64%
     Complexity        ?        2
   ==========================================
     Files             ?      287
     Lines             ?    14319
     Branches          ?     1465
   ==========================================
     Hits              ?       92
     Misses            ?    14224
     Partials          ?        3
   ```
   
   
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1360?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1360?src=pr=footer).
 Last update 
[b7f35be...e917358](https://codecov.io/gh/apache/incubator-hudi/pull/1360?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer

2020-03-02 Thread GitBox
pratyakshsharma commented on a change in pull request #1362: HUDI-644 Enable 
user to get checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#discussion_r386281673
 
 

 ##
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
 ##
 @@ -529,6 +514,47 @@ private void registerAvroSchemas(SchemaProvider schemaProvider) {
     }
   }
 
+  /**
+   * Search delta streamer checkpoint from the previous commits.
+   *
+   * @param commitTimelineOpt HoodieTimeline object
+   * @return checkpoint metadata as String
+   */
+  private Option<String> retrieveCheckpointFromCommits(Option<HoodieTimeline> commitTimelineOpt) throws Exception {
+    Option<HoodieInstant> lastCommit = commitTimelineOpt.get().lastInstant();
+    if (lastCommit.isPresent()) {
+      HoodieCommitMetadata commitMetadata = HoodieCommitMetadata
+          .fromBytes(commitTimelineOpt.get().getInstantDetails(lastCommit.get()).get(), HoodieCommitMetadata.class);
+      // user-defined checkpoint appeared and is not equal to the user-defined checkpoint of the last commit
+      if (cfg.checkpoint != null && !cfg.checkpoint.equals(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))) {
+        return Option.of(cfg.checkpoint);
+      }
+      int commitsToCheckCnt;
+      // search checkpoint from previous commits, walking backward
+      if (cfg.searchCheckpoint) {
+        commitsToCheckCnt = commitTimelineOpt.get().countInstants();
+      } else {
+        commitsToCheckCnt = 1; // only check the last commit
+      }
+      Option<HoodieInstant> curCommit;
+      for (int i = 0; i < commitsToCheckCnt; ++i) {
 
 Review comment:
   Here you are iterating from the very beginning. What about the case where commit 2 and commit 5 were both done using DeltaStreamer and both of them have CHECKPOINT_KEY? In this case commit 5 has the latest checkpoint to use, but as per your logic you end up considering the commit 2 checkpoint.
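   For illustration only, here is a hedged sketch of iterating the timeline newest-first so that the most recent checkpointed commit wins. It assumes the method lives inside DeltaSync next to the quoted code (so CHECKPOINT_KEY, Option and the timeline types are in scope); it is not the actual implementation:
   
   ```java
   // Sketch: walk completed commits from newest to oldest and return the first
   // checkpoint found, so commit 5's checkpoint is preferred over commit 2's.
   // Assumes java.util.Collections, java.util.List and java.util.stream.Collectors are imported.
   private Option<String> findLatestCheckpoint(HoodieTimeline timeline) throws IOException {
     List<HoodieInstant> instants = timeline.getInstants().collect(Collectors.toList());
     Collections.reverse(instants); // newest first
     for (HoodieInstant instant : instants) {
       HoodieCommitMetadata metadata = HoodieCommitMetadata
           .fromBytes(timeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
       String checkpoint = metadata.getMetadata(CHECKPOINT_KEY);
       if (checkpoint != null && !checkpoint.isEmpty()) {
         return Option.of(checkpoint);
       }
     }
     return Option.empty();
   }
   ```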


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386280565
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.TableFileSystemView;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Export the latest records of Hudi dataset to a set of external files (e.g., 
plain parquet files).
+ */
+
+public class HoodieSnapshotExporter {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSnapshotExporter.class);
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--source-base-path", "-sbp"}, description = "Base 
path for the source Hudi dataset to be snapshotted", required = true)
+String basePath = null;
+
+@Parameter(names = {"--target-base-path", "-tbp"}, description = "Base 
path for the target output files (snapshots)", required = true)
+String outputPath = null;
+
+@Parameter(names = {"--snapshot-prefix", "-sp"}, description = "Snapshot 
prefix or directory under the target base path in order to segregate different 
snapshots")
+String snapshotPrefix;
+
+@Parameter(names = {"--output-format", "-of"}, description = "e.g. Hudi or 
Parquet", required = true)
+String outputFormat;
+
+@Parameter(names = {"--output-partition-field", "-opf"}, description = "A 
field to be used by Spark repartitioning")
+String outputPartitionField;
+  }
+
+  public void export(SparkSession spark, Config cfg) throws IOException {
+String sourceBasePath = cfg.basePath;
+String targetBasePath = cfg.outputPath;
+String snapshotPrefix = cfg.snapshotPrefix;
+String outputFormat = cfg.outputFormat;
+String outputPartitionField = cfg.outputPartitionField;
+JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+FileSystem fs = FSUtils.getFs(sourceBasePath, jsc.hadoopConfiguration());
+
+final SerializableConfiguration serConf = new 
SerializableConfiguration(jsc.hadoopConfiguration());
+final HoodieTableMetaClient tableMetadata = new 
HoodieTableMetaClient(fs.getConf(), sourceBasePath);
+final TableFileSystemView.BaseFileOnlyView fsView = new 
HoodieTableFileSystemView(tableMetadata,
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
+// Get the latest commit
+Option latestCommit =
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+if (!latestCommit.isPresent()) {
+  LOG.warn("No commits present. Nothing to snapshot");
+  return;
+}
+final String latestCommitTimestamp = latestCommit.get().getTimestamp();
+LOG.info(String.format("Starting to snapshot latest version files which 
are also 

[GitHub] [incubator-hudi] OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386279308
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.TableFileSystemView;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Export the latest records of Hudi dataset to a set of external files (e.g., 
plain parquet files).
+ */
+
+public class HoodieSnapshotExporter {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSnapshotExporter.class);
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--source-base-path", "-sbp"}, description = "Base 
path for the source Hudi dataset to be snapshotted", required = true)
+String basePath = null;
+
+@Parameter(names = {"--target-base-path", "-tbp"}, description = "Base 
path for the target output files (snapshots)", required = true)
+String outputPath = null;
+
+@Parameter(names = {"--snapshot-prefix", "-sp"}, description = "Snapshot 
prefix or directory under the target base path in order to segregate different 
snapshots")
+String snapshotPrefix;
+
+@Parameter(names = {"--output-format", "-of"}, description = "e.g. Hudi or 
Parquet", required = true)
+String outputFormat;
+
+@Parameter(names = {"--output-partition-field", "-opf"}, description = "A 
field to be used by Spark repartitioning")
+String outputPartitionField;
+  }
+
+  public void export(SparkSession spark, Config cfg) throws IOException {
+String sourceBasePath = cfg.basePath;
+String targetBasePath = cfg.outputPath;
+String snapshotPrefix = cfg.snapshotPrefix;
+String outputFormat = cfg.outputFormat;
+String outputPartitionField = cfg.outputPartitionField;
+JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+FileSystem fs = FSUtils.getFs(sourceBasePath, jsc.hadoopConfiguration());
+
+final SerializableConfiguration serConf = new 
SerializableConfiguration(jsc.hadoopConfiguration());
+final HoodieTableMetaClient tableMetadata = new 
HoodieTableMetaClient(fs.getConf(), sourceBasePath);
+final TableFileSystemView.BaseFileOnlyView fsView = new 
HoodieTableFileSystemView(tableMetadata,
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
+// Get the latest commit
+Option latestCommit =
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+if (!latestCommit.isPresent()) {
+  LOG.warn("No commits present. Nothing to snapshot");
+  return;
+}
+final String latestCommitTimestamp = latestCommit.get().getTimestamp();
+LOG.info(String.format("Starting to snapshot latest version files which 
are also 

[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer

2020-03-02 Thread GitBox
pratyakshsharma commented on issue #1362: HUDI-644 Enable user to get 
checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-593306470
 
 
   @garyli1019 If I understand it correctly, you are talking about a use case where you use HoodieDeltaStreamer along with the Spark data source as a backup. Why do you want to have two different pipelines writing to the same destination path?
   
   If you really want a backup to prevent any data loss, you can write to a separate path using the Spark data source and continue using DeltaStreamer to write to the Hudi dataset. In case of any issues, you can always use CHECKPOINT_RESET_KEY to ingest the data from your backup path into your Hudi dataset path. We have support for Kafka as well as DFS sources for this purpose. A sketch of such a backup write is shown below.
   
   Also, what is the source for your homebrew Spark job? If it is also consuming from Kafka, then I do not see any case where using DeltaStreamer can result in data loss. Can you please explain why you want to use two pipelines writing to the same destination path?
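   As a hedged sketch (mirroring the datasource options used in the test code elsewhere in this thread; the path, table name and parallelism values are placeholders, and `inputDF` is assumed to be whatever DataFrame your homebrew job produces):
   
   ```java
   // Write the backup to its own base path, separate from the DeltaStreamer-managed dataset.
   inputDF.write().format("hudi")
       .option("hoodie.insert.shuffle.parallelism", "4")
       .option("hoodie.upsert.shuffle.parallelism", "4")
       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
       .option(HoodieWriteConfig.TABLE_NAME, "backup_table")
       .mode(SaveMode.Append)
       .save("/data/backup/hudi_backup_path"); // not the DeltaStreamer target path
   ```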


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386277992
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+
+public class TestHoodieSnapshotExporter {
 
 Review comment:
   There was little or no connection between them, so I chose not to inherit from it. I have updated the code in the latest revision.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386276963
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+
+public class TestHoodieSnapshotExporter {
+  private static String TEST_WRITE_TOKEN = "1-0-1";
+
+  private SparkSession spark = null;
+  private HoodieTestDataGenerator dataGen = null;
+  private String basePath = null;
+  private String outputPath = null;
+  private String rootPath = null;
+  private FileSystem fs = null;
+  private Map commonOpts;
+  private HoodieSnapshotExporter.Config cfg;
+  private JavaSparkContext jsc = null;
+
+  @Before
+  public void initialize() throws IOException {
+spark = SparkSession.builder()
+.appName("Hoodie Datasource test")
+.master("local[2]")
+.config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
+.getOrCreate();
+jsc = new JavaSparkContext(spark.sparkContext());
+dataGen = new HoodieTestDataGenerator();
+TemporaryFolder folder = new TemporaryFolder();
+folder.create();
+basePath = folder.getRoot().getAbsolutePath();
+fs = FSUtils.getFs(basePath, spark.sparkContext().hadoopConfiguration());
+commonOpts = new HashMap();
+
+commonOpts.put("hoodie.insert.shuffle.parallelism", "4");
+commonOpts.put("hoodie.upsert.shuffle.parallelism", "4");
+commonOpts.put(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), 
"_row_key");
+commonOpts.put(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), 
"partition");
+commonOpts.put(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), 
"timestamp");
+commonOpts.put(HoodieWriteConfig.TABLE_NAME, "hoodie_test");
+
+
+cfg = new HoodieSnapshotExporter.Config();
+
+cfg.sourceBasePath = basePath;
+cfg.targetOutputPath = outputPath = basePath + "/target";
+cfg.outputFormat = "json";
+cfg.outputPartitionField = "partition";
+
+  }
+
+  @After
+  public void cleanup() throws Exception {
+if (spark != null) {
+  spark.stop();
+}
+  }
+
+  @Test
+  public void testSnapshotExporter() throws IOException {
+// Insert Operation
+List records = 
DataSourceTestUtils.convertToStringList(dataGen.generateInserts("000", 100));
+Dataset inputDF = spark.read().json(new 
JavaSparkContext(spark.sparkContext()).parallelize(records, 2));
+inputDF.write().format("hudi")
+.options(commonOpts)
+.option(DataSourceWriteOptions.OPERATION_OPT_KEY(), 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL())
+.mode(SaveMode.Overwrite)
+.save(basePath);
+long sourceCount = inputDF.count();
+
+HoodieSnapshotExporter hoodieSnapshotExporter = new 
HoodieSnapshotExporter();
+hoodieSnapshotExporter.export(spark, cfg);
+
+long targetCount = spark.read().json(outputPath).count();
+
+assertTrue(sourceCount == targetCount);
+
+// Test snapshotPrefix
+long filterCount = inputDF.where("partition == '2015/03/16'").count();
+cfg.snapshotPrefix = 

[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1354: [WIP][HUDI-581] NOTICE need more work as it missing content form included 3rd party ALv2 licensed NOTICE files

2020-03-02 Thread GitBox
codecov-io edited a comment on issue #1354: [WIP][HUDI-581] NOTICE need more 
work as it missing content form included 3rd party ALv2 licensed NOTICE files
URL: https://github.com/apache/incubator-hudi/pull/1354#issuecomment-593259405
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=h1) 
Report
   > :exclamation: No coverage uploaded for pull request base 
(`master@4e7fcde`). [Click here to learn what that 
means](https://docs.codecov.io/docs/error-reference#section-missing-base-commit).
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1354/graphs/tree.svg?width=650=VTTXabwbs2=150=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##             master    #1354   +/-   ##
   ==========================================
     Coverage          ?    67.1%
     Complexity        ?      223
   ==========================================
     Files             ?      333
     Lines             ?    16216
     Branches          ?     1659
   ==========================================
     Hits              ?    10881
     Misses            ?     4598
     Partials          ?      737
   ```
   
   
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=footer).
 Last update 
[4e7fcde...88f0454](https://codecov.io/gh/apache/incubator-hudi/pull/1354?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386261497
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.TableFileSystemView;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Export the latest records of Hudi dataset to a set of external files (e.g., 
plain parquet files).
+ */
+
+public class HoodieSnapshotExporter {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSnapshotExporter.class);
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--source-base-path", "-sbp"}, description = "Base 
path for the source Hudi dataset to be snapshotted", required = true)
+String basePath = null;
+
+@Parameter(names = {"--target-base-path", "-tbp"}, description = "Base 
path for the target output files (snapshots)", required = true)
+String outputPath = null;
+
+@Parameter(names = {"--snapshot-prefix", "-sp"}, description = "Snapshot 
prefix or directory under the target base path in order to segregate different 
snapshots")
+String snapshotPrefix;
+
+@Parameter(names = {"--output-format", "-of"}, description = "e.g. Hudi or 
Parquet", required = true)
+String outputFormat;
+
+@Parameter(names = {"--output-partition-field", "-opf"}, description = "A 
field to be used by Spark repartitioning")
+String outputPartitionField;
+  }
+
+  public void export(SparkSession spark, Config cfg) throws IOException {
+String sourceBasePath = cfg.basePath;
+String targetBasePath = cfg.outputPath;
+String snapshotPrefix = cfg.snapshotPrefix;
+String outputFormat = cfg.outputFormat;
+String outputPartitionField = cfg.outputPartitionField;
+JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+FileSystem fs = FSUtils.getFs(sourceBasePath, jsc.hadoopConfiguration());
+
+final SerializableConfiguration serConf = new 
SerializableConfiguration(jsc.hadoopConfiguration());
+final HoodieTableMetaClient tableMetadata = new 
HoodieTableMetaClient(fs.getConf(), sourceBasePath);
+final TableFileSystemView.BaseFileOnlyView fsView = new 
HoodieTableFileSystemView(tableMetadata,
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
+// Get the latest commit
+Option latestCommit =
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+if (!latestCommit.isPresent()) {
+  LOG.warn("No commits present. Nothing to snapshot");
+  return;
+}
+final String latestCommitTimestamp = latestCommit.get().getTimestamp();
+LOG.info(String.format("Starting to snapshot latest version files which 
are also 

[GitHub] [incubator-hudi] OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386258466
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/DataSourceTestUtils.java
 ##
 @@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.common.TestRawTripPayload;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Test utils for data source tests.
+ */
+public class DataSourceTestUtils {
+
+  public static Option convertToString(HoodieRecord record) {
+try {
+  String str = ((TestRawTripPayload) record.getData()).getJsonData();
+  str = "{" + str.substring(str.indexOf("\"timestamp\":"));
+  // Remove the last } bracket
+  str = str.substring(0, str.length() - 1);
+  return Option.of(str + ", \"partition\": \"" + record.getPartitionPath() 
+ "\"}");
+} catch (IOException e) {
+  return Option.empty();
+}
+  }
+
+  public static List convertToStringList(List records) {
+return 
records.stream().map(DataSourceTestUtils::convertToString).filter(Option::isPresent).map(Option::get)
+.collect(Collectors.toList());
+  }
+
+  public static List convertKeysToStringList(List keys) {
 
 Review comment:
   DataSourceTestUtils actually exists under the hudi-spark test module, but we can't access it from here. I think we can move it to the hudi-common package in the future to reuse the code as much as possible.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi Dataset Snapshot Exporter

2020-03-02 Thread GitBox
OpenOpened commented on a change in pull request #1360: [HUDI-344][RFC-09] Hudi 
Dataset Snapshot Exporter
URL: https://github.com/apache/incubator-hudi/pull/1360#discussion_r386257479
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 ##
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.TableFileSystemView;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameWriter;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * Export the latest records of Hudi dataset to a set of external files (e.g., 
plain parquet files).
+ */
+
+public class HoodieSnapshotExporter {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSnapshotExporter.class);
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--source-base-path"}, description = "Base path for 
the source Hudi dataset to be snapshotted", required = true)
+String sourceBasePath = null;
+
+@Parameter(names = {"--target-base-path"}, description = "Base path for 
the target output files (snapshots)", required = true)
+String targetOutputPath = null;
+
+@Parameter(names = {"--snapshot-prefix"}, description = "Snapshot prefix 
or directory under the target base path in order to segregate different 
snapshots")
+String snapshotPrefix;
+
+@Parameter(names = {"--output-format"}, description = "e.g. Hudi or 
Parquet", required = true)
+String outputFormat;
+
+@Parameter(names = {"--output-partition-field"}, description = "A field to 
be used by Spark repartitioning")
+String outputPartitionField;
+  }
+
+  public void export(SparkSession spark, Config cfg) throws IOException {
+JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+FileSystem fs = FSUtils.getFs(cfg.sourceBasePath, 
jsc.hadoopConfiguration());
+
+final SerializableConfiguration serConf = new 
SerializableConfiguration(jsc.hadoopConfiguration());
+final HoodieTableMetaClient tableMetadata = new 
HoodieTableMetaClient(fs.getConf(), cfg.sourceBasePath);
+final TableFileSystemView.BaseFileOnlyView fsView = new 
HoodieTableFileSystemView(tableMetadata,
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
+// Get the latest commit
+Option latestCommit =
+
tableMetadata.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant();
+if (!latestCommit.isPresent()) {
+  LOG.warn("No commits present. Nothing to snapshot");
+  return;
+}
+final String latestCommitTimestamp = latestCommit.get().getTimestamp();
+LOG.info(String.format("Starting to snapshot latest version files which 
are also no-late-than %s.",
+latestCommitTimestamp));
+
+List partitions = FSUtils.getAllPartitionPaths(fs, 
cfg.sourceBasePath, false);
+if (partitions.size() > 0) {
+  List dataFiles = new ArrayList<>();
+
+  

[GitHub] [incubator-hudi] codecov-io commented on issue #1151: [HUDI-476] Add hudi-examples module

2020-03-02 Thread GitBox
codecov-io commented on issue #1151: [HUDI-476] Add hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#issuecomment-593277561
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1151?src=pr=h1) 
Report
   > Merging 
[#1151](https://codecov.io/gh/apache/incubator-hudi/pull/1151?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/2d040145810b8b14c59c5882f9115698351039d1?src=pr=desc)
 will **decrease** coverage by `66.45%`.
   > The diff coverage is `0%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1151/graphs/tree.svg?width=650=VTTXabwbs2=150=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1151?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##           master    #1151        +/-   ##
   =============================================
   - Coverage     67.09%    0.64%    -66.46%
   + Complexity      223        2       -221
   =============================================
     Files           333      287        -46
     Lines         16216    14320      -1896
     Branches       1659     1465       -194
   =============================================
   - Hits          10880       92     -10788
   - Misses         4598    14225      +9627
   + Partials        738        3       -735
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1151?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...rg/apache/hudi/common/model/HoodieAvroPayload.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUF2cm9QYXlsb2FkLmphdmE=)
 | `0% <0%> (-84.62%)` | `0 <0> (ø)` | |
   | 
[...che/hudi/common/table/timeline/dto/LogFileDTO.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9Mb2dGaWxlRFRPLmphdmE=)
 | `0% <0%> (-100%)` | `0% <0%> (ø)` | |
   | 
[...apache/hudi/common/model/HoodieDeltaWriteStat.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZURlbHRhV3JpdGVTdGF0LmphdmE=)
 | `0% <0%> (-100%)` | `0% <0%> (ø)` | |
   | 
[...org/apache/hudi/common/model/HoodieFileFormat.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUZpbGVGb3JtYXQuamF2YQ==)
 | `0% <0%> (-100%)` | `0% <0%> (ø)` | |
   | 
[...g/apache/hudi/execution/BulkInsertMapFunction.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhlY3V0aW9uL0J1bGtJbnNlcnRNYXBGdW5jdGlvbi5qYXZh)
 | `0% <0%> (-100%)` | `0% <0%> (ø)` | |
   | 
[.../common/util/queue/IteratorBasedQueueProducer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvcXVldWUvSXRlcmF0b3JCYXNlZFF1ZXVlUHJvZHVjZXIuamF2YQ==)
 | `0% <0%> (-100%)` | `0% <0%> (ø)` | |
   | 
[...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=)
 | `0% <0%> (-100%)` | `0% <0%> (ø)` | |
   | 
[...e/hudi/common/table/timeline/dto/FileGroupDTO.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9GaWxlR3JvdXBEVE8uamF2YQ==)
 | `0% <0%> (-100%)` | `0% <0%> (ø)` | |
   | 
[...apache/hudi/timeline/service/handlers/Handler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvSGFuZGxlci5qYXZh)
 | `0% <0%> (-100%)` | `0% <0%> (ø)` | |
   | 
[.../common/util/queue/FunctionBasedQueueProducer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvcXVldWUvRnVuY3Rpb25CYXNlZFF1ZXVlUHJvZHVjZXIuamF2YQ==)
 | `0% <0%> (-100%)` | `0% <0%> (ø)` | |
   | ... and [287 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1151/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1151?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1151?src=pr=footer).
 Last update