[GitHub] [incubator-hudi] bhasudha commented on issue #1653: [SUPPORT]: Hudi Deltastreamer OffsetoutofRange Exception reading from Kafka topic (12 partitions)

2020-05-21 Thread GitBox


bhasudha commented on issue #1653:
URL: https://github.com/apache/incubator-hudi/issues/1653#issuecomment-632516157


   @prashanthpdesai  sorry, I assumed you were referring to your own 
checkpointing. Your understanding is right. Checkpoints are written to hoodie 
commit metadata after each round of a DeltaStreamer run.
   
   The exception you described seems possible if the offsets supplied are 
larger or smaller than what the server has for a given partition. I suspect 
this could be because the retention policy of the Kafka topic kicked in. It 
should be easy to check. A command like 
```kafka-topics.sh --bootstrap-server server_ip:9092 --describe --topic 
topic_name``` will print the topic config. Can we start to debug from 
there?
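
   The range condition described above can be sketched in a few lines; this is a self-contained illustration (the offsets used are made up, not taken from the actual topic):

```java
// Sketch of the out-of-range condition described above: a fetch offset is
// valid only if it lies within [earliest, latest] for the partition.
// Retention moving `earliest` forward invalidates old checkpoints.
public class OffsetRangeCheck {
    static boolean inRange(long checkpointOffset, long earliest, long latest) {
        return checkpointOffset >= earliest && checkpointOffset <= latest;
    }

    public static void main(String[] args) {
        // After retention kicks in, `earliest` may have advanced past the checkpoint.
        System.out.println(inRange(0, 1500, 9000));    // prints false -> OffsetOutOfRange
        System.out.println(inRange(2000, 1500, 9000)); // prints true
    }
}
```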



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-21 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-905:
---

Assignee: Yanjia Gary Li

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>
> Hudi Spark Datasource incremental view currently is using 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down the filter.
> If we want to use Spark predicate pushdown in a native way, we need to 
> implement PrunedFilteredScan for the Hudi Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #285

2020-05-21 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.35 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
or

[GitHub] [incubator-hudi] hddong commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

2020-05-21 Thread GitBox


hddong commented on a change in pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#discussion_r428056297



##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
   }
 
   private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
-      String repairedOutputPath, String basePath, String dryRun) {
+      String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
     DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
-        FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
-    job.fixDuplicates(Boolean.parseBoolean(dryRun));
+        FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+    job.fixDuplicates(dryRun);
     return 0;
   }
 
+  private static Enumeration.Value getDedupeType(String type) {
+    switch (type) {
+      case "insertType":
+        return DeDupeType.insertType();
+      case "updateType":
+        return DeDupeType.updateType();
+      case "upsertType":
+        return DeDupeType.upsertType();
+      default:
+        throw new IllegalArgumentException("Please provide valid dedupe type!");
+    }
+  }
+

Review comment:
   Can use `DeDupeType.withName("insertType")` instead.
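
   For a Scala `Enumeration` like Hudi's `DeDupeType`, `withName` performs exactly this string-to-value lookup and throws on unknown names. A self-contained Java analogue using `Enum.valueOf` (the enum here is a stand-in for illustration, not Hudi's actual class):

```java
public class DedupeTypeLookup {
    // Stand-in for Hudi's DeDupeType (the real one is a Scala Enumeration).
    enum DeDupeType { insertType, updateType, upsertType }

    // Equivalent of DeDupeType.withName(type): one lookup replaces the switch.
    static DeDupeType getDedupeType(String type) {
        try {
            return DeDupeType.valueOf(type);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Please provide valid dedupe type!", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(getDedupeType("upsertType")); // prints upsertType
    }
}
```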









[GitHub] [incubator-hudi] wangxianghu commented on pull request #1652: [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug

2020-05-21 Thread GitBox


wangxianghu commented on pull request #1652:
URL: https://github.com/apache/incubator-hudi/pull/1652#issuecomment-632445691


   Hi @UZi5136225, it may be better to add a reminder to the description 
of `--source-limit`.







[GitHub] [incubator-hudi] afeldman1 commented on issue #933: Support for multiple level partitioning in Hudi

2020-05-21 Thread GitBox


afeldman1 commented on issue #933:
URL: https://github.com/apache/incubator-hudi/issues/933#issuecomment-632436139


   @vinothchandar Yes, I can do that. When you say the "`writing_data` page", 
are you referring to adding them to the wiki, to the DataSourceWriteOptions 
object, or to both? I'm not seeing a `writing_data` page on the wiki link you 
posted above.







[jira] [Created] (HUDI-919) Run hudi-cli ITTest in docker.

2020-05-21 Thread hong dongdong (Jira)
hong dongdong created HUDI-919:
--

 Summary: Run hudi-cli ITTest in docker.
 Key: HUDI-919
 URL: https://issues.apache.org/jira/browse/HUDI-919
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: CLI
Reporter: hong dongdong
Assignee: hong dongdong








[GitHub] [incubator-hudi] hddong commented on pull request #1574: [HUDI-701]Add unit test for HDFSParquetImportCommand

2020-05-21 Thread GitBox


hddong commented on pull request #1574:
URL: https://github.com/apache/incubator-hudi/pull/1574#issuecomment-632434767


   @garyli1019 : I have tried Docker before; it usually uses `execStartCmd` to 
exec commands directly.
   But for hudi-cli, we need to exec commands in interactive mode. I will try it 
again later, but I need some time. If you have any solution or suggestion, 
please let me know.
   Back to the failed test: it is due to a failed Spark job. I suggest you take 
a look at the detailed log thrown by the Spark job, above the assert log.
   There are two necessary conditions: 1. a runnable Spark in the local env; 
2. SPARK_HOME.
   Additional notice: if Spark uses the default config, run `mkdir 
/tmp/spark-events/` to create a tmp directory, and do not use a brew 
installation of Spark.
   
   If it still fails, please let me know, here or by email.







[GitHub] [incubator-hudi] bvaradar commented on issue #1649: [SUPPORT] Not more than one spark.sql is working on Hoodie Parquet format

2020-05-21 Thread GitBox


bvaradar commented on issue #1649:
URL: https://github.com/apache/incubator-hudi/issues/1649#issuecomment-632433768


   Does the path s3a://gat-datalake-refined-dev/reports/player/dat/2020/04/23 
actually exist? Have you enabled the eventual consistency guard? 
   
   
   







[GitHub] [incubator-hudi] sungjuly commented on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


sungjuly commented on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632430324


   At Udemy ([https://www.udemy.com/]) we're using Apache Hudi (0.5.0) on AWS 
EMR (5.29.0) to ingest MySQL change data capture. Thank you for open-sourcing a 
great project. Congratulations on becoming a TLP. 







[GitHub] [incubator-hudi] bvaradar commented on issue #1646: [SUPPORT]: Unable to query Hive table through Spark SQL

2020-05-21 Thread GitBox


bvaradar commented on issue #1646:
URL: https://github.com/apache/incubator-hudi/issues/1646#issuecomment-632430489


   You can take a look at 
https://hudi.apache.org/docs/docker_demo.html#step-4-b-run-spark-sql-queries 
for a working demo Spark SQL setup. There is no need to include 
hudi-hadoop-mr-bundle here. You can also double-check that the class is present 
in hudi-spark-bundle. 
   







[GitHub] [incubator-hudi] wangxianghu removed a comment on pull request #1652: [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug

2020-05-21 Thread GitBox


wangxianghu removed a comment on pull request #1652:
URL: https://github.com/apache/incubator-hudi/pull/1652#issuecomment-632429037


   Hi @garyli1019, @UZi5136225 means that when the configured "sourceLimit" is 
less than the number of Kafka partitions, KafkaOffsetGen will consume no data







[GitHub] [incubator-hudi] UZi5136225 commented on pull request #1652: [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug

2020-05-21 Thread GitBox


UZi5136225 commented on pull request #1652:
URL: https://github.com/apache/incubator-hudi/pull/1652#issuecomment-632430145


   The specific reason is the upward (widening) type conversion
   @garyli1019 







[GitHub] [incubator-hudi] wangxianghu edited a comment on pull request #1652: [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug

2020-05-21 Thread GitBox


wangxianghu edited a comment on pull request #1652:
URL: https://github.com/apache/incubator-hudi/pull/1652#issuecomment-632429037


   Hi @garyli1019, @UZi5136225 means that when the configured "sourceLimit" is 
less than the number of Kafka partitions, KafkaOffsetGen will consume no data







[GitHub] [incubator-hudi] wangxianghu edited a comment on pull request #1652: [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug

2020-05-21 Thread GitBox


wangxianghu edited a comment on pull request #1652:
URL: https://github.com/apache/incubator-hudi/pull/1652#issuecomment-632429037


   Hi @garyli1019, @UZi5136225 means that when the configured "sourceLimit" is 
less than the number of Kafka partitions, KafkaOffsetGen will consume no data.







[jira] [Updated] (HUDI-918) Fix kafkaOffsetGen can not read kafka data bug

2020-05-21 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-918:
---
Description: 
When the sourceLimit is less than the number of Kafka partitions, Hudi cannot 
get the data.

Steps to reproduce:

1. Use DeltaStreamer to consume data from Kafka
2. Set the value of sourceLimit to be less than the number of Kafka partitions
3. INFO DeltaSync:313 - No new data, source checkpoint has not changed. Nothing 
to commit. Old checkpoint

  was:When the sourceLimit is less than the number of Kafka partitions, Hudi 
cannot get the data


> Fix kafkaOffsetGen can not read kafka data bug
> --
>
> Key: HUDI-918
> URL: https://issues.apache.org/jira/browse/HUDI-918
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When the sourceLimit is less than the number of Kafka partitions, Hudi cannot 
> get the data.
> Steps to reproduce:
> 1. Use DeltaStreamer to consume data from Kafka
> 2. Set the value of sourceLimit to be less than the number of Kafka partitions
> 3. INFO DeltaSync:313 - No new data, source checkpoint has not changed. 
> Nothing to commit. Old checkpoint
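
One plausible mechanism for this symptom (an assumption for illustration, not something the report confirms) is integer division when splitting sourceLimit evenly across partitions:

```java
public class SourceLimitSplit {
    // If sourceLimit is divided evenly across partitions with integer
    // arithmetic, every partition gets 0 events once sourceLimit < partitions,
    // so the generator requests no data at all.
    static long eventsPerPartition(long sourceLimit, int numPartitions) {
        return sourceLimit / numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(eventsPerPartition(10, 12)); // prints 0 -> "No new data"
        System.out.println(eventsPerPartition(24, 12)); // prints 2
    }
}
```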





[GitHub] [incubator-hudi] UZi5136225 commented on pull request #1652: [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug

2020-05-21 Thread GitBox


UZi5136225 commented on pull request #1652:
URL: https://github.com/apache/incubator-hudi/pull/1652#issuecomment-632429031


   Steps to reproduce:
   
   1. Use DeltaStreamer to consume data from Kafka
   2. Set the value of sourceLimit to be less than the number of Kafka partitions
   3. INFO DeltaSync:313 - No new data, source checkpoint has not changed. 
Nothing to commit. Old checkpoint







[GitHub] [incubator-hudi] bvaradar closed issue #1641: [SUPPORT] Failed to merge old record into new file for key xxx from old file 123.parquet to new file 456.parquet

2020-05-21 Thread GitBox


bvaradar closed issue #1641:
URL: https://github.com/apache/incubator-hudi/issues/1641


   







[GitHub] [incubator-hudi] wangxianghu commented on pull request #1652: [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug

2020-05-21 Thread GitBox


wangxianghu commented on pull request #1652:
URL: https://github.com/apache/incubator-hudi/pull/1652#issuecomment-632429037


   Hi @garyli1019, @UZi5136225 means that when the configured "sourceLimit" is 
less than the number of Kafka partitions, KafkaOffsetGen will consume no data.







[jira] [Updated] (HUDI-918) Fix kafkaOffsetGen can not read kafka data bug

2020-05-21 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-918:
-
Summary: Fix kafkaOffsetGen can not read kafka data bug  (was: 
deltastreamer bug is  no new data)

> Fix kafkaOffsetGen can not read kafka data bug
> --
>
> Key: HUDI-918
> URL: https://issues.apache.org/jira/browse/HUDI-918
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When the sourceLimit is less than the number of Kafka partitions, Hudi cannot 
> get the data.





[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-21 Thread GitBox


codecov-commenter edited a comment on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-632410484


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=h1) 
Report
   > Merging 
[#1602](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/802d16c8c9793156ef7fef0c59088040800fe025&el=desc)
 will **increase** coverage by `0.01%`.
   > The diff coverage is `50.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1602/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1602      +/-   ##
   ============================================
   + Coverage     18.33%   18.35%   +0.01%
     Complexity      855      855
   ============================================
     Files           344      344
     Lines         15167    15149      -18
     Branches       1512     1509       -3
   ============================================
   - Hits           2781     2780       -1
   + Misses        12033    12016      -17
     Partials        353      353
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `7.14% <ø> (+1.39%)` | `4.00 <0.00> (ø)` | |
   | 
[...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh)
 | `55.07% <50.00%> (-0.33%)` | `15.00 <1.00> (ø)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=footer).
 Last update 
[802d16c...df6e291](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [incubator-hudi] codecov-commenter commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-21 Thread GitBox


codecov-commenter commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-632410484


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=h1) 
Report
   > Merging 
[#1602](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/802d16c8c9793156ef7fef0c59088040800fe025&el=desc)
 will **increase** coverage by `0.01%`.
   > The diff coverage is `50.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1602/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1602      +/-   ##
   ============================================
   + Coverage     18.33%   18.35%   +0.01%
     Complexity      855      855
   ============================================
     Files           344      344
     Lines         15167    15149      -18
     Branches       1512     1509       -3
   ============================================
   - Hits           2781     2780       -1
   + Misses        12033    12016      -17
     Partials        353      353
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `7.14% <ø> (+1.39%)` | `4.00 <0.00> (ø)` | |
   | 
[...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/incubator-hudi/pull/1602/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh)
 | `55.07% <50.00%> (-0.33%)` | `15.00 <1.00> (ø)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=footer).
 Last update 
[802d16c...df6e291](https://codecov.io/gh/apache/incubator-hudi/pull/1602?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [incubator-hudi] prashanthpdesai edited a comment on issue #1653: [SUPPORT]: Hudi Deltastreamer OffsetoutofRange Exception reading from Kafka topic (12 partitions)

2020-05-21 Thread GitBox


prashanthpdesai edited a comment on issue #1653:
URL: https://github.com/apache/incubator-hudi/issues/1653#issuecomment-632406451


   @bhasudha : I presume that the offset for each partition will be stored in 
the commit metadata in the HDFS path. Since it was the first run, we used 
auto.offset.reset=earliest to consume from the beginning. For an incremental 
run, the offset will be picked up automatically from the HDFS path, right? 
Please correct me if my understanding is wrong. 
   Command we used to run in the Kubernetes pod:
   
   spark-submit "--props",
   System.getenv("KAFKA_PROP_FILE"),
   "--schemaprovider-class", 
"org.apache.hudi.utilities.schema.SchemaRegistryProvider", 
   "--source-class","org.apache.hudi.utilities.sources.AvroKafkaSource", 
   "--target-base-path",System.getenv("HUDI_OUTPUT_LOC"), 
   "--target-table", "mcm.hudi.deltacow", 
   "--table-type", "COPY_ON_WRITE", 
   "--checkpoint", 
   "--commit-on-errors", 
   "--op", "UPSERT",
   "--source-ordering-field", "modifiedDt".
   
   For the first run,
   #kafka prop file
   hoodie.deltastreamer.source.kafka.topic=enriched-output-changelog
   hoodie.auto.commit=false
   enable.auto.commit=false
   auto.offset.reset=earliest 
   
   we did try both earliest and latest as well, but we ended up with the same 
exception. 







[GitHub] [incubator-hudi] prashanthpdesai edited a comment on issue #1653: [SUPPORT]: Hudi Deltastreamer OffsetoutofRange Exception reading from Kafka topic (12 partitions)

2020-05-21 Thread GitBox


prashanthpdesai edited a comment on issue #1653:
URL: https://github.com/apache/incubator-hudi/issues/1653#issuecomment-632406451


   @bhasudha : I presume that the offset for each partition will be stored in 
the commit metadata in the HDFS path. Since it was the first run, we used 
auto.offset.reset=earliest to consume from the beginning. For an incremental 
run, the offset will be picked up automatically from the HDFS path, right? 
Please correct me if my understanding is wrong. 







[GitHub] [incubator-hudi] prashanthpdesai commented on issue #1653: [SUPPORT]: Hudi Deltastreamer OffsetoutofRange Exception reading from Kafka topic (12 partitions)

2020-05-21 Thread GitBox


prashanthpdesai commented on issue #1653:
URL: https://github.com/apache/incubator-hudi/issues/1653#issuecomment-632406451


   @bhasudha : I presume that the offset for each partition will be stored in 
the commit metadata in the HDFS path. Since it was the first run, we used 
auto.offset.reset=earliest to consume from the beginning. For an incremental 
run, the offset will be picked up automatically, right? Please correct me if my 
understanding is wrong. 







[GitHub] [incubator-hudi] bhasudha commented on issue #1653: [SUPPORT]: Hudi Deltastreamer OffsetoutofRange Exception reading from Kafka topic (12 partitions)

2020-05-21 Thread GitBox


bhasudha commented on issue #1653:
URL: https://github.com/apache/incubator-hudi/issues/1653#issuecomment-632405067


   @prashanthpdesai  quick questions.
   
   Where do you checkpoint the offsets between mini batches, and how do you 
configure that in DeltaStreamer? 
   
   Do you have the offset that you used to run this batch (which failed with 
out of range)? 
   If yes, can you check whether that msg offset is indeed present in the Kafka 
topic (to rule out the possibility of your Kafka retention policy having 
deleted those msgs)? You can get the smallest offset available for a topic 
partition by running this Kafka command line:
   `bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 
 --topic  --time -2
   `







[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-21 Thread GitBox


garyli1019 commented on a change in pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#discussion_r428971840



##
File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
##
@@ -70,7 +70,9 @@
 public class HoodieTestDataGenerator {
 
   // based on examination of sample file, the schema produces the following 
per record size
-  public static final int SIZE_PER_RECORD = 50 * 1024;
+  public static final int SIZE_PER_RECORD = (int) (1.2 * 1024);

Review comment:
   thanks for reviewing. comments addressed. 









[GitHub] [incubator-hudi] prashanthpdesai opened a new issue #1653: [SUPPORT]: Hudi Deltastreamer OffsetoutofRange Exception reading from Kafka topic (12 partitions)

2020-05-21 Thread GitBox


prashanthpdesai opened a new issue #1653:
URL: https://github.com/apache/incubator-hudi/issues/1653


   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   Yes 
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
range with no configured reset policy for partitions: 
{enriched-output-changelog-0=0}
   
   A clear and concise description of the problem.
   Running Deltastreamer to consume from a partitioned (12) compacted topic and 
persisting the data in the MapR Platform. While consuming we are getting this 
exception of the offset being out of range, and our pod is being terminated. 
   We did try running in continuous mode with the MOR storage type initially, did 
face the offset exception, and were able to get it running again by cleaning 
up the Kafka topic and restarting our pod.
   Our requirement changed slightly so we don't need to run it continuously; we are 
scheduling a mini batch (every 2 hours) and are now facing the same offset exception. 
Downstream consumers don't need to consume in real time, so we switched to 
COPY_ON_WRITE and non-continuous mode with checkpointing in place to run in mini 
batches. 
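
   For debugging, the checkpoint Hudi stores in commit metadata can be compared with what the broker still retains. Below is a sketch of splitting a `topic,partition:offset,...`-style checkpoint string into per-partition offsets; this layout is an assumption (check your commit metadata for the exact format), and the class/method names are illustrative, not Hudi APIs:

   ```java
import java.util.HashMap;
import java.util.Map;

// Sketch assuming a "topic,partition:offset,partition:offset" checkpoint
// layout; illustrative only, not a Hudi API.
public class CheckpointOffsets {

  static Map<Integer, Long> parse(String checkpoint) {
    String[] parts = checkpoint.split(",");
    Map<Integer, Long> offsets = new HashMap<>();
    // parts[0] is the topic name; the remaining entries are partition:offset pairs.
    for (int i = 1; i < parts.length; i++) {
      String[] pair = parts[i].split(":");
      offsets.put(Integer.parseInt(pair[0]), Long.parseLong(pair[1]));
    }
    return offsets;
  }

  public static void main(String[] args) {
    Map<Integer, Long> offsets = parse("enriched-output-changelog,0:120,1:95");
    System.out.println(offsets);
  }
}
   ```

   Each parsed offset can then be checked against the earliest retained offset reported by the broker for that partition.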
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   Files should be cut at roughly equal sizes, ideally the default target file 
size (128 MB), and persisted into our HDFS location.
   
   **Environment Description**
   
   * Hudi version :
   0.5.2
   
   * Spark version :
2.2.1
   
   * Hive version :
   
   * Hadoop version :
   MapR 6.0.1 and Hadoop 2.7 
   
   * Storage (HDFS/S3/GCS..) :
   HDFS
   
   * Running on Docker? (yes/no) :
   yes 
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   







[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1616: [HUDI-786] Fixing read beyond inline length in InlineFS

2020-05-21 Thread GitBox


nsivabalan commented on a change in pull request #1616:
URL: https://github.com/apache/incubator-hudi/pull/1616#discussion_r428970158



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/fs/inline/InLineFsDataInputStream.java
##
@@ -56,24 +56,29 @@ public long getPos() throws IOException {
 
   @Override
   public int read(long position, byte[] buffer, int offset, int length) throws 
IOException {
+if ((length - offset) > this.length) {
+  throw new IOException("Attempting to read past inline content");
+}
 return outerStream.read(startOffset + position, buffer, offset, length);
   }
 
   @Override
   public void readFully(long position, byte[] buffer, int offset, int length) 
throws IOException {
+if ((length - offset) > this.length) {
+  throw new IOException("Attempting to read past inline content");
+}
 outerStream.readFully(startOffset + position, buffer, offset, length);
   }
 
   @Override
   public void readFully(long position, byte[] buffer)
   throws IOException {
-outerStream.readFully(startOffset + position, buffer, 0, buffer.length);
+readFully(position, buffer, 0, buffer.length);
   }
 
   @Override
   public boolean seekToNewSource(long targetPos) throws IOException {

Review comment:
   @bvaradar ping. 
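
   The guard added in the diff above can be exercised in isolation. A minimal standalone sketch follows; the real code throws a checked `IOException`, while this sketch uses an unchecked exception to stay self-contained:

   ```java
// Standalone sketch of the bounds guard from the diff: reject any positional
// read whose requested span exceeds the inline content length, mirroring the
// (length - offset) > this.length check. Illustrative only, not the actual
// InLineFsDataInputStream.
public class InlineBoundsCheck {

  static void checkRead(int offset, int length, int inlineLength) {
    if ((length - offset) > inlineLength) {
      throw new IllegalStateException("Attempting to read past inline content");
    }
  }

  public static void main(String[] args) {
    checkRead(0, 64, 128); // within bounds: no exception
    try {
      checkRead(0, 256, 128); // asks for more bytes than are inline
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
   ```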









[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1647: [HUDI-867]: fixed IllegalArgumentException from graphite metrics in deltaStreamer continuous mode

2020-05-21 Thread GitBox


nsivabalan commented on a change in pull request #1647:
URL: https://github.com/apache/incubator-hudi/pull/1647#discussion_r428967198



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
##
@@ -416,10 +425,12 @@ public DeltaSync getDeltaSync() {
   jssc.setLocalProperty("spark.scheduler.pool", 
SchedulerConfGenerator.DELTASYNC_POOL_NAME);
 }
 try {
+  int iteration = 1;
   while (!isShutdownRequested()) {
 try {
   long start = System.currentTimeMillis();
-  Option scheduledCompactionInstant = deltaSync.syncOnce();
+  HoodieMetrics.setTableName(cfg.metricsTableName + "_" + 
iteration);

Review comment:
   sorry, I don't quite get why we need to set the table name here. The next line 
will in turn rely on the arg passed for the table name. So I don't really understand 
why we need the static fix (i.e. setTableName). From the diff, I see that 
cfg.tableName is passed into DeltaSync.syncOnce(tblName) and 
HoodieDeltaStreamerMetrics(HoodieWriteConfig tableName). Can you help me 
understand the case where the static set method for the table name is required? 









[GitHub] [incubator-hudi] nsivabalan commented on pull request #1648: [HUDI-916]: added support for multiple input formats in TimestampBasedKeyGenerator

2020-05-21 Thread GitBox


nsivabalan commented on pull request #1648:
URL: https://github.com/apache/incubator-hudi/pull/1648#issuecomment-632392315


   @pratyakshsharma : was this patch already reviewed as part of[ 
#1597](https://github.com/apache/incubator-hudi/pull/1597)? or do I need to 
review it from scratch 







[jira] [Commented] (HUDI-767) Support transformation when export to Hudi

2020-05-21 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113612#comment-17113612
 ] 

sivabalan narayanan commented on HUDI-767:
--

done. 

> Support transformation when export to Hudi
> --
>
> Key: HUDI-767
> URL: https://issues.apache.org/jira/browse/HUDI-767
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.6.1
>
>
> Main logic described in 
> https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
> In HoodieSnapshotExporter, we could extend the feature to include 
> transformation when --output-format hudi, using a custom Transformer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nsivabalan commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-21 Thread GitBox


nsivabalan commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-632389785


   @garyli1019 : thanks for clarifying. that was helpful. 







[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-21 Thread GitBox


nsivabalan commented on a change in pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#discussion_r428959265



##
File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
##
@@ -70,7 +70,9 @@
 public class HoodieTestDataGenerator {
 
   // based on examination of sample file, the schema produces the following 
per record size
-  public static final int SIZE_PER_RECORD = 50 * 1024;
+  public static final int SIZE_PER_RECORD = (int) (1.2 * 1024);

Review comment:
   since you are at it, can we fix the naming? Can you add a suffix with the units 
for the size, i.e. whether it's bits or bytes? 

##
File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
##
@@ -70,7 +70,9 @@
 public class HoodieTestDataGenerator {
 
   // based on examination of sample file, the schema produces the following 
per record size
-  public static final int SIZE_PER_RECORD = 50 * 1024;
+  public static final int SIZE_PER_RECORD = (int) (1.2 * 1024);

Review comment:
   same for BLOOM_FILTER_SIZE









[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1433: [HUDI-728]: Implement custom key generator

2020-05-21 Thread GitBox


nsivabalan commented on a change in pull request #1433:
URL: https://github.com/apache/incubator-hudi/pull/1433#discussion_r428954499



##
File path: 
hudi-spark/src/test/java/org/apache/hudi/keygen/TestSimpleKeyGenerator.java
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+public class TestSimpleKeyGenerator extends TestKeyGeneratorUtilities {
+
+  private TypedProperties getCommonProps() {
+TypedProperties properties = new TypedProperties();
+properties.put(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), 
"_row_key");
+properties.put(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY(), 
"true");
+return properties;
+  }
+
+  private TypedProperties getPropertiesWithoutPartitionPathProp() {
+return getCommonProps();
+  }
+
+  private TypedProperties getPropertiesWithoutRecordKeyProp() {
+TypedProperties properties = new TypedProperties();
+properties.put(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), 
"timestamp");
+return properties;
+  }
+
+  private TypedProperties getWrongRecordKeyFieldProps() {
+TypedProperties properties = new TypedProperties();
+properties.put(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), 
"timestamp");
+properties.put(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), 
"_wrong_key");
+return properties;
+  }
+
+  private TypedProperties getComplexRecordKeyProp() {
+TypedProperties properties = new TypedProperties();
+properties.put(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), 
"timestamp");
+properties.put(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), 
"_row_key,pii_col");
+return properties;
+  }
+
+  private TypedProperties getProps() {
+TypedProperties properties = getCommonProps();
+properties.put(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), 
"timestamp");
+return properties;
+  }
+
+  @Test
+  public void testNullPartitionPathFields() {
+Assertions.assertThrows(IllegalArgumentException.class, () -> new 
SimpleKeyGenerator(getPropertiesWithoutPartitionPathProp()));
+  }
+
+  @Test
+  public void testNullRecordKeyFields() {
+Assertions.assertThrows(IllegalArgumentException.class, () -> new 
SimpleKeyGenerator(getPropertiesWithoutRecordKeyProp()));
+  }
+
+  @Test
+  public void testWrongRecordKeyField() {
+SimpleKeyGenerator keyGenerator = new 
SimpleKeyGenerator(getWrongRecordKeyFieldProps());
+Assertions.assertThrows(HoodieKeyException.class, () -> 
keyGenerator.getRecordKey(getRecord()));
+  }
+
+  @Test
+  public void testComplexRecordKeyField() {
+SimpleKeyGenerator keyGenerator = new 
SimpleKeyGenerator(getComplexRecordKeyProp());
+Assertions.assertThrows(HoodieKeyException.class, () -> 
keyGenerator.getRecordKey(getRecord()));
+  }
+
+  @Test
+  public void testHappyFlow() {
+SimpleKeyGenerator keyGenerator = new SimpleKeyGenerator(getProps());
+HoodieKey key = keyGenerator.getKey(getRecord());
+Assertions.assertEquals(key.getRecordKey(), "key1");
+Assertions.assertEquals(key.getPartitionPath(), "timestamp=4357686");
+  }
+}

Review comment:
   here as well

##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
##
@@ -154,4 +153,4 @@ private long convertLongTimeToMillis(Long partitionVal) {
 
 return MILLISECONDS.convert(partitionVal, timeUnit);
   }
-}
+}

Review comment:
   need a new line

##
File path: 
hudi-spark/src/test/java/org/apache/hudi/keygen/TestCustomKeyGenerator.java
##
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "Licen

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1433: [HUDI-728]: Implement custom key generator

2020-05-21 Thread GitBox


nsivabalan commented on a change in pull request #1433:
URL: https://github.com/apache/incubator-hudi/pull/1433#discussion_r428953503



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/CustomKeyGenerator.java
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.config.TypedProperties;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.exception.HoodieDeltaStreamerException;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * This is a generic implementation of KeyGenerator where users can configure 
record key as a single field or a combination of fields.
+ * Similarly partition path can be configured to have multiple fields or only 
one field. This class expects value for prop
+ * "hoodie.datasource.write.partitionpath.field" in a specific format. For 
example:
+ *
+ * properties.put("hoodie.datasource.write.partitionpath.field", 
"field1:PartitionKeyType1,field2:PartitionKeyType2").
+ *
+ * The complete partition path is created as / and so on.
+ *
+ * Few points to consider:
+ * 1. If you want to customise some partition path field on a timestamp basis, 
you can use field1:timestampBased
+ * 2. If you simply want to have the value of your configured field in the 
partition path, use field1:simple
+ * 3. If you want your table to be non partitioned, simply leave it as blank.
+ *
+ * RecordKey is internally generated using either SimpleKeyGenerator or 
ComplexKeyGenerator.
+ */
+public class CustomKeyGenerator extends KeyGenerator {
+
+  protected final List recordKeyFields;
+  protected final List partitionPathFields;
+  protected final TypedProperties properties;
+  private static final String DEFAULT_PARTITION_PATH_SEPARATOR = "/";
+  private static final String SPLIT_REGEX = ":";
+
+  /**
+   * Used as a part of config in CustomKeyGenerator.java.
+   */
+  public enum PartitionKeyType {
+SIMPLE, TIMESTAMP
+  }
+
+  public CustomKeyGenerator(TypedProperties props) {
+super(props);
+this.properties = props;
+this.recordKeyFields = 
Arrays.stream(props.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(",")).map(String::trim).collect(Collectors.toList());
+this.partitionPathFields =
+  
Arrays.stream(props.getString(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY()).split(",")).map(String::trim).collect(Collectors.toList());
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+//call function to get the record key
+String recordKey = getRecordKey(record);
+//call function to get the partition key based on the type for that 
partition path field
+String partitionPath = getPartitionPath(record);
+return new HoodieKey(recordKey, partitionPath);
+  }
+
+  public String getPartitionPath(GenericRecord record) {
+if (partitionPathFields == null) {
+  throw new HoodieKeyException("Unable to find field names for partition 
path in cfg");
+}
+
+String partitionPathField;
+StringBuilder partitionPath = new StringBuilder();
+
+//Corresponds to no partition case
+if (partitionPathFields.size() == 1 && 
partitionPathFields.get(0).isEmpty()) {

Review comment:
   ok, my bad. Just checked the trim docs; it will just trim leading and 
trailing whitespace, if any. 
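
   The `field:type` partition-path configuration described in the CustomKeyGenerator javadoc above can be sketched as follows. This is a simplified illustration: the real generator extracts values from an Avro record and delegates TIMESTAMP fields to TimestampBasedKeyGenerator, both of which this sketch omits, and the class/method names here are hypothetical:

   ```java
import java.util.StringJoiner;

// Simplified sketch of turning a "field1:SIMPLE,field2:SIMPLE" style config
// into a partition path. Values are passed in directly instead of being read
// from a record; TIMESTAMP handling is omitted.
public class PartitionPathSketch {

  static String buildPath(String[] fieldSpecs, String[] values, boolean hiveStyle) {
    StringJoiner path = new StringJoiner("/");
    for (int i = 0; i < fieldSpecs.length; i++) {
      // The field name precedes the ":" type suffix in each configured entry.
      String field = fieldSpecs[i].split(":")[0];
      path.add(hiveStyle ? field + "=" + values[i] : values[i]);
    }
    return path.toString();
  }

  public static void main(String[] args) {
    String[] specs = {"country:SIMPLE", "city:SIMPLE"};
    String[] values = {"MT", "Valletta"};
    System.out.println(buildPath(specs, values, true)); // country=MT/city=Valletta
  }
}
   ```

   With hive-style partitioning enabled (as in the test properties above), each segment becomes `field=value`; otherwise only the raw value is used.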









[incubator-hudi] branch hudi_test_suite_refactor updated (566f245 -> a048bf3)

2020-05-21 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard 566f245  [HUDI-394] Provide a basic implementation of test suite
 add a048bf3  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (566f245)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (a048bf3)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../test/java/org/apache/hudi/testsuite/dag/ComplexDagGenerator.java  | 4 ++--
 .../java/org/apache/hudi/testsuite/job/TestHoodieTestSuiteJob.java| 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)



[GitHub] [incubator-hudi] garyli1019 commented on pull request #1574: [HUDI-701]Add unit test for HDFSParquetImportCommand

2020-05-21 Thread GitBox


garyli1019 commented on pull request #1574:
URL: https://github.com/apache/incubator-hudi/pull/1574#issuecomment-632339112


   Hi @hddong , thanks for your contribution on these tests.
   There are some tests failing in my local build in the `hudi-cli` module. I 
believe it could be related to the runtime environment. Do you think we should 
set up these tests in a docker environment?
   
   Example stack trace for `testConvertWithInsert()`:
   ```
   expected: <true> but was: <false> 
   Comparison Failure: 
   Expected :true
   Actual   :false
   
   
   
   
   java.lang.NullPointerException
at 
org.apache.hudi.cli.integ.ITTestHDFSParquetImportCommand.lambda$testConvertWithInsert$1(ITTestHDFSParquetImportCommand.java:100)
at org.junit.jupiter.api.AssertAll.lambda$assertAll$1(AssertAll.java:68)
at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at 
java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:373)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.junit.jupiter.api.AssertAll.assertAll(AssertAll.java:77)
at org.junit.jupiter.api.AssertAll.assertAll(AssertAll.java:44)
at org.junit.jupiter.api.Assertions.assertAll(Assertions.java:2856)
at 
org.apache.hudi.cli.integ.ITTestHDFSParquetImportCommand.testConvertWithInsert(ITTestHDFSParquetImportCommand.java:98)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:686)
at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:208)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:137)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:71)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:135)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:125)
at 
org.junit.platform.engine.support.hierarchical.Node.around(Node.java:135)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:123)
at 
org.junit.platform.engine.support.hie

[GitHub] [incubator-hudi] maduxi edited a comment on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


maduxi edited a comment on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632323358


   We are using it at an online casino based in Malta. We are using it in 
production, but only for a small part of our dataset. It's a large table that 
has frequent updates, and with hudi we are able to update it frequently and 
still not face any issue while querying.
   I just have a challenge now, and it's to make it available to the analysts 
in jupyter using livy. I was able to add the hudi jar to livy, but the 
httpclient lib required by hudi breaks livy, and the one provided by livy 
breaks hudi compatibility.
   







[GitHub] [incubator-hudi] maduxi commented on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


maduxi commented on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632323358


   We are using it at an online casino based in Malta. We are using it in 
production, but only for a small part of our dataset. It's a large table that 
has frequent updates, and with hudi we are able to update it frequently and 
still not face any issue while querying.
   I just have a challenge now, and it's to make it available to the analysts 
in jupyter using livy. I was able to add the hudi jar to livy, but the 
httpclient lib required by hudi breaks livy, and the one provided by livy 
breaks hudi compatibility.
   







[GitHub] [incubator-hudi] lamber-ken commented on pull request #1651: [MINOR] add impala release and spark partition discovery

2020-05-21 Thread GitBox


lamber-ken commented on pull request #1651:
URL: https://github.com/apache/incubator-hudi/pull/1651#issuecomment-632314208


   @garyli1019 it would be nice to add more context about the pr next time : )







[GitHub] [incubator-hudi] lamber-ken commented on pull request #1651: [MINOR] add impala release and spark partition discovery

2020-05-21 Thread GitBox


lamber-ken commented on pull request #1651:
URL: https://github.com/apache/incubator-hudi/pull/1651#issuecomment-632313692


   👍 @garyli1019 







[GitHub] [incubator-hudi] garyli1019 commented on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


garyli1019 commented on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632312446


   @vinothchandar Will do once I clear some internal process







[GitHub] [incubator-hudi] prashanthpdesai commented on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


prashanthpdesai commented on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632311231


   @vinothchandar : sure , yes we are still in pre prod . 
   
   







[GitHub] [incubator-hudi] vinothchandar commented on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


vinothchandar commented on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632305767


   @garyli1019 do you mind sharing your company name/logo and is it okay to 
list this on powered_by? 
   
   @prashanthpdesai Let's offline the small file issue.. Interested to 
understand why it does not work for you in MOR.. As I understand it, you are 
still in pre-prod? 







[GitHub] [incubator-hudi] prashanthpdesai edited a comment on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


prashanthpdesai edited a comment on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632263515


   We are trying to use the Hudi DeltaStreamer to read from a compacted Kafka topic 
in a production environment, pull the messages incrementally, and persist the 
data in the MapR Platform at a regular interval. We tried initially with MOR in 
the streamer's continuous mode and faced a small-file issue, so now we plan to run in 
mini batches (every 2 hours) with COPY_ON_WRITE to avoid compaction etc. We are 
facing an OffsetOutOfRange exception; we tried both auto.offset.reset=earliest 
and latest and encountered the same exception. We notice in the log that 
auto.offset.reset is being overridden to none:
   WARN kafka010.KafkaUtils: overriding enable.auto.commit to false for executor
   WARN kafka010.KafkaUtils: overriding auto.offset.reset to none for executor 






[GitHub] [incubator-hudi] prashanthpdesai edited a comment on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


prashanthpdesai edited a comment on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632263515


   We are trying to use the Hudi DeltaStreamer to read from a compacted Kafka 
topic in a production environment, pull the messages incrementally, and persist 
the data to the MapR platform at regular intervals. We initially tried MOR with 
the streamer's continuous mode and hit the small file issue, so we now plan to 
run in mini-batches (every 2 hours) with COPY_ON_WRITE to avoid compaction. We 
are facing an OffsetOutOfRange exception. 







[GitHub] [incubator-hudi] prashanthpdesai commented on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


prashanthpdesai commented on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632263515


   We are trying to use the Hudi DeltaStreamer to read from a compacted Kafka 
topic in a production environment and pull the messages incrementally. We 
initially tried MOR with the streamer's continuous mode and hit the small file 
issue, so we now plan to run in mini-batches (every 2 hours) with COPY_ON_WRITE 
to avoid compaction. We are facing an OffsetOutOfRange exception.  







[GitHub] [incubator-hudi] garyli1019 edited a comment on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


garyli1019 edited a comment on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632255322


   We have been using HUDI to manage a data lake with 500+ TB of manufacturing 
data for almost a year now. In the IoT world, late arrivals and updates are a 
very common scenario, and HUDI handles them perfectly for us.
   We use Impala to query the data. HUDI's small file handling and easy 
partitioning let us build an efficient structure that keeps queries fast.
   In addition, incremental pulling makes expensive batch jobs, such as 
aggregating BI dashboards and maintaining a large graph database, much more 
efficient through HUDI's custom merging between the historical data and the 
change data.







[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1644: [HUDI-811] Restructure test packages in hudi-common

2020-05-21 Thread GitBox


codecov-commenter edited a comment on pull request #1644:
URL: https://github.com/apache/incubator-hudi/pull/1644#issuecomment-631673778


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1644?src=pr&el=h1) 
Report
   > Merging 
[#1644](https://codecov.io/gh/apache/incubator-hudi/pull/1644?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/802d16c8c9793156ef7fef0c59088040800fe025&el=desc)
 will **not change** coverage.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1644/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1644?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##           master    #1644    +/-   ##
   =========================================
     Coverage    18.33%   18.33%
     Complexity     855      855
   =========================================
     Files          344      344
     Lines        15167    15167
     Branches      1512     1512
   =========================================
     Hits          2781     2781
     Misses       12033    12033
     Partials       353      353
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1644?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...che/hudi/common/table/log/HoodieLogFileReader.java](https://codecov.io/gh/apache/incubator-hudi/pull/1644/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGaWxlUmVhZGVyLmphdmE=)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1644?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1644?src=pr&el=footer).
 Last update 
[802d16c...f2d05ee](https://codecov.io/gh/apache/incubator-hudi/pull/1644?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [incubator-hudi] garyli1019 commented on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


garyli1019 commented on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632255322


   We have been using HUDI to manage a data lake with 500+ TB of manufacturing 
data for almost a year now. In the IoT world, late arrivals and updates are a 
very common scenario, and HUDI handles them perfectly for us.
   We use Impala to query the data. HUDI's small file handling and easy 
partitioning let us build an efficient structure that keeps queries fast.







[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-21 Thread GitBox


garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-632241626


   
![image](https://user-images.githubusercontent.com/23007841/82587156-85ee5e80-9b4d-11ea-839f-798633fdd1a4.png)
   Hi @vinothchandar , unfortunately this bug happened to me again. Do you 
think the current patch makes sense?
   The only change is from `totalBytesWritten > 0` to `totalBytesWritten > 
hoodieWriteConfig.getParquetSmallFileLimit()`, so we can skip using a small 
commit for the estimation.
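
The guard change described above can be illustrated with a small, hypothetical sketch (the names and structure here are illustrative, not Hudi's actual fields): only a past commit that wrote more bytes than the small-file limit is used to derive the average record size; otherwise a configured fallback estimate applies.

```java
// Hypothetical sketch of the estimation guard discussed above. Before the
// patch the guard was effectively `totalBytesWritten > 0`, so a tiny commit
// could yield a wildly skewed average record size; requiring the commit to
// exceed the small-file limit skips such commits.
public class RecordSizeEstimate {

    static long estimate(long totalBytesWritten, long totalRecordsWritten,
                         long smallFileLimit, long fallbackSize) {
        if (totalBytesWritten > smallFileLimit && totalRecordsWritten > 0) {
            // Commit is large enough to be representative: use its average.
            return totalBytesWritten / totalRecordsWritten;
        }
        // Otherwise fall back to the configured estimate.
        return fallbackSize;
    }

    public static void main(String[] args) {
        long limit = 100L * 1024 * 1024; // e.g. a 100 MB small-file limit
        // Tiny commit (4 KB): ignored, fallback used.
        System.out.println(estimate(4096L, 2L, limit, 1024L)); // 1024
        // Large commit (200 MB over 1M records): used for the average.
        System.out.println(estimate(200L * 1024 * 1024, 1_000_000L, limit, 1024L)); // 209
    }
}
```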







[GitHub] [incubator-hudi] garyli1019 commented on pull request #1652: [HUDI-918] HUDI small bug

2020-05-21 Thread GitBox


garyli1019 commented on pull request #1652:
URL: https://github.com/apache/incubator-hudi/pull/1652#issuecomment-632238289


   Hi @UZi5136225 , thanks for submitting this PR. I am not sure I understand 
the bug you are referring to. Could you explain a little more?







[incubator-hudi] branch hudi_test_suite_refactor updated (894ab75 -> 566f245)

2020-05-21 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard 894ab75  [HUDI-394] Provide a basic implementation of test suite
 add 566f245  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (894ab75)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (566f245)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../test/java/org/apache/hudi/testsuite/dag/ComplexDagGenerator.java| 2 +-
 .../test/java/org/apache/hudi/testsuite/job/TestHoodieTestSuiteJob.java | 1 +
 .../java/org/apache/hudi/utilities/testutils/UtilitiesTestBase.java | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)



[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1644: [HUDI-811] Restructure test packages in hudi-common

2020-05-21 Thread GitBox


xushiyan commented on a change in pull request #1644:
URL: https://github.com/apache/incubator-hudi/pull/1644#discussion_r428757663



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/util/collection/TestRocksDBDAO.java
##
@@ -45,7 +45,7 @@
 /**
  * Tests RocksDB manager {@link RocksDBDAO}.
  */
-public class TestRocksDBManager {
+public class TestRocksDBDAO {

Review comment:
   @yanghua I'm trying to align it with the class under test, which is 
`RocksDBDAO`; the pattern is to prefix the class name with `Test`. Given that 
we only have a `RocksDBDAO` class, not `RocksDBManager`, I figured we'd better 
align the test name with it to avoid confusion. Sound good?









[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1644: [HUDI-811] Restructure test packages in hudi-common

2020-05-21 Thread GitBox


xushiyan commented on a change in pull request #1644:
URL: https://github.com/apache/incubator-hudi/pull/1644#discussion_r428760599



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/fs/inline/TestInLineFileSystemHFileInLining.java
##
@@ -40,18 +42,18 @@
 import java.util.Set;
 import java.util.UUID;
 
-import static org.apache.hudi.common.fs.inline.FileSystemTestUtils.FILE_SCHEME;
-import static org.apache.hudi.common.fs.inline.FileSystemTestUtils.RANDOM;
-import static 
org.apache.hudi.common.fs.inline.FileSystemTestUtils.getPhantomFile;
-import static 
org.apache.hudi.common.fs.inline.FileSystemTestUtils.getRandomOuterInMemPath;
+import static org.apache.hudi.common.testutils.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.common.testutils.FileSystemTestUtils.RANDOM;
+import static 
org.apache.hudi.common.testutils.FileSystemTestUtils.getPhantomFile;
+import static 
org.apache.hudi.common.testutils.FileSystemTestUtils.getRandomOuterInMemPath;
 import static org.junit.jupiter.api.Assertions.assertArrayEquals;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNotEquals;
 
 /**
  * Tests {@link InLineFileSystem} to inline HFile.
  */
-public class TestHFileInLining {
+public class TestInLineFileSystemHFileInLining {

Review comment:
   @yanghua `TestInLineFileSystem` already exists in the same package. This 
class tests `InLineFileSystem` with a group of cases for HFile inlining.









[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1644: [HUDI-811] Restructure test packages in hudi-common

2020-05-21 Thread GitBox


xushiyan commented on a change in pull request #1644:
URL: https://github.com/apache/incubator-hudi/pull/1644#discussion_r428757663



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/util/collection/TestRocksDBDAO.java
##
@@ -45,7 +45,7 @@
 /**
  * Tests RocksDB manager {@link RocksDBDAO}.
  */
-public class TestRocksDBManager {
+public class TestRocksDBDAO {

Review comment:
   @yanghua I'm trying to align it with the class under test, which is 
`RocksDBDAO`; the pattern is to prefix the class name with `Test`.









[GitHub] [incubator-hudi] broussea1901 edited a comment on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


broussea1901 edited a comment on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632176944


   We're currently deploying HUDI 0.5.0 at an EU bank; it is not in production 
yet. HUDI will be used to provide ACID guarantees for data ingestion batches 
and streams that need them (sourcing update files & CDC streams to HDFS).







[GitHub] [incubator-hudi] broussea1901 commented on issue #661: Tracking ticket for reporting Hudi usages from the community

2020-05-21 Thread GitBox


broussea1901 commented on issue #661:
URL: https://github.com/apache/incubator-hudi/issues/661#issuecomment-632176944


   We're currently deploying HUDI 0.5.0 at an EU bank; it is not in production 
yet. HUDI will be used to provide ACID guarantees for data ingestion batches 
and streams that need them (sourcing update files & CDC streams to HDFS).







[GitHub] [incubator-hudi] UZi5136225 opened a new pull request #1652: [HUDI-918] HUDI small bug

2020-05-21 Thread GitBox


UZi5136225 opened a new pull request #1652:
URL: https://github.com/apache/incubator-hudi/pull/1652


   
   ## What is the purpose of the pull request
   
   sourceLimit should not be less than the number of Kafka partitions
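
   A plausible, purely illustrative sketch of the failure mode this PR targets: if events are allocated to partitions by integer division, a sourceLimit below the partition count rounds down to zero events per partition, so no data is read. The names below are hypothetical and not Hudi's actual code.

```java
// Hypothetical sketch: integer division allocates events per partition, and
// sourceLimit / numPartitions truncates toward zero when sourceLimit is
// smaller than the partition count, leaving every partition with 0 events.
public class SourceLimitAllocation {

    static long eventsPerPartition(long sourceLimit, int numPartitions) {
        return sourceLimit / numPartitions; // truncating integer division
    }

    public static void main(String[] args) {
        // 10 events requested across 12 partitions -> 0 per partition -> no data.
        System.out.println(eventsPerPartition(10L, 12)); // 0
        // 24 events across 12 partitions -> 2 per partition.
        System.out.println(eventsPerPartition(24L, 12)); // 2
    }
}
```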
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Updated] (HUDI-918) deltastreamer bug is no new data

2020-05-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-918:

Labels: pull-request-available  (was: )

> deltastreamer bug is  no new data
> -
>
> Key: HUDI-918
> URL: https://issues.apache.org/jira/browse/HUDI-918
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When the sourcelimit is less than the number of Kafka partitions, Hudi cannot 
> get the data



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-918) deltastreamer bug is no new data

2020-05-21 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-918:
---
Summary: deltastreamer bug is  no new data  (was: Hudi can't get data)

> deltastreamer bug is  no new data
> -
>
> Key: HUDI-918
> URL: https://issues.apache.org/jira/browse/HUDI-918
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.6.0
>
>
> When the sourcelimit is less than the number of Kafka partitions, Hudi cannot 
> get the data





[jira] [Updated] (HUDI-707) Add unit test for StatsCommand

2020-05-21 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-707:
--
Fix Version/s: 0.6.0

> Add unit test for StatsCommand
> --
>
> Key: HUDI-707
> URL: https://issues.apache.org/jira/browse/HUDI-707
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: CLI, Testing
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>






[incubator-hudi] branch master updated: [HUDI-707] Add unit test for StatsCommand (#1645)

2020-05-21 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 802d16c  [HUDI-707] Add unit test for StatsCommand (#1645)
802d16c is described below

commit 802d16c8c9793156ef7fef0c59088040800fe025
Author: hongdd 
AuthorDate: Thu May 21 18:28:04 2020 +0800

[HUDI-707] Add unit test for StatsCommand (#1645)
---
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |  17 ++
 .../org/apache/hudi/cli/commands/StatsCommand.java |  31 ++--
 .../cli/commands/TestArchivedCommitsCommand.java   |   2 +-
 .../hudi/cli/commands/TestRepairsCommand.java  |   2 +-
 .../apache/hudi/cli/commands/TestStatsCommand.java | 176 +
 .../common/HoodieTestCommitMetadataGenerator.java  |  28 +++-
 .../apache/hudi/common/model/HoodieTestUtils.java  |  23 +++
 7 files changed, 260 insertions(+), 19 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
index 77f486b..7432732 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
@@ -96,4 +96,21 @@ public class HoodieTableHeaderFields {
   public static final String HEADER_TOTAL_PARTITIONS = "Total Partitions";
   public static final String HEADER_DELETED_FILE = "Deleted File";
   public static final String HEADER_SUCCEEDED = "Succeeded";
+
+  /**
+   * Fields of Stats.
+   */
+  public static final String HEADER_COMMIT_TIME = "CommitTime";
+  public static final String HEADER_TOTAL_UPSERTED = "Total Upserted";
+  public static final String HEADER_TOTAL_WRITTEN = "Total Written";
+  public static final String HEADER_WRITE_AMPLIFICATION_FACTOR = "Write 
Amplification Factor";
+  public static final String HEADER_HISTOGRAM_MIN = "Min";
+  public static final String HEADER_HISTOGRAM_10TH = "10th";
+  public static final String HEADER_HISTOGRAM_50TH = "50th";
+  public static final String HEADER_HISTOGRAM_AVG = "avg";
+  public static final String HEADER_HISTOGRAM_95TH = "95th";
+  public static final String HEADER_HISTOGRAM_MAX = "Max";
+  public static final String HEADER_HISTOGRAM_NUM_FILES = "NumFiles";
+  public static final String HEADER_HISTOGRAM_STD_DEV = "StdDev";
+
 }
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
index e5be0e4..72cf6c0 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
@@ -20,6 +20,7 @@ package org.apache.hudi.cli.commands;
 
 import org.apache.hudi.cli.HoodieCLI;
 import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.HoodieTableHeaderFields;
 import org.apache.hudi.cli.TableHeader;
 import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
@@ -54,7 +55,7 @@ import java.util.stream.Collectors;
 @Component
 public class StatsCommand implements CommandMarker {
 
-  private static final int MAX_FILES = 100;
+  public static final int MAX_FILES = 100;
 
   @CliCommand(value = "stats wa", help = "Write Amplification. Ratio of how 
many records were upserted to how many "
   + "records were actually written")
@@ -92,12 +93,14 @@ public class StatsCommand implements CommandMarker {
 }
 rows.add(new Comparable[] {"Total", totalRecordsUpserted, 
totalRecordsWritten, waf});
 
-TableHeader header = new 
TableHeader().addTableHeaderField("CommitTime").addTableHeaderField("Total 
Upserted")
-.addTableHeaderField("Total Written").addTableHeaderField("Write 
Amplification Factor");
+TableHeader header = new 
TableHeader().addTableHeaderField(HoodieTableHeaderFields.HEADER_COMMIT_TIME)
+.addTableHeaderField(HoodieTableHeaderFields.HEADER_TOTAL_UPSERTED)
+.addTableHeaderField(HoodieTableHeaderFields.HEADER_TOTAL_WRITTEN)
+
.addTableHeaderField(HoodieTableHeaderFields.HEADER_WRITE_AMPLIFICATION_FACTOR);
 return HoodiePrintHelper.print(header, new HashMap<>(), sortByField, 
descending, limit, headerOnly, rows);
   }
 
-  private Comparable[] printFileSizeHistogram(String instantTime, Snapshot s) {
+  public Comparable[] printFileSizeHistogram(String instantTime, Snapshot s) {
 return new Comparable[] {instantTime, s.getMin(), s.getValue(0.1), 
s.getMedian(), s.getMean(), s.get95thPercentile(),
 s.getMax(), s.size(), s.getStdDev()};
   }
@@ -138,6 +141,20 @@ public class StatsCommand implements CommandMarker {
 Snapshot s = globalHistogram.getSnapshot();
 rows.add(printFileSizeHistogram("ALL", s));
 
+TableHeader header = new TableHeader()
+.addTableHeaderField(HoodieT

[GitHub] [incubator-hudi] yanghua merged pull request #1645: [HUDI-707]Add unit test for StatsCommand

2020-05-21 Thread GitBox


yanghua merged pull request #1645:
URL: https://github.com/apache/incubator-hudi/pull/1645


   







[jira] [Updated] (HUDI-707) Add unit test for StatsCommand

2020-05-21 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-707:
--
Status: Open  (was: New)

> Add unit test for StatsCommand
> --
>
> Key: HUDI-707
> URL: https://issues.apache.org/jira/browse/HUDI-707
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: CLI, Testing
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Closed] (HUDI-707) Add unit test for StatsCommand

2020-05-21 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-707.
-
Resolution: Done

Done via master branch: 802d16c8c9793156ef7fef0c59088040800fe025

> Add unit test for StatsCommand
> --
>
> Key: HUDI-707
> URL: https://issues.apache.org/jira/browse/HUDI-707
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: CLI, Testing
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-918) Hudi can't get data

2020-05-21 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-918:
---
Status: Open  (was: New)

> Hudi can't get data
> ---
>
> Key: HUDI-918
> URL: https://issues.apache.org/jira/browse/HUDI-918
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.6.0
>
>
> When the sourcelimit is less than the number of Kafka partitions, Hudi cannot 
> get the data





[jira] [Created] (HUDI-918) Hudi can't get data

2020-05-21 Thread liujinhui (Jira)
liujinhui created HUDI-918:
--

 Summary: Hudi can't get data
 Key: HUDI-918
 URL: https://issues.apache.org/jira/browse/HUDI-918
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: DeltaStreamer
Reporter: liujinhui
Assignee: liujinhui
 Fix For: 0.6.0


When the sourcelimit is less than the number of Kafka partitions, Hudi cannot 
get the data





[jira] [Updated] (HUDI-861) Add Github and Twitter Widget on Hudi's official website

2020-05-21 Thread hong dongdong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hong dongdong updated HUDI-861:
---
Status: In Progress  (was: Open)

> Add Github and Twitter Widget on Hudi's official website
> 
>
> Key: HUDI-861
> URL: https://issues.apache.org/jira/browse/HUDI-861
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: vinoyang
>Assignee: hong dongdong
>Priority: Major
>
> In order to further strengthen the influence of the Hudi community, I suggest 
> that we embed GitHub and Twitter widgets on Hudi's official website, as 
> Apache Ignite does. [https://ignite.apache.org/]
>  





[jira] [Updated] (HUDI-861) Add Github and Twitter Widget on Hudi's official website

2020-05-21 Thread hong dongdong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hong dongdong updated HUDI-861:
---
Status: Open  (was: New)

> Add Github and Twitter Widget on Hudi's official website
> 
>
> Key: HUDI-861
> URL: https://issues.apache.org/jira/browse/HUDI-861
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: vinoyang
>Assignee: hong dongdong
>Priority: Major
>
> In order to further strengthen the influence of the Hudi community, I suggest 
> that we embed GitHub and Twitter widgets on Hudi's official website, as 
> Apache Ignite does. [https://ignite.apache.org/]
>  





[GitHub] [incubator-hudi] hddong commented on pull request #1645: [HUDI-707]Add unit test for StatsCommand

2020-05-21 Thread GitBox


hddong commented on pull request #1645:
URL: https://github.com/apache/incubator-hudi/pull/1645#issuecomment-631977694


   @yanghua Thanks for your review; I have addressed them.







[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1645: [HUDI-707]Add unit test for StatsCommand

2020-05-21 Thread GitBox


codecov-commenter edited a comment on pull request #1645:
URL: https://github.com/apache/incubator-hudi/pull/1645#issuecomment-631841340


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=h1) 
Report
   > Merging 
[#1645](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/6a0aa9a645d11ed7b50e18aa0563dafcd9d145f7&el=desc)
 will **not change** coverage.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1645/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##           master    #1645    +/-   ##
   =========================================
     Coverage    18.33%   18.33%
     Complexity     855      855
   =========================================
     Files          344      344
     Lines        15167    15167
     Branches      1512     1512
   =========================================
     Hits          2781     2781
     Misses       12033    12033
     Partials       353      353
   ```
   
   
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=footer).
 Last update 
[6a0aa9a...5abfd69](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [incubator-hudi] codecov-commenter edited a comment on pull request #1645: [HUDI-707]Add unit test for StatsCommand

2020-05-21 Thread GitBox


codecov-commenter edited a comment on pull request #1645:
URL: https://github.com/apache/incubator-hudi/pull/1645#issuecomment-631841340


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=h1) 
Report
   > Merging 
[#1645](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/6a0aa9a645d11ed7b50e18aa0563dafcd9d145f7&el=desc)
 will **not change** coverage.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1645/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=tree)
   
   ```diff
   @@Coverage Diff@@
   ## master#1645   +/-   ##
   =
 Coverage 18.33%   18.33%   
 Complexity  855  855   
   =
 Files   344  344   
 Lines 1516715167   
 Branches   1512 1512   
   =
 Hits   2781 2781   
 Misses1203312033   
 Partials353  353   
   ```
   
   
   
   --
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=footer). Last update [6a0aa9a...5abfd69](https://codecov.io/gh/apache/incubator-hudi/pull/1645?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1644: [HUDI-811] Restructure test packages in hudi-common

2020-05-21 Thread GitBox


yanghua commented on a change in pull request #1644:
URL: https://github.com/apache/incubator-hudi/pull/1644#discussion_r428506563



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/fs/inline/TestInLineFileSystemHFileInLining.java
##
@@ -40,18 +42,18 @@
 import java.util.Set;
 import java.util.UUID;
 
-import static org.apache.hudi.common.fs.inline.FileSystemTestUtils.FILE_SCHEME;
-import static org.apache.hudi.common.fs.inline.FileSystemTestUtils.RANDOM;
-import static org.apache.hudi.common.fs.inline.FileSystemTestUtils.getPhantomFile;
-import static org.apache.hudi.common.fs.inline.FileSystemTestUtils.getRandomOuterInMemPath;
+import static org.apache.hudi.common.testutils.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.common.testutils.FileSystemTestUtils.RANDOM;
+import static org.apache.hudi.common.testutils.FileSystemTestUtils.getPhantomFile;
+import static org.apache.hudi.common.testutils.FileSystemTestUtils.getRandomOuterInMemPath;
 import static org.junit.jupiter.api.Assertions.assertArrayEquals;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNotEquals;
 
 /**
  * Tests {@link InLineFileSystem} to inline HFile.
  */
-public class TestHFileInLining {
+public class TestInLineFileSystemHFileInLining {

Review comment:
   Can we rename to `TestInLineFileSystem`?

##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/fs/inline/TestInLineFileSystemHFileInLining.java
##
@@ -60,7 +62,7 @@
   private int maxRows = 100 + RANDOM.nextInt(1000);
   private Path generatedPath;
 
-  public TestHFileInLining() {
+  public TestInLineFileSystemHFileInLining() {

Review comment:
   ditto

##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/util/collection/TestRocksDBDAO.java
##
@@ -45,7 +45,7 @@
 /**
  * Tests RocksDB manager {@link RocksDBDAO}.
  */
-public class TestRocksDBManager {
+public class TestRocksDBDAO {

Review comment:
   I would recommend renaming `RocksDBDAO` to `RocksDBManager`. WDYT? Generally, `xxxDAO` classes belong in web projects. cc @vinothchandar









[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1645: [HUDI-707]Add unit test for StatsCommand

2020-05-21 Thread GitBox


yanghua commented on a change in pull request #1645:
URL: https://github.com/apache/incubator-hudi/pull/1645#discussion_r428490566



##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestStatsCommand.java
##
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.HoodieTableHeaderFields;
+import org.apache.hudi.cli.TableHeader;
+import org.apache.hudi.cli.common.HoodieTestCommitMetadataGenerator;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+import org.apache.hudi.common.util.Option;
+
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.springframework.shell.core.CommandResult;
+
+import java.io.File;
+import java.io.IOException;
+import java.text.DecimalFormat;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Test class of {@link org.apache.hudi.cli.commands.StatsCommand}.
+ */
+public class TestStatsCommand extends AbstractShellIntegrationTest {
+
+  private String tablePath;
+
+  @BeforeEach
+  public void init() throws IOException {
+String tableName = "test_table";
+tablePath = basePath + File.separator + tableName;
+
+HoodieCLI.conf = jsc.hadoopConfiguration();
+// Create table and connect
+new TableCommand().createTable(
+tablePath, "test_table", HoodieTableType.COPY_ON_WRITE.name(),
+"", TimelineLayoutVersion.VERSION_1, "org.apache.hudi.common.model.HoodieAvroPayload");
+  }
+
+  /**
+   * Test case for command 'stats wa'.
+   */
+  @Test
+  public void testWriteAmplificationStats() {
+// generate data and metadata
+Map data = new LinkedHashMap();

Review comment:
   -> `Map data = new LinkedHashMap<>();`
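   As background on this suggestion, here is a minimal, self-contained Java sketch of the diamond operator (`<>`), which lets the compiler infer the type arguments already declared on the left-hand side. The class, variable names, and element types below are illustrative only, not code from the PR (the archive also stripped the generic type arguments from the quoted diff, so the actual types are unknown):

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   public class DiamondDemo {
       public static void main(String[] args) {
           // Pre-Java-7 style: type arguments repeated verbatim on both sides.
           Map<String, Integer[]> verbose = new LinkedHashMap<String, Integer[]>();

           // Diamond operator (Java 7+): the compiler infers <String, Integer[]>
           // from the declared type, so the right-hand side stays short and stays
           // in sync if the declaration ever changes.
           Map<String, Integer[]> concise = new LinkedHashMap<>();

           concise.put("100", new Integer[]{10, 10});
           System.out.println(concise.size()); // prints 1
       }
   }
   ```

   Both declarations are equivalent at runtime; the diamond form is just the idiomatic spelling the review is asking for.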

##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestStatsCommand.java
##
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.HoodieTableHeaderFields;
+import org.apache.hudi.cli.TableHeader;
+import org.apache.hudi.cli.common.HoodieTestCommitMetadataGenerator;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+import org.apache.hudi.common.util.Option;
+
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api

[jira] [Updated] (HUDI-917) Calculation of 'stats wa' need to be modified

2020-05-21 Thread yaojingyi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yaojingyi updated HUDI-917:
---
Attachment: (was: image-2020-05-21-14-21-33-244.png)

> Calculation of 'stats wa' need to be modified
> ---------------------------------------------
>
> Key: HUDI-917
> URL: https://issues.apache.org/jira/browse/HUDI-917
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI
>Reporter: yaojingyi
>Priority: Major
> Attachments: image-2020-05-21-14-22-39-624.png
>
>
> 'Write Amplification Factor' = 'Total Written' / 'Total Upserted'
> 'Total Written' always increases.
> 'Total Upserted' does not always increase. When the newest commit only inserts
> new data, 'Total Upserted' will be 0.
> This leads to results that are not in line with our understanding.
>  
> If I insert 3 times, update 1 time, then insert again (10 rows each time),
> the 'stats wa' result is as follows:
> !image-2020-05-21-14-22-39-624.png!
>  
> 'Total Written' needs to be changed to the difference between adjacent commits.
> I found that numsWrites always increases. That is the cause of this behavior.
>  
> I'll try to fix it.
>  
>  
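For illustration only, a minimal Java sketch of the fix the reporter proposes: compute "Total Written" as the delta of the cumulative write counter between adjacent commits before dividing by "Total Upserted". The class name, the sample numbers, and the `-1.0` sentinel for pure-insert commits (where dividing by an upsert count of 0 is undefined) are hypothetical, not Hudi code:

```java
import java.util.ArrayList;
import java.util.List;

public class WriteAmplificationDemo {

    // Write amplification per commit, using the delta of the cumulative
    // "written" counter between adjacent commits rather than the cumulative
    // total itself. Returns -1.0 for pure-insert commits where 'upserted' is 0.
    static List<Double> perCommitWa(long[] cumulativeWritten, long[] upserted) {
        List<Double> result = new ArrayList<>();
        long prev = 0;
        for (int i = 0; i < cumulativeWritten.length; i++) {
            long writtenThisCommit = cumulativeWritten[i] - prev;
            result.add(upserted[i] == 0 ? -1.0
                    : (double) writtenThisCommit / upserted[i]);
            prev = cumulativeWritten[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical counters for 3 inserts, then 1 update of 10 rows,
        // with 10 rows written per commit (so the cumulative counter grows by 10).
        long[] written  = {10, 20, 30, 40};
        long[] upserted = { 0,  0,  0, 10};
        System.out.println(perCommitWa(written, upserted)); // prints [-1.0, -1.0, -1.0, 1.0]
    }
}
```

With the delta, the update commit reports WA = (40 - 30) / 10 = 1.0, whereas dividing the cumulative total would report 4.0, which is the inflated result the issue describes.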



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-917) Calculation of 'stats wa' need to be modified

2020-05-21 Thread yaojingyi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yaojingyi updated HUDI-917:
---
Attachment: (was: image-2020-05-21-14-10-03-871.png)

> Calculation of 'stats wa' need to be modified
> ---------------------------------------------
>
> Key: HUDI-917
> URL: https://issues.apache.org/jira/browse/HUDI-917
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI
>Reporter: yaojingyi
>Priority: Major
> Attachments: image-2020-05-21-14-22-39-624.png
>
>
> 'Write Amplification Factor' = 'Total Written' / 'Total Upserted'
> 'Total Written' always increases.
> 'Total Upserted' does not always increase. When the newest commit only inserts
> new data, 'Total Upserted' will be 0.
> This leads to results that are not in line with our understanding.
>  
> If I insert 3 times, update 1 time, then insert again (10 rows each time),
> the 'stats wa' result is as follows:
> !image-2020-05-21-14-22-39-624.png!
>  
> 'Total Written' needs to be changed to the difference between adjacent commits.
> I found that numsWrites always increases. That is the cause of this behavior.
>  
> I'll try to fix it.
>  
>  


