[jira] [Comment Edited] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007908#comment-17007908
 ] 

lamber-ken edited comment on HUDI-486 at 1/4/20 4:23 AM:
-

Let me summarize the steps to reproduce.

1, check out commit *e1e5fe33249bf511486073dd9cf48e5b7ea14816*

2, build the source
{code:java}
mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
{code}
3, set up docker
{code:java}
cd docker && ./setup_demo.sh
{code}
4, generate data
{code:java}
cat demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
{code}
5, go into the container
{code:java}
docker exec -it adhoc-2 /bin/bash 
{code}
6, consume kafka data && sync to hive
{code:java}
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --storage-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_cow \
  --target-table stock_ticks_cow \
  --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

/var/hoodie/ws/hudi-hive/run_sync_tool.sh \
  --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive \
  --pass hive \
  --partitioned-by dt \
  --base-path /user/hive/warehouse/stock_ticks_cow \
  --database default \
  --table stock_ticks_cow
{code}
7, create incr_pull.txt
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}
8, execute org.apache.hudi.utilities.HiveIncrementalPuller
{code:java}
java -cp ./spark/jars/commons-cli-1.2.jar:./spark/jars/htrace-core-3.1.0-incubating.jar:./spark/jars/hadoop-hdfs-2.7.3.jar:./hive/lib/hive-exec-2.3.3.jar:./hive/lib/hive-common-2.3.3.jar:./hive/lib/hive-jdbc-2.3.3.jar:./hive/lib/hive-service-2.3.3.jar:./hive/lib/hive-service-rpc-2.3.3.jar:./spark/jars/httpcore-4.4.10.jar:./spark/jars/slf4j-api-1.7.16.jar:./spark/jars/hadoop-auth-2.7.3.jar:./hive/lib/commons-lang-2.6.jar:./spark/jars/commons-configuration-1.6.jar:./spark/jars/commons-collections-3.2.2.jar:./spark/jars/hadoop-common-2.7.3.jar:./hive/lib/antlr-runtime-3.5.2.jar:./spark/jars/log4j-1.2.17.jar:./hive/lib/commons-logging-1.2.jar:./hive/lib/commons-io-2.4.jar:$HUDI_UTILITIES_BUNDLE \
  org.apache.hudi.utilities.HiveIncrementalPuller \
  --hiveUrl jdbc:hive2://hiveserver:10000 \
  --hiveUser hive \
  --hivePass hive \
  --extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt \
  --sourceDb default \
  --sourceTable stock_ticks_cow \
  --targetDb default \
  --tmpdb default \
  --targetTable tempTable \
  --fromCommitTime 0 \
  --maxCommits 1
{code}



[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007908#comment-17007908
 ] 

lamber-ken commented on HUDI-486:
-

Let me summarize the steps to reproduce.

1, check out commit e1e5fe33249bf511486073dd9cf48e5b7ea14816

2, build the source
{code:java}
mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
{code}
3, set up docker
{code:java}
cd docker && ./setup_demo.sh
{code}
4, generate data
{code:java}
cat demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
{code}
5, go into the container
{code:java}
docker exec -it adhoc-2 /bin/bash 
{code}
6, consume kafka data && sync to hive
{code:java}
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --storage-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_cow \
  --target-table stock_ticks_cow \
  --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

/var/hoodie/ws/hudi-hive/run_sync_tool.sh \
  --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive \
  --pass hive \
  --partitioned-by dt \
  --base-path /user/hive/warehouse/stock_ticks_cow \
  --database default \
  --table stock_ticks_cow
{code}
7, create incr_pull.txt
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}
8, execute org.apache.hudi.utilities.HiveIncrementalPuller
{code:java}
java -cp ./spark/jars/commons-cli-1.2.jar:./spark/jars/htrace-core-3.1.0-incubating.jar:./spark/jars/hadoop-hdfs-2.7.3.jar:./hive/lib/hive-exec-2.3.3.jar:./hive/lib/hive-common-2.3.3.jar:./hive/lib/hive-jdbc-2.3.3.jar:./hive/lib/hive-service-2.3.3.jar:./hive/lib/hive-service-rpc-2.3.3.jar:./spark/jars/httpcore-4.4.10.jar:./spark/jars/slf4j-api-1.7.16.jar:./spark/jars/hadoop-auth-2.7.3.jar:./hive/lib/commons-lang-2.6.jar:./spark/jars/commons-configuration-1.6.jar:./spark/jars/commons-collections-3.2.2.jar:./spark/jars/hadoop-common-2.7.3.jar:./hive/lib/antlr-runtime-3.5.2.jar:./spark/jars/log4j-1.2.17.jar:./hive/lib/commons-logging-1.2.jar:./hive/lib/commons-io-2.4.jar:$HUDI_UTILITIES_BUNDLE \
  org.apache.hudi.utilities.HiveIncrementalPuller \
  --hiveUrl jdbc:hive2://hiveserver:10000 \
  --hiveUser hive \
  --hivePass hive \
  --extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt \
  --sourceDb default \
  --sourceTable stock_ticks_cow \
  --targetDb default \
  --tmpdb default \
  --targetTable tempTable \
  --fromCommitTime 0 \
  --maxCommits 1
{code}

> Improve documentation for using HiveIncrementalPuller
> -
>
> Key: HUDI-486
> URL: https://issues.apache.org/jira/browse/HUDI-486
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> For using HiveIncrementalPuller, one needs a lot of jars on the classpath. 
> These jars are not listed anywhere. As a result, one has to keep adding jars 
> to the classpath as each NoClassDefFoundError comes up during execution. 
> We should list the required jars so that it becomes easy for a first-time 
> user to use this tool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #149

2020-01-03 Thread Apache Jenkins Server
See 


Changes:

[nagarwal] [HUDI-118]: Options provided for passing properties to Cleaner,


--
[...truncated 2.18 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 'HUDI_home= 0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle   

[jira] [Updated] (HUDI-432) Benchmark HFile for scan vs seek

2020-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-432:
-
Description: 
We want to benchmark HFile scan vs seek as we intend to use HFile for record 
indexing. HFile will be used inline in the hudi log for index purposes. 

So, as part of benchmarking, we want to see when scan outperforms seek. 

This is our experiment set up.

keysToRead = number of keys to be looked up. // differs across experiment runs: 
100k, 200k, 500k, 1M. 

N = number of iterations

 
{code:java}
1M entries were written to a single HFile as key value pairs.
Also, stored the keys in a separate file (key_file).
keyList = read all keys from key_file
for N iterations {
    shuffle keyList
    trim the list to keysToRead
    start timer
    HFile read benchmark (scan/seek)
    end timer
}
compute avg across all captured timers
{code}
 

 

Result:

Scan outperforms seek somewhere around 350k to 400k lookups out of 1M entries 
with optimized configs.

  !Screen Shot 2020-01-03 at 6.44.25 PM.png!

Results can be found here: [^HFile benchmark.xlsx]

Source for benchmarking can be found here: 

[https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]
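
For illustration, here is a minimal Java sketch of the timing loop described 
above. It assumes a {{lookup}} callback standing in for the HFile scan- or 
seek-based read; the HFile reader wiring itself is omitted.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

public class LookupBenchmark {

  // Average wall time (in ms) across `iterations` runs, each looking up
  // `keysToRead` randomly chosen keys via the supplied lookup callback.
  static double averageMillis(List<String> allKeys, int keysToRead,
      int iterations, Consumer<List<String>> lookup) {
    List<String> keyList = new ArrayList<>(allKeys);
    long totalNanos = 0;
    for (int i = 0; i < iterations; i++) {
      Collections.shuffle(keyList);                        // shuffle keyList
      List<String> batch = keyList.subList(0, keysToRead); // trim to keysToRead
      long start = System.nanoTime();                      // start timer
      lookup.accept(batch);                                // scan or seek read
      totalNanos += System.nanoTime() - start;             // end timer
    }
    return totalNanos / (iterations * 1_000_000.0);        // avg of all timers
  }
}
{code}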

> Benchmark HFile for scan vs seek
> 
>
> Key: HUDI-432
> URL: https://issues.apache.org/jira/browse/HUDI-432
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.5.2
>
> Attachments: HFile benchmark.xlsx, Screen Shot 2020-01-03 at 6.44.25 
> PM.png
>
>
> We want to benchmark HFile scan vs seek as we intend to use HFile for record 
> indexing. HFile will be used inline in the hudi log for index purposes. 
> So, as part of benchmarking, we want to see when scan outperforms seek. 
> This is our experiment set up.
> keysToRead = number of keys to be looked up. // differs across experiment 
> runs: 100k, 200k, 500k, 1M. 
> N = number of iterations
>  
> {code:java}
> 1M entries were written to a single HFile as key value pairs.
> Also, stored the keys in a separate file (key_file).
> keyList = read all keys from key_file
> for N iterations {
>     shuffle keyList
>     trim the list to keysToRead
>     start timer
>     HFile read benchmark (scan/seek)
>     end timer
> }
> compute avg across all captured timers
> {code}
>  
>  
> Result:
> Scan outperforms seek somewhere around 350k to 400k lookups out of 1M 
> entries with optimized configs.
>   !Screen Shot 2020-01-03 at 6.44.25 PM.png!
> Results can be found here: [^HFile benchmark.xlsx]
> Source for benchmarking can be found here: 
> [https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-432) Benchmark HFile for scan vs seek

2020-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-432:
-
Attachment: HFile benchmark.xlsx

> Benchmark HFile for scan vs seek
> 
>
> Key: HUDI-432
> URL: https://issues.apache.org/jira/browse/HUDI-432
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.5.2
>
> Attachments: HFile benchmark.xlsx, Screen Shot 2020-01-03 at 6.44.25 
> PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-432) Benchmark HFile for scan vs seek

2020-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-432:
-
Attachment: Screen Shot 2020-01-03 at 6.44.25 PM.png

> Benchmark HFile for scan vs seek
> 
>
> Key: HUDI-432
> URL: https://issues.apache.org/jira/browse/HUDI-432
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.5.2
>
> Attachments: Screen Shot 2020-01-03 at 6.44.25 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-432) Benchmark HFile for scan vs seek

2020-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-432:
-
Summary: Benchmark HFile for scan vs seek  (was: Benchmark Inline Parquet 
Logging)

> Benchmark HFile for scan vs seek
> 
>
> Key: HUDI-432
> URL: https://issues.apache.org/jira/browse/HUDI-432
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-497) Add end to end test for HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-497:
---

Assignee: lamber-ken

> Add end to end test for HiveIncrementalPuller
> -
>
> Key: HUDI-497
> URL: https://issues.apache.org/jira/browse/HUDI-497
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Add end to end test for HiveIncrementalPuller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] wangxianghu opened a new pull request #1179: [HUDI-460] Redo hudi-integ-test log statements using SLF4J

2020-01-03 Thread GitBox
wangxianghu opened a new pull request #1179: [HUDI-460] Redo hudi-integ-test 
log statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1179
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Redo hudi-integ-test log statements using SLF4J*
   
   ## Brief change log
   
   *Redo hudi-integ-test log statements using SLF4J*
   
   ## Verify this pull request
   
   This pull request should be covered by existing tests.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-498) End to end to check HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007884#comment-17007884
 ] 

lamber-ken commented on HUDI-498:
-

hi [~Pratyaksh], I moved related issues here; let's work together to fix these :)

> End to end to check HiveIncrementalPuller
> -
>
> Key: HUDI-498
> URL: https://issues.apache.org/jira/browse/HUDI-498
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Incremental Pull
>Reporter: lamber-ken
>Priority: Major
>
> End to end to check HiveIncrementalPuller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] wangxianghu closed pull request #1163: [HUDI-460] Redo hudi-integ-test log statements using SLF4J

2020-01-03 Thread GitBox
wangxianghu closed pull request #1163: [HUDI-460] Redo hudi-integ-test log 
statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1163
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] wangxianghu commented on issue #1163: [HUDI-460] Redo hudi-integ-test log statements using SLF4J

2020-01-03 Thread GitBox
wangxianghu commented on issue #1163: [HUDI-460] Redo hudi-integ-test log 
statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1163#issuecomment-570747956
 
 
   @leesf  Ok, No problem.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (HUDI-497) Add end to end test for HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-497:
---

Assignee: (was: lamber-ken)

> Add end to end test for HiveIncrementalPuller
> -
>
> Key: HUDI-497
> URL: https://issues.apache.org/jira/browse/HUDI-497
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: lamber-ken
>Priority: Major
>
> Add end to end test for HiveIncrementalPuller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-497) Add end to end test for HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-497:

Parent: HUDI-498
Issue Type: Sub-task  (was: Bug)

> Add end to end test for HiveIncrementalPuller
> -
>
> Key: HUDI-497
> URL: https://issues.apache.org/jira/browse/HUDI-497
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Add end to end test for HiveIncrementalPuller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-484) NPE in HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-484:

Parent: HUDI-498
Issue Type: Sub-task  (was: Bug)

> NPE in HiveIncrementalPuller
> 
>
> Key: HUDI-484
> URL: https://issues.apache.org/jira/browse/HUDI-484
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
> Attachments: Screenshot 2019-12-30 at 4.43.51 PM.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When we try to use HiveIncrementalPuller class to incrementally pull changes 
> from hive, it throws NPE as it is unable to find IncrementalPull.sqltemplate 
> in the bundled jar. 
> Screenshot attached which shows the exception. 
> The jar contains the template. 
> Steps to reproduce - 
>  # copy hive-jdbc-2.3.1.jar, log4j-1.2.17.jar to docker/demo/config folder
>  # run cd docker && ./setup_demo.sh
>  # cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks 
> -P
>  #  {{docker exec -it adhoc-2 /bin/bash}}
>  #  {{spark-submit --class 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
> $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
> org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts 
> --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
> stock_ticks_cow --props /var/demo/config/kafka-source.properties 
> --schemaprovider-class 
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider}}
>  #  {{/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url 
> jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt 
> --base-path /user/hive/warehouse/stock_ticks_cow --database default --table 
> stock_ticks_cow}}
>  # java -cp 
> /var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:$HUDI_UTILITIES_BUNDLE
>  org.apache.hudi.utilities.HiveIncrementalPuller --hiveUrl 
> jdbc:hive2://hiveserver:10000 --hiveUser hive --hivePass hive 
> --extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt --sourceDb 
> default --sourceTable stock_ticks_cow --targetDb tmp --targetTable tempTable 
> --fromCommitTime 0 --maxCommits 1
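
For illustration, a minimal sketch of the kind of guard that would surface the 
missing template more clearly, assuming the template is loaded as a classpath 
resource (the class name and exception type here are illustrative):
{code:java}
import java.io.InputStream;

class TemplateCheck {
  static InputStream loadTemplate() {
    // Fail with a descriptive message instead of an NPE when the SQL
    // template is missing from the bundled jar.
    InputStream in = TemplateCheck.class
        .getResourceAsStream("/IncrementalPull.sqltemplate");
    if (in == null) {
      throw new IllegalStateException(
          "IncrementalPull.sqltemplate not found on the classpath");
    }
    return in;
  }
}
{code}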



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-486:

Parent: (was: HUDI-484)
Issue Type: Bug  (was: Sub-task)

> Improve documentation for using HiveIncrementalPuller
> -
>
> Key: HUDI-486
> URL: https://issues.apache.org/jira/browse/HUDI-486
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> For using HiveIncrementalPuller, one needs a lot of jars on the classpath. 
> These jars are not listed anywhere. As a result, one has to keep adding jars 
> to the classpath as each NoClassDefFoundError comes up during execution. 
> We should list the required jars so that it becomes easy for a first-time 
> user to use this tool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-486:

Parent: HUDI-498
Issue Type: Sub-task  (was: Bug)

> Improve documentation for using HiveIncrementalPuller
> -
>
> Key: HUDI-486
> URL: https://issues.apache.org/jira/browse/HUDI-486
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> For using HiveIncrementalPuller, one needs a lot of jars on the classpath. 
> These jars are not listed anywhere. As a result, one has to keep adding jars 
> to the classpath as each NoClassDefFoundError comes up during execution. 
> We should list the required jars so that it becomes easy for a first-time 
> user to use this tool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-496) Fix wrong check logic in HoodieIncrementalPull

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-496:

Parent: (was: HUDI-484)
Issue Type: Bug  (was: Sub-task)

> Fix wrong check logic in HoodieIncrementalPull
> --
>
> Key: HUDI-496
> URL: https://issues.apache.org/jira/browse/HUDI-496
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Wrong check logic; a correct statement cannot go through.
> {code:java}
> select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
> symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
> {code}
>  
> Exception stack trace
> {code:java}
> Exception in thread "main" 
> org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
> Incremental SQL does not have clause `_hoodie_commit_time` > 
> '%targetBasePath', which means its not pulling incrementally
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-497) Add end to end test for HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-497:

Parent: (was: HUDI-484)
Issue Type: Bug  (was: Sub-task)

> Add end to end test for HiveIncrementalPuller
> -
>
> Key: HUDI-497
> URL: https://issues.apache.org/jira/browse/HUDI-497
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Add end to end test for HiveIncrementalPuller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-496) Fix wrong check logic in HoodieIncrementalPull

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-496:

Parent: HUDI-498
Issue Type: Sub-task  (was: Bug)

> Fix wrong check logic in HoodieIncrementalPull
> --
>
> Key: HUDI-496
> URL: https://issues.apache.org/jira/browse/HUDI-496
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Wrong check logic; a correct statement cannot go through.
> {code:java}
> select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
> symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
> {code}
>  
> Exception stack trace
> {code:java}
> Exception in thread "main" 
> org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
> Incremental SQL does not have clause `_hoodie_commit_time` > 
> '%targetBasePath', which means its not pulling incrementally
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-485) Check for where clause is wrong in HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-485:

Parent: HUDI-498
Issue Type: Sub-task  (was: Bug)

> Check for where clause is wrong in HiveIncrementalPuller
> 
>
> Key: HUDI-485
> URL: https://issues.apache.org/jira/browse/HUDI-485
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull, newbie
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> HiveIncrementalPuller checks the clause in incrementalSqlFile like this: 
> if (!incrementalSQL.contains("`_hoodie_commit_time` > '%targetBasePath'"))
> { LOG.info("Incremental SQL : " + incrementalSQL + " does not contain 
> `_hoodie_commit_time` > %targetBasePath. Please add " + "this clause for 
> incremental to work properly."); throw new HoodieIncrementalPullSQLException( 
> "Incremental SQL does not have clause `_hoodie_commit_time` > 
> '%targetBasePath', which " + "means its not pulling incrementally"); }
> Basically we are trying to add a placeholder here which is later replaced 
> with config.fromCommitTime here: 
> incrementalPullSQLtemplate.add("incrementalSQL", 
> String.format(incrementalSQL, config.fromCommitTime));
> Hence, the above check needs to be replaced with `_hoodie_commit_time` > %s
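
For illustration, a minimal sketch of the corrected check proposed above (LOG 
and the exception type come from the surrounding class; the exact placeholder 
quoting and message wording are illustrative):
{code:java}
// Validate against the %s placeholder that String.format later fills
// with config.fromCommitTime.
if (!incrementalSQL.contains("`_hoodie_commit_time` > '%s'")) {
  LOG.info("Incremental SQL : " + incrementalSQL
      + " does not contain `_hoodie_commit_time` > '%s'. Please add this"
      + " clause for incremental pull to work properly.");
  throw new HoodieIncrementalPullSQLException(
      "Incremental SQL does not have clause `_hoodie_commit_time` > '%s',"
      + " which means it's not pulling incrementally");
}
{code}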



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-485) Check for where clause is wrong in HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-485:

Parent: (was: HUDI-484)
Issue Type: Bug  (was: Sub-task)

> Check for where clause is wrong in HiveIncrementalPuller
> 
>
> Key: HUDI-485
> URL: https://issues.apache.org/jira/browse/HUDI-485
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull, newbie
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> HiveIncrementalPuller checks the clause in incrementalSqlFile like this: 
> if (!incrementalSQL.contains("`_hoodie_commit_time` > '%targetBasePath'"))
> { LOG.info("Incremental SQL : " + incrementalSQL + " does not contain 
> `_hoodie_commit_time` > %targetBasePath. Please add " + "this clause for 
> incremental to work properly."); throw new HoodieIncrementalPullSQLException( 
> "Incremental SQL does not have clause `_hoodie_commit_time` > 
> '%targetBasePath', which " + "means its not pulling incrementally"); }
> Basically we are trying to add a placeholder here which is later replaced 
> with config.fromCommitTime here: 
> incrementalPullSQLtemplate.add("incrementalSQL", 
> String.format(incrementalSQL, config.fromCommitTime));
> Hence, the above check needs to be replaced with `_hoodie_commit_time` > %s



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-498) End to end to check HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)
lamber-ken created HUDI-498:
---

 Summary: End to end to check HiveIncrementalPuller
 Key: HUDI-498
 URL: https://issues.apache.org/jira/browse/HUDI-498
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Incremental Pull
Reporter: lamber-ken


End to end to check HiveIncrementalPuller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] XuQianJin-Stars opened a new pull request #1178: [HUDI-454] Redo hudi-cli log statements using SLF4J

2020-01-03 Thread GitBox
XuQianJin-Stars opened a new pull request #1178: [HUDI-454] Redo hudi-cli log 
statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1178
 
 
   ## What is the purpose of the pull request
   
   Redo hudi-cli log statements using SLF4J.
   
   ## Brief change log
   
   Modify AnnotationLocation checkstyle rule in checkstyle.xml
   
   ## Verify this pull request
   
   Use existing tests to verify the original module.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] XuQianJin-Stars closed pull request #1152: [HUDI-454] Redo hudi-cli log statements using SLF4J

2020-01-03 Thread GitBox
XuQianJin-Stars closed pull request #1152: [HUDI-454] Redo hudi-cli log 
statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1152
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (HUDI-497) Add end to end test for HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-497:
---

Assignee: lamber-ken

> Add end to end test for HiveIncrementalPuller
> -
>
> Key: HUDI-497
> URL: https://issues.apache.org/jira/browse/HUDI-497
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Add end to end test for HiveIncrementalPuller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] XuQianJin-Stars opened a new pull request #1177: [HUDI-463] Redo hudi-utilities log statements using SLF4J

2020-01-03 Thread GitBox
XuQianJin-Stars opened a new pull request #1177: [HUDI-463] Redo hudi-utilities 
log statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1177
 
 
   ## What is the purpose of the pull request
   
   Redo hudi-utilities log statements using SLF4J.
   
   ## Brief change log
   
   Modify AnnotationLocation checkstyle rule in checkstyle.xml
   
   ## Verify this pull request
   
   Use existing tests to verify the original module.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-497) Add end to end test for HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)
lamber-ken created HUDI-497:
---

 Summary: Add end to end test for HiveIncrementalPuller
 Key: HUDI-497
 URL: https://issues.apache.org/jira/browse/HUDI-497
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Incremental Pull
Reporter: lamber-ken


Add end to end test for HiveIncrementalPuller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-496) Fix wrong check logic in HoodieIncrementalPull

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-496:

Description: 
 Wrong check logic; a correct statement cannot go through.
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}
 

Exception stack trace
{code:java}
Exception in thread "main" 
org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
Incremental SQL does not have clause `_hoodie_commit_time` > '%targetBasePath', 
which means its not pulling incrementally
at 
org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)

{code}
 

 

  was:
 Wrong check logic; a correct statement cannot go through.
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}
 

Exception stack trace
{code:java}
Exception in thread "main" 
org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
Incremental SQL does not have clause `_hoodie_commit_time` > '%targetBasePath', 
which means its not pulling incrementally
at 
org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)

{code}
 

 


> Fix wrong check logic in HoodieIncrementalPull
> --
>
> Key: HUDI-496
> URL: https://issues.apache.org/jira/browse/HUDI-496
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Wrong check logic; a correct statement cannot go through.
> {code:java}
> select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
> symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
> {code}
>  
> Exception stack trace
> {code:java}
> Exception in thread "main" 
> org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
> Incremental SQL does not have clause `_hoodie_commit_time` > 
> '%targetBasePath', which means its not pulling incrementally
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-496) Fix wrong check logic in HoodieIncrementalPull

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-496:

Description: 
 Wrong check logic; a correct statement cannot go through.
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}
 

Exception stack trace
{code:java}
Exception in thread "main" 
org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
Incremental SQL does not have clause `_hoodie_commit_time` > '%targetBasePath', 
which means its not pulling incrementally
at 
org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)

{code}
 

 

  was:
 

Wrong check logic; a correct statement cannot go through.

 
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}
Exception stack trace
{code:java}
Exception in thread "main" 
org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
Incremental SQL does not have clause `_hoodie_commit_time` > '%targetBasePath', 
which means its not pulling incrementally
at 
org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)

{code}
 

 


> Fix wrong check logic in HoodieIncrementalPull
> --
>
> Key: HUDI-496
> URL: https://issues.apache.org/jira/browse/HUDI-496
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Wrong check logic; a correct statement cannot go through.
> {code:java}
> select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
> symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
> {code}
>  
> Exception stack trace
> {code:java}
> Exception in thread "main" 
> org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
> Incremental SQL does not have clause `_hoodie_commit_time` > 
> '%targetBasePath', which means its not pulling incrementally
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-496) Fix wrong check logic in HoodieIncrementalPull

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-496:

Parent: HUDI-484
Issue Type: Sub-task  (was: Bug)

> Fix wrong check logic in HoodieIncrementalPull
> --
>
> Key: HUDI-496
> URL: https://issues.apache.org/jira/browse/HUDI-496
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
>  
> Wrong check logic; a correct statement cannot go through.
>  
> {code:java}
> select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
> symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
> {code}
> Exception stack trace
> {code:java}
> Exception in thread "main" 
> org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
> Incremental SQL does not have clause `_hoodie_commit_time` > 
> '%targetBasePath', which means its not pulling incrementally
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-496) Fix wrong check logic in HoodieIncrementalPull

2020-01-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-496:
---

Assignee: lamber-ken

> Fix wrong check logic in HoodieIncrementalPull
> --
>
> Key: HUDI-496
> URL: https://issues.apache.org/jira/browse/HUDI-496
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
>  
> Wrong check logic; a correct statement cannot go through.
>  
> {code:java}
> select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
> symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
> {code}
> Exception stack trace
> {code:java}
> Exception in thread "main" 
> org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
> Incremental SQL does not have clause `_hoodie_commit_time` > 
> '%targetBasePath', which means its not pulling incrementally
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
> at 
> org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-496) Fix wrong check logic in HoodieIncrementalPull

2020-01-03 Thread lamber-ken (Jira)
lamber-ken created HUDI-496:
---

 Summary: Fix wrong check logic in HoodieIncrementalPull
 Key: HUDI-496
 URL: https://issues.apache.org/jira/browse/HUDI-496
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Incremental Pull
Reporter: lamber-ken


 

Wrong check logic; a correct statement cannot go through.

 
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where  
symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}
Exception stack trace
{code:java}
Exception in thread "main" 
org.apache.hudi.utilities.exception.HoodieIncrementalPullSQLException: 
Incremental SQL does not have clause `_hoodie_commit_time` > '%targetBasePath', 
which means its not pulling incrementally
at 
org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:192)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
at 
org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)

{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] XuQianJin-Stars closed pull request #1168: [HUDI-463] Redo hudi-utilities log statements using SLF4J

2020-01-03 Thread GitBox
XuQianJin-Stars closed pull request #1168: [HUDI-463] Redo hudi-utilities log 
statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1168
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] XuQianJin-Stars opened a new pull request #1168: [HUDI-463] Redo hudi-utilities log statements using SLF4J

2020-01-03 Thread GitBox
XuQianJin-Stars opened a new pull request #1168: [HUDI-463] Redo hudi-utilities 
log statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1168
 
 
   ## What is the purpose of the pull request
   
   Redo hudi-utilities log statements using SLF4J.
   
   ## Brief change log
   
   Modify AnnotationLocation checkstyle rule in checkstyle.xml
   
   ## Verify this pull request
   
   Use existing tests to verify the original module.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1115: [HUDI-392] Introduce DIstributedTestDataSource to generate test data

2020-01-03 Thread GitBox
yanghua commented on issue #1115: [HUDI-392] Introduce 
DIstributedTestDataSource to generate test data
URL: https://github.com/apache/incubator-hudi/pull/1115#issuecomment-570743819
 
 
   > @yanghua I was on a holiday break, apologies for the late response. Have 
you tried to run the test-suite ? If the current data generation methodology 
meets our needs, we might not require the DistributedTestDataSource. If not, we 
can tweak the current implementation or bring in the DistributedSource, wdyt ?
   
   Hi @n3nash No need to apologize, happy holidays. Yes, I have run the test 
suite several times. It works fine.
   
   IMO, the `DistributedTestDataSource` will not block the test suite. 
Actually, I think the test payload generation is a little confusing currently. 
I was thinking about how to refactor it. However, that work was interrupted by 
other things: integrating with the Azure pipeline and designing how to 
integrate Hudi with Flink.
   
   More details about integrating with Azure can be found here:
- 
https://github.com/apachehudi-ci/incubator-hudi/blob/master/azure-pipelines.yml
- https://dev.azure.com/vinoyang/Hudi/_build?definitionId=2
   
   It has not been done yet.
   
   cc @vinothchandar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on issue #1152: [HUDI-454] Redo hudi-cli log statements using SLF4J

2020-01-03 Thread GitBox
leesf commented on issue #1152: [HUDI-454] Redo hudi-cli log statements using 
SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1152#issuecomment-570736785
 
 
   @XuQianJin-Stars Thanks for opening this PR. Could you please reopen it and 
merge it to the redo-log branch? More context can be found at 
https://lists.apache.org/thread.html/9dc1f3a590413a5224a1a5ad835353e11b2b754e1ec7ad1ca0a55053%40%3Cdev.hudi.apache.org%3E,
 Thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on issue #1161: [HUDI-457]Redo hudi-common log statements using SLF4J

2020-01-03 Thread GitBox
leesf commented on issue #1161: [HUDI-457]Redo hudi-common log statements using 
SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1161#issuecomment-570736629
 
 
   @sev7e0 Thanks for opening this PR. Could you please reopen it and merge it 
to the redo-log branch? More context can be found at 
https://lists.apache.org/thread.html/9dc1f3a590413a5224a1a5ad835353e11b2b754e1ec7ad1ca0a55053%40%3Cdev.hudi.apache.org%3E,
 Thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on issue #1163: [HUDI-460] Redo hudi-integ-test log statements using SLF4J

2020-01-03 Thread GitBox
leesf commented on issue #1163: [HUDI-460] Redo hudi-integ-test log statements 
using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1163#issuecomment-570736469
 
 
   @wangxianghu Thanks for opening this PR. Could you please reopen it and 
merge it to the redo-log branch? More context can be found at 
https://lists.apache.org/thread.html/9dc1f3a590413a5224a1a5ad835353e11b2b754e1ec7ad1ca0a55053%40%3Cdev.hudi.apache.org%3E


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on issue #1168: [HUDI-463] Redo hudi-utilities log statements using SLF4J

2020-01-03 Thread GitBox
leesf commented on issue #1168: [HUDI-463] Redo hudi-utilities log statements 
using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1168#issuecomment-570736304
 
 
   @XuQianJin-Stars Thanks for opening this PR. Could you please reopen it and 
merge it to the redo-log branch? More context can be found at 
https://lists.apache.org/thread.html/9dc1f3a590413a5224a1a5ad835353e11b2b754e1ec7ad1ca0a55053%40%3Cdev.hudi.apache.org%3E




[GitHub] [incubator-hudi] zhedoubushishi commented on a change in pull request #1175: [HUDI-495] Update deprecated HBase API

2020-01-03 Thread GitBox
zhedoubushishi commented on a change in pull request #1175: [HUDI-495] Update 
deprecated HBase API
URL: https://github.com/apache/incubator-hudi/pull/1175#discussion_r362999490
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -287,13 +289,10 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
   hbaseConnection = getHBaseConnection();
 }
   }
-  HTable hTable = null;
-  try {
-hTable = (HTable) 
hbaseConnection.getTable(TableName.valueOf(tableName));
+  try (BufferedMutator mutator = 
hbaseConnection.getBufferedMutator(TableName.valueOf(tableName))) {
 
 Review comment:
   Sure. Here is the doc: 
https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html.
   ```
   The try-with-resources statement is a try statement that declares one or 
more resources.
   A resource is an object that must be closed after the program is finished 
with it.
   The try-with-resources statement ensures that each resource is closed at the 
end of the statement.
   ```
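   For illustration, a minimal sketch of that pattern applied to the 
`BufferedMutator` this PR adopts (the connection handling, row key, and column 
contents are assumptions for the example, not the actual HBaseIndex code):
   ```java
   import org.apache.hadoop.hbase.TableName;
   import org.apache.hadoop.hbase.client.BufferedMutator;
   import org.apache.hadoop.hbase.client.Connection;
   import org.apache.hadoop.hbase.client.Put;
   import org.apache.hadoop.hbase.util.Bytes;

   public class BufferedMutatorExample {
     // Hypothetical helper, shown only to illustrate try-with-resources.
     static void writeOne(Connection hbaseConnection, String tableName) throws Exception {
       // The mutator is declared as a resource, so it is closed automatically,
       // even if mutate() or flush() throws.
       try (BufferedMutator mutator =
           hbaseConnection.getBufferedMutator(TableName.valueOf(tableName))) {
         Put put = new Put(Bytes.toBytes("rowKey"));
         put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qualifier"), Bytes.toBytes("value"));
         mutator.mutate(put);  // buffered client-side
         mutator.flush();      // replaces the removed HTable.flushCommits()
       }
     }
   }
   ```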




[incubator-hudi] branch master updated (ff1113f -> 290278f)

2020-01-03 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from ff1113f  [HUDI-492]Fix show env all in hudi-cli
 add 290278f  [HUDI-118]: Options provided for passing properties to 
Cleaner, compactor and importer commands

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/cli/commands/CleansCommand.java| 37 +-
 .../hudi/cli/commands/CompactionCommand.java   | 20 --
 .../cli/commands/HDFSParquetImportCommand.java | 14 ++--
 .../org/apache/hudi/cli/commands/SparkMain.java| 82 +++---
 .../org/apache/hudi/utilities/HoodieCompactor.java |  2 +-
 .../org/apache/hudi/utilities/UtilHelpers.java |  8 ++-
 6 files changed, 138 insertions(+), 25 deletions(-)



[GitHub] [incubator-hudi] n3nash merged pull request #1080: [HUDI-118]: Options provided for passing properties to Cleaner, compactor and importer commands

2020-01-03 Thread GitBox
n3nash merged pull request #1080: [HUDI-118]: Options provided for passing 
properties to Cleaner, compactor and importer commands
URL: https://github.com/apache/incubator-hudi/pull/1080
 
 
   




[incubator-hudi] branch master updated (e1e5fe3 -> ff1113f)

2020-01-03 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from e1e5fe3  [MINOR] Fix error usage of String.format (#1169)
 add ff1113f  [HUDI-492]Fix show env all in hudi-cli

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/cli/commands/SparkEnvCommand.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [incubator-hudi] n3nash merged pull request #1172: [HUDI-492] Fix show env all in hudi-cli

2020-01-03 Thread GitBox
n3nash merged pull request #1172: [HUDI-492] Fix show env all in hudi-cli
URL: https://github.com/apache/incubator-hudi/pull/1172
 
 
   




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1175: [HUDI-495] Update deprecated HBase API

2020-01-03 Thread GitBox
n3nash commented on a change in pull request #1175: [HUDI-495] Update 
deprecated HBase API
URL: https://github.com/apache/incubator-hudi/pull/1175#discussion_r362996436
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -287,13 +289,10 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
   hbaseConnection = getHBaseConnection();
 }
   }
-  HTable hTable = null;
-  try {
-hTable = (HTable) 
hbaseConnection.getTable(TableName.valueOf(tableName));
+  try (BufferedMutator mutator = 
hbaseConnection.getBufferedMutator(TableName.valueOf(tableName))) {
 
 Review comment:
   Could you comment with a small snippet of the doc mentioning that this is 
the new API?




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1154: [HUDI-406] Added default partition path in TimestampBasedKeyGenerator

2020-01-03 Thread GitBox
n3nash commented on a change in pull request #1154: [HUDI-406] Added default 
partition path in TimestampBasedKeyGenerator
URL: https://github.com/apache/incubator-hudi/pull/1154#discussion_r362995932
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -52,29 +52,18 @@
  */
 public class DataSourceUtils {
 
-  /**
-   * Obtain value of the provided nullable field as string, denoted by dot 
notation. e.g: a.b.c
-   */
-  public static String getNullableNestedFieldValAsString(GenericRecord record, 
String fieldName) {
-try {
-  return getNestedFieldValAsString(record, fieldName);
-} catch (HoodieException e) {
-  return null;
-}
-  }
-
   /**
* Obtain value of the provided field as string, denoted by dot notation. 
e.g: a.b.c
*/
-  public static String getNestedFieldValAsString(GenericRecord record, String 
fieldName) {
-Object obj = getNestedFieldVal(record, fieldName);
-return obj.toString();
+  public static String getNestedFieldValAsString(GenericRecord record, String 
fieldName, boolean returnNullIfNotFound) {
+Object obj = getNestedFieldVal(record, fieldName, returnNullIfNotFound);
+return (obj == null) ? null : obj.toString();
 
 Review comment:
   Can we do `(obj == null) ? obj : obj.toString()` instead?




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-01-03 Thread GitBox
n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce 
configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362995044
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/UserDefinedBulkInsertPartitioner.java
 ##
 @@ -31,4 +31,6 @@
 public interface UserDefinedBulkInsertPartitioner<T extends HoodieRecordPayload> {
 
   JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputSparkPartitions);
+
+  boolean arePartitionRecordsSorted();
 
 Review comment:
   +1
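   For context, a minimal sketch of an implementor of the interface as 
extended above (hypothetical class; the coalesce-based repartitioning is an 
assumption for illustration only):
   ```java
   import org.apache.hudi.common.model.HoodieRecord;
   import org.apache.hudi.common.model.HoodieRecordPayload;
   import org.apache.hudi.table.UserDefinedBulkInsertPartitioner;
   import org.apache.spark.api.java.JavaRDD;

   // Hypothetical example: repartitions without sorting, so it reports false,
   // letting the write path pick the non-sorted bulk-insert handling.
   public class ExampleNonSortingPartitioner<T extends HoodieRecordPayload>
       implements UserDefinedBulkInsertPartitioner<T> {

     @Override
     public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records,
         int outputSparkPartitions) {
       return records.coalesce(outputSparkPartitions);
     }

     @Override
     public boolean arePartitionRecordsSorted() {
       return false;
     }
   }
   ```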




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-01-03 Thread GitBox
n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce 
configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362994779
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/NonSortPartitioner.java
 ##
 @@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.func.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.spark.api.java.JavaRDD;
+
+public class NonSortPartitioner
 
 Review comment:
   We are using "Sort" in the other names, should we be consistent ? 




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-01-03 Thread GitBox
n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce 
configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362994779
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/NonSortPartitioner.java
 ##
 @@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.func.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.spark.api.java.JavaRDD;
+
+public class NonSortPartitioner
 
 Review comment:
   We are using "Sort" in the other names, should we be consistent either 
"shuffling" or "sort" ? 




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-01-03 Thread GitBox
n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce 
configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362994623
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/BulkInsertMapFunctionForNonSortedRecords.java
 ##
 @@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.func.bulkinsert;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import org.apache.avro.Schema;
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.func.CopyOnWriteInsertHandler;
+import 
org.apache.hudi.func.CopyOnWriteLazyInsertIterable.HoodieInsertValueGenResult;
+import org.apache.hudi.table.HoodieTable;
+
+public class BulkInsertMapFunctionForNonSortedRecords
 
 Review comment:
   So in this case, we might end up with a skewed Spark task that ends up 
writing many files, due to a large number of records (for the same partition 
path) being sent to it?
   @vinothchandar 




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-01-03 Thread GitBox
n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce 
configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362994362
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/BulkInsertInternalPartitioner.java
 ##
 @@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.func.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.table.UserDefinedBulkInsertPartitioner;
+
+public abstract class BulkInsertInternalPartitioner<T extends HoodieRecordPayload>
+    implements UserDefinedBulkInsertPartitioner<T> {
 
 Review comment:
   Can we just keep UserDefinedBulkInsertPartitioner as an interface and then 
provide an abstract class here? It might be fine to change at Uber, but it's 
best to avoid a breaking change, since keeping the interface is simple in this 
case.




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-01-03 Thread GitBox
n3nash commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce 
configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362993888
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -367,20 +370,30 @@ public static SparkConf registerClasses(SparkConf conf) {
 }
   }
 
+  private BulkInsertMapFunction<T> getBulkInsertMapFunction(
+      boolean isSorted, String commitTime, HoodieWriteConfig config, HoodieTable<T> hoodieTable,
+      List<String> fileIDPrefixes) {
+    if (isSorted) {
+      return new BulkInsertMapFunctionForSortedRecords(
+          commitTime, config, hoodieTable, fileIDPrefixes);
+    }
+    return new BulkInsertMapFunctionForNonSortedRecords(
+        commitTime, config, hoodieTable, fileIDPrefixes);
+  }
+
   private JavaRDD<WriteStatus> bulkInsertInternal(JavaRDD<HoodieRecord<T>> 
dedupedRecords, String commitTime,
   HoodieTable<T> table, Option<UserDefinedBulkInsertPartitioner> 
bulkInsertPartitioner) {
 
 Review comment:
   @yihua Yes, we do need to keep this functionality; as @vinothchandar 
pointed out, Uber uses this. We directly pass an implementation rather than a config.




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1125: [HUDI-464] : Use Hive Exec Core

2020-01-03 Thread GitBox
n3nash commented on a change in pull request #1125: [HUDI-464] : Use Hive Exec 
Core
URL: https://github.com/apache/incubator-hudi/pull/1125#discussion_r362993485
 
 

 ##
 File path: hudi-client/pom.xml
 ##
 @@ -231,6 +231,13 @@
   hive-exec
   ${hive.version}
   test
 
 Review comment:
   Yes, the issues we faced were during test runs. @modi95 please confirm




[GitHub] [incubator-hudi] n3nash commented on issue #1115: [HUDI-392] Introduce DIstributedTestDataSource to generate test data

2020-01-03 Thread GitBox
n3nash commented on issue #1115: [HUDI-392] Introduce DIstributedTestDataSource 
to generate test data
URL: https://github.com/apache/incubator-hudi/pull/1115#issuecomment-570727891
 
 
   @yanghua I was on a holiday break, apologies for the late response. Have you 
tried to run the test-suite? If the current data generation methodology meets 
our needs, we might not require the DistributedTestDataSource. If not, we can 
tweak the current implementation or bring in the DistributedTestDataSource, wdyt?




[jira] [Updated] (HUDI-430) Design Inline FileSystem which supports embedding any file format (parquet/avro/etc)

2020-01-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-430:

Labels: pull-request-available  (was: )

> Design Inline FileSystem which supports embedding any file format 
> (parquet/avro/etc) 
> -
>
> Key: HUDI-430
> URL: https://issues.apache.org/jira/browse/HUDI-430
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> Basically the log file should be capable of embedding any file format. In 
> other words, if parquet is embedded, direct parquet reader should work on 
> reading the content directly. 





[GitHub] [incubator-hudi] nsivabalan opened a new pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-01-03 Thread GitBox
nsivabalan opened a new pull request #1176: [HUDI-430] Adding InlineFileSystem 
to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   - This PR adds a new FileSystem called InlineFileSystem to support embedding 
any file format as an InlineFile within a regular file. InlineFS will be used 
only in the read path. 
   - Added InMemoryFileSystem as part of the PR for the write path.
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 - *Added tests for InLineFileSystem and InMemoryFileSystem*
 - *Added tests exercising InlineFS with the Parquet and HFile formats*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Updated] (HUDI-495) Update deprecated HBase API

2020-01-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-495:

Labels: pull-request-available  (was: )

> Update deprecated HBase API
> ---
>
> Key: HUDI-495
> URL: https://issues.apache.org/jira/browse/HUDI-495
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
>
> Internally we are using HBase 2.x, which no longer supports 
> _*HTable.flushCommits()*_; it is replaced by _*BufferedMutator.flush()*_.
> Thus, for put and delete operations, we can use BufferedMutator instead.





[GitHub] [incubator-hudi] zhedoubushishi opened a new pull request #1175: [HUDI-495] Update deprecated HBase API

2020-01-03 Thread GitBox
zhedoubushishi opened a new pull request #1175: [HUDI-495] Update deprecated 
HBase API
URL: https://github.com/apache/incubator-hudi/pull/1175
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   Jira: https://jira.apache.org/jira/browse/HUDI-495
   
   Internally we are using HBase 2.x, which no longer supports 
```HTable.flushCommits()```; it is replaced by ```BufferedMutator.flush()```.
   Thus, for put and delete operations, we can use BufferedMutator instead.
   
   ## Brief change log
   
 - *Replace HTable with BufferedMutator*
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Created] (HUDI-495) Update deprecated HBase API

2020-01-03 Thread Wenning Ding (Jira)
Wenning Ding created HUDI-495:
-

 Summary: Update deprecated HBase API
 Key: HUDI-495
 URL: https://issues.apache.org/jira/browse/HUDI-495
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Wenning Ding


Internally we are using HBase 2.x, which no longer supports 
_*HTable.flushCommits()*_; it is replaced by _*BufferedMutator.flush()*_.
Thus, for put and delete operations, we can use BufferedMutator instead.





[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1120: [HUDI-440] Rework the hudi web site

2020-01-03 Thread GitBox
lamber-ken edited a comment on issue #1120: [HUDI-440] Rework the hudi web site
URL: https://github.com/apache/incubator-hudi/pull/1120#issuecomment-570623306
 
 
   > The hyperlinks back and forth are great. Let me make another pass for 
any typos/rewordings.
   > 
   > In the meantime, for ease of review
   > 
   > * Can you call out pages where there is new content (i.e. text) added, and 
whether any pages need to be retranslated to Chinese, for example
   > * Can you also prepare a README with complete instructions on how to get 
the new and old site up and running using a single command (as the old site 
was), if any special changes are needed.
   
   Thanks for reviewing again. I've done a lot of preparatory work; publishing 
the site works just as before. 
   
   ```
   incubator-hudi-doc
   ├── content
   ├── docs
   └── docs-new
   ```
   
   I highly recommend using Docker to build the new site; it should be up and 
running at `http://localhost:4000`:
   ```
   docker-compose build --no-cache && docker-compose up
   ```
   
   
   
   




[GitHub] [incubator-hudi] lamber-ken commented on issue #1167: [HUDI-484] Fix NPE when reading IncrementalPull.sqltemplate in HiveIncrementalPuller

2020-01-03 Thread GitBox
lamber-ken commented on issue #1167: [HUDI-484] Fix NPE when reading 
IncrementalPull.sqltemplate in HiveIncrementalPuller
URL: https://github.com/apache/incubator-hudi/pull/1167#issuecomment-570628522
 
 
   > Sure we can file a new JIRA for end-end test of HiveIncrementalPuller .. 
We can merge this once you address the test case comment. Thanks @lamber-ken !
   
   You are welcome, I'll ping you when I finish. I am debugging other 
issues (which requires switching branches).




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1120: [HUDI-440] Rework the hudi web site

2020-01-03 Thread GitBox
lamber-ken edited a comment on issue #1120: [HUDI-440] Rework the hudi web site
URL: https://github.com/apache/incubator-hudi/pull/1120#issuecomment-570626206
 
 
   Hi @vinothchandar, I talked with @leesf about the Chinese translation. We 
will not translate every page of the new site, just several core articles, to 
help users learn about Hudi quickly.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1120: [HUDI-440] Rework the hudi web site

2020-01-03 Thread GitBox
lamber-ken commented on issue #1120: [HUDI-440] Rework the hudi web site
URL: https://github.com/apache/incubator-hudi/pull/1120#issuecomment-570626206
 
 
   Hi @vinothchandar, I talked with @leesf about the Chinese translation. We 
will not translate every page of the new site, just several core articles, to 
help users learn about Hudi quickly.
   




[GitHub] [incubator-hudi] lamber-ken commented on issue #1120: [HUDI-440] Rework the hudi web site

2020-01-03 Thread GitBox
lamber-ken commented on issue #1120: [HUDI-440] Rework the hudi web site
URL: https://github.com/apache/incubator-hudi/pull/1120#issuecomment-570623306
 
 
   > The hyperlinks back and forth are great. Let me make another pass for 
any typos/rewordings.
   > 
   > In the meantime, for ease of review
   > 
   > * Can you call out pages where there is new content (i.e. text) added, and 
whether any pages need to be retranslated to Chinese, for example
   > * Can you also prepare a README with complete instructions on how to get 
the new and old site up and running using a single command (as the old site 
was), if any special changes are needed.
   
   Thanks for reviewing again. I've done a lot of preparatory work; publishing 
the site works just as before. 
   
   ```
   incubator-hudi-doc
   ├── content
   ├── docs
   └── docs-new
   ```
   
   




[GitHub] [incubator-hudi] vinothchandar commented on issue #1120: [HUDI-440] Rework the hudi web site

2020-01-03 Thread GitBox
vinothchandar commented on issue #1120: [HUDI-440] Rework the hudi web site
URL: https://github.com/apache/incubator-hudi/pull/1120#issuecomment-570621693
 
 
   The hyperlinks back and forth are great. Let me make another pass for 
any typos/rewordings. 
   
   In the meantime, for ease of review 
   - Can you call out pages where there is new content (i.e. text) added, and 
whether any pages need to be retranslated to Chinese, for example 
   - Can you also prepare a README with complete instructions on how to get the 
new and old site up and running using a single command (as the old site was), 
if any special changes are needed. 
   




[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-03 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r362799979
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,242 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.TableConfig;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTopics;
+  private Set<String> failedTopics;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) {
+this.tableExecutionObjects = new ArrayList<>();
+this.successTopics = new HashSet<>();
+this.failedTopics = new HashSet<>();
+this.jssc = jssc;
+String tableConfigFile = getCustomPropsFileName(args);
+FileSystem fs = FSUtils.getFs(tableConfigFile, jssc.hadoopConfiguration());
+    List<TableConfig> configList = UtilHelpers.readTableConfig(fs, new 
Path(tableConfigFile)).getConfigs();
+
+for (TableConfig config : configList) {
+  validateTableConfigObject(config);
+  populateTableExecutionObjectList(config, args);
+}
+  }
+
+  /*
+  validate if given object has all the necessary fields.
+  Throws IllegalArgumentException if any of the required fields are missing
+   */
+  private void validateTableConfigObject(TableConfig config) {
+if (Strings.isNullOrEmpty(config.getDatabase()) || 
Strings.isNullOrEmpty(config.getTableName()) || 
Strings.isNullOrEmpty(config.getPrimaryKeyField())
+|| Strings.isNullOrEmpty(config.getTopic())) {
+  throw new IllegalArgumentException("Please provide valid table config 
arguments!");
+}
+  }
+
+  private void populateTableExecutionObjectList(TableConfig config, String[] 
args) {
+TableExecutionObject executionObject;
+try {
+  final Config cfg = new Config();
+  String[] tableArgs = args.clone();
+  String targetBasePath = resetTarget(tableArgs, config.getDatabase(), 
config.getTableName());
+  JCommander cmd = new JCommander(cfg);
+  cmd.parse(tableArgs);
+  cfg.targetBasePath = targetBasePath;
+  FileSystem fs = FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration());
+  TypedProperties typedProperties = UtilHelpers.readConfig(fs, new 
Path(cfg.propsFilePath), cfg.configs).getConfig();
+  populateIngestionProps(typedProperties, config);
+  populateSchemaProviderProps(cfg, typedProperties, config);
+  populateHiveSyncProps(cfg, typedProperties, config);
+  executionObject = new TableExecutionObject();
+  executionObject.setConfig(cfg);
+  executionObject.setProperties(typedProperties);
+  executionObject.setTableConfig(config);
+  this.tableExecutionObjects.add(executionObject);
+} catch (Exception e) {
+  logger.error("Error while creating execution object for topic: " + 
config.getTopic(), e);
+  throw e;
+}
+  

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-03 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r362799446
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -171,6 +186,10 @@ public Operation convert(String value) throws 
ParameterException {
 public String propsFilePath =
 "file://" + System.getProperty("user.dir") + 
"/src/test/resources/delta-streamer-config/dfs-source.properties";
 
+@Parameter(names = {"--custom-props"}, description = "path to properties 
file on localfs or dfs, with configurations for "
 
 Review comment:
   Yes, this props file holds the table config objects needed for multi-table 
execution. 




[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-03 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r362799180
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -156,6 +167,10 @@ public Operation convert(String value) throws 
ParameterException {
 required = true)
 public String targetBasePath;
 
+@Parameter(names = {"--base-path-prefix"},
 
 Review comment:
   It was done with the idea that the complete path will consist of this 
base-path-prefix. In essence, the complete path is created like //. This was 
initially discussed with @vinothchandar here 
(https://issues.apache.org/jira/browse/HUDI-288?focusedCommentId=16977695&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16977695).
   I am open to more suggestions, @bvaradar. 
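   A tiny sketch of the path construction described above (a hypothetical 
helper; the `<base-path-prefix>/<database>/<tableName>` layout is my assumption 
from this thread, since the exact layout was elided):
   ```java
   // Hypothetical helper illustrating the assumed layout:
   // <base-path-prefix>/<database>/<tableName>
   class TargetPathHelper {
     static String buildTargetBasePath(String basePathPrefix, String database, String tableName) {
       return basePathPrefix + "/" + database + "/" + tableName;
     }
   }
   ```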




[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-03 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r362797626
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -80,16 +79,27 @@
   + "{\"name\": \"begin_lat\", \"type\": \"double\"},{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
   + "{\"name\": \"end_lat\", \"type\": \"double\"},{\"name\": \"end_lon\", 
\"type\": \"double\"},"
   + "{\"name\":\"fare\",\"type\": \"double\"}]}";
+  public static String GROCERY_PURCHASE_SCHEMA = 
"{\"type\":\"record\",\"name\":\"purchaserec\",\"fields\":["
 
 Review comment:
   @bvaradar I am still trying to understand why you want to distinguish the 
topic from which the record with TestRawTripPayload got ingested. All the test 
cases pass in the current setup as well. Could you please clarify your 
intention?




[jira] [Commented] (HUDI-485) Check for where clause is wrong in HiveIncrementalPuller

2020-01-03 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007426#comment-17007426
 ] 

Pratyaksh Sharma commented on HUDI-485:
---

Thank you for the explanation [~vinoth]. Will take a look at the previous 
attempt and try to implement what is needed now. 

> Check for where clause is wrong in HiveIncrementalPuller
> 
>
> Key: HUDI-485
> URL: https://issues.apache.org/jira/browse/HUDI-485
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull, newbie
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> HiveIncrementalPuller checks the clause in incrementalSqlFile like this -> 
> if (!incrementalSQL.contains("`_hoodie_commit_time` > '%targetBasePath'"))
> { LOG.info("Incremental SQL : " + incrementalSQL + " does not contain 
> `_hoodie_commit_time` > %targetBasePath. Please add " + "this clause for 
> incremental to work properly."); throw new HoodieIncrementalPullSQLException( 
> "Incremental SQL does not have clause `_hoodie_commit_time` > 
> '%targetBasePath', which " + "means its not pulling incrementally"); }
> Basically we are trying to add a placeholder here which is later replaced 
> with config.fromCommitTime here - 
> incrementalPullSQLtemplate.add("incrementalSQL", 
> String.format(incrementalSQL, config.fromCommitTime));
> Hence, the above check needs to be replaced with `_hoodie_commit_time` > %s
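A minimal sketch of the corrected guard and the substitution it protects 
(hypothetical method and query strings; the real code throws 
HoodieIncrementalPullSQLException rather than IllegalArgumentException):

```java
// Hypothetical illustration of the fix described above.
static String bindIncrementalSql(String incrementalSQL, String fromCommitTime) {
  // Corrected guard: check for the %s placeholder that is actually substituted.
  if (!incrementalSQL.contains("`_hoodie_commit_time` > '%s'")) {
    throw new IllegalArgumentException(
        "Incremental SQL does not have the clause `_hoodie_commit_time` > '%s'");
  }
  // The placeholder is later replaced with config.fromCommitTime.
  return String.format(incrementalSQL, fromCommitTime);
}
```

For example, `bindIncrementalSql("select col1 from db.tbl where 
`_hoodie_commit_time` > '%s'", "20200101000000")` would pass the check and 
bind the commit time, whereas the old check against `'%targetBasePath'` would 
always reject it.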





[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1154: [HUDI-406] Added default partition path in TimestampBasedKeyGenerator

2020-01-03 Thread GitBox
pratyakshsharma commented on issue #1154: [HUDI-406] Added default partition 
path in TimestampBasedKeyGenerator
URL: https://github.com/apache/incubator-hudi/pull/1154#issuecomment-570546153
 
 
   @bvaradar Done with the renaming.




[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1154: [HUDI-406] Added default partition path in TimestampBasedKeyGenerator

2020-01-03 Thread GitBox
pratyakshsharma commented on a change in pull request #1154: [HUDI-406] Added 
default partition path in TimestampBasedKeyGenerator
URL: https://github.com/apache/incubator-hudi/pull/1154#discussion_r362776898
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -52,29 +52,18 @@
  */
 public class DataSourceUtils {
 
-  /**
-   * Obtain value of the provided nullable field as string, denoted by dot 
notation. e.g: a.b.c
-   */
-  public static String getNullableNestedFieldValAsString(GenericRecord record, 
String fieldName) {
-try {
-  return getNestedFieldValAsString(record, fieldName);
-} catch (HoodieException e) {
-  return null;
-}
-  }
-
   /**
* Obtain value of the provided field as string, denoted by dot notation. 
e.g: a.b.c
*/
-  public static String getNestedFieldValAsString(GenericRecord record, String 
fieldName) {
-Object obj = getNestedFieldVal(record, fieldName);
-return obj.toString();
+  public static String getNestedFieldValAsString(GenericRecord record, String 
fieldName, boolean returnNullValue) {
+Object obj = getNestedFieldVal(record, fieldName, returnNullValue);
+return (obj == null) ? null : obj.toString();
   }
 
   /**
* Obtain value of the provided field, denoted by dot notation. e.g: a.b.c
*/
-  public static Object getNestedFieldVal(GenericRecord record, String 
fieldName) {
+  public static Object getNestedFieldVal(GenericRecord record, String 
fieldName, boolean returnNullValue) {
 
 Review comment:
   Done. 




[jira] [Commented] (HUDI-96) Use Command line options instead of positional arguments when launching spark applications from various CLI commands

2020-01-03 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007412#comment-17007412
 ] 

Pratyaksh Sharma commented on HUDI-96:
--

[~vbalaji] I have resumed work on this ticket and have tried to address most of 
the comments you gave earlier. I have raised a fresh PR for this. Please have a 
look and let me know your thoughts. Here is the PR - 
[https://github.com/apache/incubator-hudi/pull/1174].

> Use Command line options instead of positional arguments when launching spark 
> applications from various CLI commands
> 
>
> Key: HUDI-96
> URL: https://issues.apache.org/jira/browse/HUDI-96
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI, newbie
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: newbie, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hoodie CLI commands like compaction/rollback/repair/savepoints/parquet-import 
> rely on launching a spark application to perform their operations (look at 
> SparkMain.java). 
> SparkMain (look at SparkMain.main()) relies on positional arguments for 
> passing various CLI options. Instead, we should define proper CLI options in 
> SparkMain and use them (using JCommander) to improve readability and avoid 
> accidental errors at call sites. For example, see 
> com.uber.hoodie.utilities.HoodieCompactor
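A sketch of the suggested direction (hypothetical option and config names; the 
JCommander usage mirrors the `new JCommander(cfg); cmd.parse(args)` pattern 
already used elsewhere in the codebase):

```java
import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;

// Hypothetical config object: named options instead of positional arguments.
public class ExampleCompactorConfig {
  @Parameter(names = {"--base-path"}, description = "Base path for the dataset", required = true)
  public String basePath;

  @Parameter(names = {"--table-name"}, description = "Table name", required = true)
  public String tableName;

  @Parameter(names = {"--parallelism"}, description = "Parallelism for the spark job")
  public int parallelism = 1;

  public static void main(String[] args) {
    ExampleCompactorConfig cfg = new ExampleCompactorConfig();
    JCommander cmd = new JCommander(cfg);
    cmd.parse(args);  // fails fast with a readable error on missing or unknown options
    // ... launch the spark application using cfg.basePath, cfg.tableName ...
  }
}
```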





[jira] [Assigned] (HUDI-96) Use Command line options instead of positional arguments when launching spark applications from various CLI commands

2020-01-03 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-96?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma reassigned HUDI-96:


Assignee: Pratyaksh Sharma

> Use Command line options instead of positional arguments when launching spark 
> applications from various CLI commands
> 
>
> Key: HUDI-96
> URL: https://issues.apache.org/jira/browse/HUDI-96
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI, newbie
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: newbie, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hoodie CLI commands like compaction/rollback/repair/savepoints/parquet-import 
> rely on launching a spark application to perform their operations (look at 
> SparkMain.java). 
> SparkMain (look at SparkMain.main()) relies on positional arguments for 
> passing various CLI options. Instead, we should define proper CLI options in 
> SparkMain and use them (using JCommander) to improve readability and avoid 
> accidental errors at call sites. For example, see 
> com.uber.hoodie.utilities.HoodieCompactor





[GitHub] [incubator-hudi] pratyakshsharma opened a new pull request #1174: [HUDI-96]: Implemented command line options instead of positional arguments for CLI commands

2020-01-03 Thread GitBox
pratyakshsharma opened a new pull request #1174: [HUDI-96]: Implemented command 
line options instead of positional arguments for CLI commands
URL: https://github.com/apache/incubator-hudi/pull/1174
 
 
   1. Implemented command line options replacing positional arguments when 
launching spark applications from CLI commands. 
   2. Created AbstractCommandConfig class as the base class for all the CLI 
specific config objects. 




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1154: [HUDI-406] Added default partition path in TimestampBasedKeyGenerator

2020-01-03 Thread GitBox
bvaradar commented on a change in pull request #1154: [HUDI-406] Added default 
partition path in TimestampBasedKeyGenerator
URL: https://github.com/apache/incubator-hudi/pull/1154#discussion_r362736026
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -52,29 +52,18 @@
  */
 public class DataSourceUtils {
 
-  /**
-   * Obtain value of the provided nullable field as string, denoted by dot 
notation. e.g: a.b.c
-   */
-  public static String getNullableNestedFieldValAsString(GenericRecord record, 
String fieldName) {
-try {
-  return getNestedFieldValAsString(record, fieldName);
-} catch (HoodieException e) {
-  return null;
-}
-  }
-
   /**
* Obtain value of the provided field as string, denoted by dot notation. 
e.g: a.b.c
*/
-  public static String getNestedFieldValAsString(GenericRecord record, String 
fieldName) {
-Object obj = getNestedFieldVal(record, fieldName);
-return obj.toString();
+  public static String getNestedFieldValAsString(GenericRecord record, String 
fieldName, boolean returnNullValue) {
+Object obj = getNestedFieldVal(record, fieldName, returnNullValue);
+return (obj == null) ? null : obj.toString();
   }
 
   /**
* Obtain value of the provided field, denoted by dot notation. e.g: a.b.c
*/
-  public static Object getNestedFieldVal(GenericRecord record, String 
fieldName) {
+  public static Object getNestedFieldVal(GenericRecord record, String 
fieldName, boolean returnNullValue) {
 
 Review comment:
   Can you rename it to returnNullIfNotFound?




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-03 Thread GitBox
bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation 
type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#discussion_r362728271
 
 

 ##
 File path: hudi-common/src/main/avro/HoodieCommitMetadata.avsc
 ##
 @@ -129,6 +129,11 @@
  }],
  "default": null
   },
+  {
+ "name":"operateType",
+ "type":["null","string"],
 
 Review comment:
   Can we use an enum instead of the string type?
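   A hedged sketch of what that could look like in the avsc (the enum name and 
symbols are assumptions; the union with null is kept for backward compatibility):
   ```
   {
     "name": "operationType",
     "type": ["null", {
       "type": "enum",
       "name": "WriteOperationType",
       "symbols": ["INSERT", "UPSERT", "BULK_INSERT", "DELETE"]
     }],
     "default": null
   }
   ```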




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-03 Thread GitBox
bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation 
type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#discussion_r362727261
 
 

 ##
 File path: hudi-common/src/main/avro/HoodieCommitMetadata.avsc
 ##
 @@ -129,6 +129,11 @@
  }],
  "default": null
   },
+  {
+ "name":"operateType",
 
 Review comment:
   nit: operationType




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-03 Thread GitBox
bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation 
type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#discussion_r362728404
 
 

 ##
 File path: hudi-common/src/main/avro/HoodieCommitMetadata.avsc
 ##
 @@ -129,6 +129,11 @@
  }],
  "default": null
   },
+  {
+ "name":"operateType",
+ "type":["null","string"],
+ "default": null
 
 Review comment:
   Also, can you confirm whether the operation type is stored in the Avro 
objects when archiving?




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-03 Thread GitBox
bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation 
type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#discussion_r362727335
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##
 @@ -106,6 +108,14 @@ public void setCompacted(Boolean compacted) {
 return filePaths;
   }
 
+  public void setOperateType(WriteOperationType type) {
 
 Review comment:
   rename to operationType along with getters/setters.




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-03 Thread GitBox
bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation 
type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#discussion_r362727612
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
 ##
 @@ -82,10 +82,10 @@ object DataSourceWriteOptions {
 * Default: upsert()
 */
   val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
-  val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
-  val INSERT_OPERATION_OPT_VAL = "insert"
-  val UPSERT_OPERATION_OPT_VAL = "upsert"
-  val DELETE_OPERATION_OPT_VAL = "delete"
+  val BULK_INSERT_OPERATION_OPT_VAL = WriteOperationType.BULK_INSERT.toString
 
 Review comment:
   Let us not change these configuration values as it would cause backwards 
compatibility issues.




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-03 Thread GitBox
bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation 
type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#discussion_r362728805
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -510,21 +515,21 @@ private Partitioner getPartitioner(HoodieTable table, 
boolean isUpsert, Workload
   /**
* Commit changes performed at the given commitTime marker.
*/
-  public boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses) {
-    return commit(commitTime, writeStatuses, Option.empty());
+  public boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses, 
WriteOperationType operationType) {
 
 Review comment:
   As only one Hudi write operation is outstanding at a time, can you cache the 
last operation type in an instance variable within the HoodieWriteClient 
object, so that users don't need to explicitly pass it in this commit() call?
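   A rough sketch of the suggested caching (a hypothetical skeleton, not the 
actual HoodieWriteClient code; the local enum and abstract hooks stand in for 
the real write and commit logic):
   ```java
   import java.util.Map;
   import org.apache.hudi.common.util.Option;
   import org.apache.spark.api.java.JavaRDD;

   // Hypothetical skeleton: the client records the operation type when the
   // write starts, so commit() no longer needs it as a parameter.
   public abstract class WriteClientSketch<S> {
     enum WriteOperationType { INSERT, UPSERT, BULK_INSERT, DELETE }

     private WriteOperationType lastOperationType;

     public JavaRDD<S> upsert(JavaRDD<?> records, String commitTime) {
       this.lastOperationType = WriteOperationType.UPSERT; // cached at write time
       return doWrite(records, commitTime);
     }

     public boolean commit(String commitTime, JavaRDD<S> writeStatuses) {
       // commit() reads the cached type; callers no longer pass it explicitly.
       return doCommit(commitTime, writeStatuses, Option.empty(), lastOperationType);
     }

     protected abstract JavaRDD<S> doWrite(JavaRDD<?> records, String commitTime);

     protected abstract boolean doCommit(String commitTime, JavaRDD<S> statuses,
         Option<Map<String, String>> extraMetadata, WriteOperationType type);
   }
   ```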

