[jira] [Updated] (HUDI-596) KafkaConsumer need to be closed

2020-02-02 Thread dengziming (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dengziming updated HUDI-596:

Summary: KafkaConsumer need to be closed  (was: KafkaConsumer need to be 
close)

> KafkaConsumer need to be closed
> ---
>
> Key: HUDI-596
> URL: https://issues.apache.org/jira/browse/HUDI-596
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Utilities
>Reporter: dengziming
>Assignee: dengziming
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> `offsetGen.getNextOffsetRanges` is called periodically in the DeltaStreamer 
> application, and each call creates a `new KafkaConsumer(kafkaParams)` without 
> closing it, so an exception is thrown after a while.
> ```
> java.net.SocketException: Too many open files
>   at sun.nio.ch.Net.socket0(Native Method)
>   at sun.nio.ch.Net.socket(Net.java:411)
>   at sun.nio.ch.Net.socket(Net.java:404)
>   at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105)
>   at 
> sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
>   at java.nio.channels.SocketChannel.open(SocketChannel.java:145)
>   at org.apache.kafka.common.network.Selector.connect(Selector.java:211)
>   at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:864)
>   at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:265)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.trySend(ConsumerNetworkClient.java:485)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:261)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:218)
>   at 
> org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:274)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1774)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1742)
>   at 
> org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen.getNextOffsetRanges(KafkaOffsetGen.java:177)
>   at 
> org.apache.hudi.utilities.sources.JsonKafkaSource.fetchNewData(JsonKafkaSource.java:56)
>   at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:73)
>   at 
> org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:107)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:288)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-596) KafkaConsumer need to be close

2020-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-596:

Labels: pull-request-available  (was: )

> KafkaConsumer need to be close
> --
>
> Key: HUDI-596
> URL: https://issues.apache.org/jira/browse/HUDI-596
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Utilities
>Reporter: dengziming
>Assignee: dengziming
>Priority: Major
>  Labels: pull-request-available
>
> `offsetGen.getNextOffsetRanges` is called periodically in the DeltaStreamer 
> application, and each call creates a `new KafkaConsumer(kafkaParams)` without 
> closing it, so an exception is thrown after a while.
> ```
> java.net.SocketException: Too many open files
>   at sun.nio.ch.Net.socket0(Native Method)
>   at sun.nio.ch.Net.socket(Net.java:411)
>   at sun.nio.ch.Net.socket(Net.java:404)
>   at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105)
>   at 
> sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
>   at java.nio.channels.SocketChannel.open(SocketChannel.java:145)
>   at org.apache.kafka.common.network.Selector.connect(Selector.java:211)
>   at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:864)
>   at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:265)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.trySend(ConsumerNetworkClient.java:485)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:261)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:218)
>   at 
> org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:274)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1774)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1742)
>   at 
> org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen.getNextOffsetRanges(KafkaOffsetGen.java:177)
>   at 
> org.apache.hudi.utilities.sources.JsonKafkaSource.fetchNewData(JsonKafkaSource.java:56)
>   at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:73)
>   at 
> org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:107)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:288)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] dengziming opened a new pull request #1303: [HUDI-596] Close KafkaConsumer every time

2020-02-02 Thread GitBox
dengziming opened a new pull request #1303: [HUDI-596] Close KafkaConsumer 
every time
URL: https://github.com/apache/incubator-hudi/pull/1303
 
 
   if we create a KafkaConsumer but don't close it, an exception will be thrown 
after a while.
   
   ```
   java.net.SocketException: Too many open files
   at sun.nio.ch.Net.socket0(Native Method)
   at sun.nio.ch.Net.socket(Net.java:411)
   at sun.nio.ch.Net.socket(Net.java:404)
   ```
   
   ## What is the purpose of the pull request
   
   Use try-with-resources to close the KafkaConsumer; a sketch of the pattern follows.
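   
   A minimal sketch of the pattern, assuming the consumer is only needed to 
   fetch partition metadata (the class and method names below are illustrative, 
   not the exact Hudi code):
   
   ```java
   import java.util.List;
   import java.util.Map;
   
   import org.apache.kafka.clients.consumer.KafkaConsumer;
   import org.apache.kafka.common.PartitionInfo;
   
   public class KafkaOffsetGenSketch {
   
     // try-with-resources guarantees consumer.close() runs even on exceptions,
     // releasing the socket file descriptors previously leaked on every call
     static List<PartitionInfo> partitionsFor(Map<String, Object> kafkaParams, String topic) {
       try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(kafkaParams)) {
         return consumer.partitionsFor(topic);
       }
     }
   }
   ```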
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (HUDI-596) KafkaConsumer need to be close

2020-02-02 Thread dengziming (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dengziming reassigned HUDI-596:
---

Assignee: dengziming

> KafkaConsumer need to be close
> --
>
> Key: HUDI-596
> URL: https://issues.apache.org/jira/browse/HUDI-596
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Utilities
>Reporter: dengziming
>Assignee: dengziming
>Priority: Major
>
> `offsetGen.getNextOffsetRanges` is called periodically in the DeltaStreamer 
> application, and each call creates a `new KafkaConsumer(kafkaParams)` without 
> closing it, so an exception is thrown after a while.
> ```
> java.net.SocketException: Too many open files
>   at sun.nio.ch.Net.socket0(Native Method)
>   at sun.nio.ch.Net.socket(Net.java:411)
>   at sun.nio.ch.Net.socket(Net.java:404)
>   at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105)
>   at 
> sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
>   at java.nio.channels.SocketChannel.open(SocketChannel.java:145)
>   at org.apache.kafka.common.network.Selector.connect(Selector.java:211)
>   at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:864)
>   at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:265)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.trySend(ConsumerNetworkClient.java:485)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:261)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:218)
>   at 
> org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:274)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1774)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1742)
>   at 
> org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen.getNextOffsetRanges(KafkaOffsetGen.java:177)
>   at 
> org.apache.hudi.utilities.sources.JsonKafkaSource.fetchNewData(JsonKafkaSource.java:56)
>   at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:73)
>   at 
> org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:107)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:288)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-596) KafkaConsumer need to be close

2020-02-02 Thread dengziming (Jira)
dengziming created HUDI-596:
---

 Summary: KafkaConsumer need to be close
 Key: HUDI-596
 URL: https://issues.apache.org/jira/browse/HUDI-596
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Utilities
Reporter: dengziming


`offsetGen.getNextOffsetRanges` is called periodically in the DeltaStreamer 
application, and each call creates a `new KafkaConsumer(kafkaParams)` without 
closing it, so an exception is thrown after a while.

```
java.net.SocketException: Too many open files
at sun.nio.ch.Net.socket0(Native Method)
at sun.nio.ch.Net.socket(Net.java:411)
at sun.nio.ch.Net.socket(Net.java:404)
at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105)
at 
sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
at java.nio.channels.SocketChannel.open(SocketChannel.java:145)
at org.apache.kafka.common.network.Selector.connect(Selector.java:211)
at 
org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:864)
at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:265)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.trySend(ConsumerNetworkClient.java:485)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:261)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:218)
at 
org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:274)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1774)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1742)
at 
org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen.getNextOffsetRanges(KafkaOffsetGen.java:177)
at 
org.apache.hudi.utilities.sources.JsonKafkaSource.fetchNewData(JsonKafkaSource.java:56)
at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:73)
at 
org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:107)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:288)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] smarthi opened a new pull request #1302: [HUDI-595] code cleanup, refactoring code out of PR# 1159

2020-02-02 Thread GitBox
smarthi opened a new pull request #1302: [HUDI-595] code cleanup, refactoring 
code out of PR# 1159
URL: https://github.com/apache/incubator-hudi/pull/1302
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-595) code cleanup

2020-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-595:

Labels: pull-request-available  (was: )

> code cleanup 
> -
>
> Key: HUDI-595
> URL: https://issues.apache.org/jira/browse/HUDI-595
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> Moving out the cleanup code from PR# 1159 into a separate PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-595) code cleanup

2020-02-02 Thread Suneel Marthi (Jira)
Suneel Marthi created HUDI-595:
--

 Summary: code cleanup 
 Key: HUDI-595
 URL: https://issues.apache.org/jira/browse/HUDI-595
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Code Cleanup
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 0.5.2


Moving out the cleanup code from PR# 1159 into a separate PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #178

2020-02-02 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.02 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4:
bin
boot
conf
lib
LICENSE
NOTICE
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark_2.11[jar]
[INFO] hudi-utilities_2.11[jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle_2.11 [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle_2.11

[jira] [Closed] (HUDI-564) Improve unit test coverage for org.apache.hudi.common.table.log.HoodieLogFormatVersion

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed HUDI-564.
--

> Improve unit test coverage for 
> org.apache.hudi.common.table.log.HoodieLogFormatVersion
> --
>
> Key: HUDI-564
> URL: https://issues.apache.org/jira/browse/HUDI-564
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-583) cleanup legacy code

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed HUDI-583.
--

> cleanup legacy code 
> 
>
> Key: HUDI-583
> URL: https://issues.apache.org/jira/browse/HUDI-583
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/incubator-hudi/pull/1237]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-578) Trim recordKeyFields and partitionPathFields in ComplexKeyGenerator

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed HUDI-578.
--

> Trim recordKeyFields and partitionPathFields in ComplexKeyGenerator
> ---
>
> Key: HUDI-578
> URL: https://issues.apache.org/jira/browse/HUDI-578
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When using ComplexKeyGenerator with the options below:
> {code:java}
> option("hoodie.datasource.write.recordkey.field", "name, age").
> option("hoodie.datasource.write.keygenerator.class", 
> ComplexKeyGenerator.class.getName()).
> option("hoodie.datasource.write.partitionpath.field", "location, age").
> {code}
> and the data is 
> {code:java}
> "{ \"name\": \"name1\", \"ts\": 1574297893839, \"age\": 15, \"location\": 
> \"latitude\", \"sex\":\"male\"}"
> {code}
> the result is incorrect: age = null in the record key, and age = default in 
> the partition path.
> We should trim the partition path fields and record key fields in 
> ComplexKeyGenerator.
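
A hedged illustration of the trimming, assuming the fields arrive as a 
comma-separated configuration string (the names below are illustrative, not 
the exact ComplexKeyGenerator code):

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class KeyFieldTrimSketch {

  // "name, age" -> ["name", "age"]; without trim(), the lookup for " age"
  // misses the record field "age" and falls back to null/default values
  static List<String> splitAndTrim(String commaSeparatedFields) {
    return Arrays.stream(commaSeparatedFields.split(","))
        .map(String::trim)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(splitAndTrim("name, age")); // [name, age]
  }
}
{code}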



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-536) Update release notes to include KeyGenerator package changes

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-536.

Fix Version/s: (was: 0.5.2)
   0.5.1
   Resolution: Fixed

> Update release notes to include KeyGenerator package changes
> 
>
> Key: HUDI-536
> URL: https://issues.apache.org/jira/browse/HUDI-536
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Brandon Scheller
>Priority: Major
> Fix For: 0.5.1
>
>
> The change introduced here:
>  [https://github.com/apache/incubator-hudi/pull/1194]
> Refactors hudi keygenerators into their own package.
> We need to make this a backwards-compatible change or update the release 
> notes to address this.
> Specifically:
> org.apache.hudi.ComplexKeyGenerator -> 
> org.apache.hudi.keygen.ComplexKeyGenerator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-469) HoodieCommitMetadata only show first commit insert rows.

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-469:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> HoodieCommitMetadata only show first commit insert rows. 
> -
>
> Key: HUDI-469
> URL: https://issues.apache.org/jira/browse/HUDI-469
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI
>Reporter: cdmikechen
>Assignee: cdmikechen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I run the Hudi CLI to get insert rows, I found that it cannot report 
> insert rows for any commit other than the first one. The 
> {{HoodieCommitMetadata.fetchTotalInsertRecordsWritten()}} method uses 
> {{stat.getPrevCommit().equalsIgnoreCase("null")}} to filter for the first 
> commit. This check should be removed.
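
A hedged sketch of the corrected accounting, using stand-in types since the 
exact HoodieWriteStat accessors are not shown in this issue:

{code:java}
import java.util.List;
import java.util.Map;

public class CommitMetadataSketch {

  // stand-in for HoodieWriteStat; only the insert count matters here
  static class WriteStat {
    long numInserts;
  }

  // Sum inserts across every commit's stats with no prev-commit filter,
  // so commits after the first one are counted too.
  static long fetchTotalInsertRecordsWritten(Map<String, List<WriteStat>> partitionToWriteStats) {
    long total = 0;
    for (List<WriteStat> stats : partitionToWriteStats.values()) {
      for (WriteStat stat : stats) {
        total += stat.numInserts;
      }
    }
    return total;
  }
}
{code}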



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-389) Updates sent to diff partition for a given key with Global Index

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-389:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Updates sent to diff partition for a given key with Global Index 
> -
>
> Key: HUDI-389
> URL: https://issues.apache.org/jira/browse/HUDI-389
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Index
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>   Original Estimate: 48h
>  Time Spent: 20m
>  Remaining Estimate: 47h 40m
>
> Updates sent to a different partition for a given key with Global Index should 
> succeed by updating the record under the original partition. As of now, it 
> throws an exception. 
> [https://github.com/apache/incubator-hudi/issues/1021] 
>  
>  
> error log:
> {code:java}
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.timeline.HoodieActiveTimeline - Loaded instants 
> java.util.stream.ReferencePipeline$Head@d02b1c7
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Building file 
> system view for partition (2016/04/15)
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - #files found 
> in partition (2016/04/15) =0, Time taken =0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - 
> addFilesToView: NumFiles=0, FileGroupsCreationTime=0, StoreTimeTaken=0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.HoodieTableFileSystemView - Adding 
> file-groups for partition :2016/04/15, #FileGroups=0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Time to load 
> partition (2016/04/15) =0
>  14754 [Executor task launch worker-0] ERROR 
> com.uber.hoodie.table.HoodieCopyOnWriteTable - Error upserting bucketType 
> UPDATE for partition :0
>  java.util.NoSuchElementException: No value present
>  at com.uber.hoodie.common.util.Option.get(Option.java:112)
>  at com.uber.hoodie.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:71)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
>  at 
> com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:442)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at 

[jira] [Commented] (HUDI-389) Updates sent to diff partition for a given key with Global Index

2020-02-02 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028630#comment-17028630
 ] 

leesf commented on HUDI-389:


Fixed via master: 9c4217a3e1b9b728690282c914db2067117f4cfb

> Updates sent to diff partition for a given key with Global Index 
> -
>
> Key: HUDI-389
> URL: https://issues.apache.org/jira/browse/HUDI-389
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Index
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>   Original Estimate: 48h
>  Time Spent: 20m
>  Remaining Estimate: 47h 40m
>
> Updates sent to a different partition for a given key with Global Index should 
> succeed by updating the record under the original partition. As of now, it 
> throws an exception. 
> [https://github.com/apache/incubator-hudi/issues/1021] 
>  
>  
> error log:
> {code:java}
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.timeline.HoodieActiveTimeline - Loaded instants 
> java.util.stream.ReferencePipeline$Head@d02b1c7
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Building file 
> system view for partition (2016/04/15)
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - #files found 
> in partition (2016/04/15) =0, Time taken =0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - 
> addFilesToView: NumFiles=0, FileGroupsCreationTime=0, StoreTimeTaken=0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.HoodieTableFileSystemView - Adding 
> file-groups for partition :2016/04/15, #FileGroups=0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Time to load 
> partition (2016/04/15) =0
>  14754 [Executor task launch worker-0] ERROR 
> com.uber.hoodie.table.HoodieCopyOnWriteTable - Error upserting bucketType 
> UPDATE for partition :0
>  java.util.NoSuchElementException: No value present
>  at com.uber.hoodie.common.util.Option.get(Option.java:112)
>  at com.uber.hoodie.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:71)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
>  at 
> com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:442)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> 

[jira] [Updated] (HUDI-443) Add slides for Hadoop summit 2019, Bangalore to powered-by page

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-443:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Add slides for Hadoop summit 2019, Bangalore to powered-by page
> ---
>
> Key: HUDI-443
> URL: https://issues.apache.org/jira/browse/HUDI-443
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs, newbie
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add slides for the talk on Apache Hudi and Debezium at Hadoop Summit 2019, 
> Bangalore to the powered-by page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-415:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -
>
> Key: HUDI-415
> URL: https://issues.apache.org/jira/browse/HUDI-415
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi records the commit time after the first action completes. If there is a 
> heavy transformation before isEmpty(), then the commit time could be 
> inaccurate.
> {code:java}
> if (hoodieRecords.isEmpty()) { 
> log.info("new batch has no new records, skipping...") 
> return (true, common.util.Option.empty()) 
> } 
> commitTime = client.startCommit() 
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, 
> commitTime, operation)
> {code}
> For example, if I start the Spark job at 20190101 but *isEmpty()* runs for 2 
> hours, then the commit time in the .hoodie folder will be 201901010200. If 
> I use that commit time to ingest data starting from 201901010200 (from HDFS, 
> not using DeltaStreamer), then I will miss 2 hours of data.
> Is this setup intended? Can we move the commit time before isEmpty()?
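
A hedged sketch of the proposed reordering, using stand-in interfaces so that 
only the ordering is the point (the real writer code is Scala):

{code:java}
import java.util.Optional;

public class CommitTimeSketch {

  interface Client { String startCommit(); }
  interface Records { boolean isEmpty(); }  // may take hours to evaluate

  // Take the commit time before the expensive isEmpty() action, so the
  // timestamp reflects when the job started rather than when isEmpty() ended.
  static Optional<String> writeBatch(Client client, Records records) {
    String commitTime = client.startCommit();
    if (records.isEmpty()) {
      return Optional.empty(); // nothing to write; skip this batch
    }
    // ... doWriteOperation(client, records, commitTime, operation)
    return Optional.of(commitTime);
  }
}
{code}

One trade-off to note: starting the commit first leaves a requested-but-unused 
instant whenever the batch turns out to be empty, which rollback/cleaning would 
need to handle.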



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-343) Create a DOAP File for Hudi

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-343:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Create a DOAP File for Hudi
> ---
>
> Key: HUDI-343
> URL: https://issues.apache.org/jira/browse/HUDI-343
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> But please create a DOAP file for Hudi, where you can also list the
> release: https://projects.apache.org/create.html
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-377) Add Delete() support to HoodieDeltaStreamer

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-377:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Add Delete() support to HoodieDeltaStreamer
> ---
>
> Key: HUDI-377
> URL: https://issues.apache.org/jira/browse/HUDI-377
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>   Original Estimate: 72h
>  Time Spent: 20m
>  Remaining Estimate: 71h 40m
>
> Add Delete() support to HoodieDeltaStreamer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-311) Support AWS DMS source on DeltaStreamer

2020-02-02 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028629#comment-17028629
 ] 

leesf commented on HUDI-311:


Fixed via master: 350b0ecb4d137411c6231a1568add585c6d7b7d5

> Support AWS DMS source on DeltaStreamer
> ---
>
> Key: HUDI-311
> URL: https://issues.apache.org/jira/browse/HUDI-311
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://aws.amazon.com/dms/ seems like a one-stop shop for database change 
> logs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-106) Dynamically tune bloom filter entries

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-106:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Dynamically tune bloom filter entries
> -
>
> Key: HUDI-106
> URL: https://issues.apache.org/jira/browse/HUDI-106
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available, realtime-data-lakes
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Tuning bloom filters is currently based on a configuration that can be 
> cumbersome to tune per dataset to obtain good indexing performance. Let's add 
> support for dynamic bloom filters that can automatically achieve a 
> configured false-positive ratio depending on the number of entries. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-308) Avoid Renames for tracking state transitions of all actions on dataset

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-308:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Avoid Renames for tracking state transitions of all actions on dataset
> --
>
> Key: HUDI-308
> URL: https://issues.apache.org/jira/browse/HUDI-308
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
> Attachments: IMG_0118.jpg
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, we employ renames when transitioning states (REQUESTED, INFLIGHT, 
> COMPLETED) of all actions in Hudi. 
> The idea is to always create new files pertaining to each state of an action 
> (commit, compaction, clean, ...) being performed, and to have the 
> timeline management resolve conflicts when loading them from the .hoodie 
> folder. The archiving logic will clean up transient state files and archive 
> terminal state files. 
> This handling will be done consistently for all kinds of actions on datasets. 
> As part of this project, we will clean up unnecessary fields in metadata, 
> version them, and standardize on Avro/JSON.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-80) Incrementalize cleaning based on timeline metadata

2020-02-02 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-80?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028627#comment-17028627
 ] 

leesf commented on HUDI-80:
---

Fixed via master: 8ff06ddb0fdc8325382dbca4bd9dd4884b4e1110

> Incrementalize cleaning based on timeline metadata
> --
>
> Key: HUDI-80
> URL: https://issues.apache.org/jira/browse/HUDI-80
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, cleaning lists all partitions once and then picks the file groups 
> to clean from DFS. This is partly due to support for retaining last x 
> versions of a file group as well (in addition to the default mode of retaining 
> last x commits). This could be expensive in some cases. See 
> [https://github.com/apache/incubator-hudi/issues/613] for a reported issue. 
>  
> This task tracks work to 
>  * Determine if we can get rid of last X version cleaning mode 
>  * Implement cleaning based on file metadata in hudi timeline itself
>  * Resulting rpc calls to DFS would be O(number of filegroups 
> cleaned)/O(number of partitions touched in last X commits)
>  
> HUDI-1 implements a timeline service for writing, that promotes caching of 
> file system metadata. This can be implemented on top of that. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-91) Replace Databricks spark-avro with native spark-avro #628

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-91?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-91:
--
Fix Version/s: (was: 0.5.2)
   0.5.1

> Replace Databricks spark-avro with native spark-avro #628
> -
>
> Key: HUDI-91
> URL: https://issues.apache.org/jira/browse/HUDI-91
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration, Usability
>Reporter: Vinoth Chandar
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/628] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-25) Faster Incremental queries on Hoodie #492

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-25:
--
Fix Version/s: (was: 0.5.2)
   0.5.1

> Faster Incremental queries on Hoodie #492
> -
>
> Key: HUDI-25
> URL: https://issues.apache.org/jira/browse/HUDI-25
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hive incremental queries on Hoodie currently suffer a limitation of listing 
> all partitions when a datestr is not present (listing .hoodie and the 
> partitions) and end up throwing away a lot of the files (since the 
> `__hoodie__commit_time` column values filter out those files). This can be 
> very expensive, can impact query planning time, and sometimes causes 
> timeouts as well if the table is large. The original issue is tracked here - 
> [https://github.com/uber/hudi/issues/492]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-15) Add a delete() API to HoodieWriteClient as well as Spark datasource #531

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-15:
--
Fix Version/s: (was: 0.5.2)
   0.5.1

> Add a delete() API to HoodieWriteClient as well as Spark datasource #531
> 
>
> Key: HUDI-15
> URL: https://issues.apache.org/jira/browse/HUDI-15
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration, Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.5.1
>
>
> The delete API needs to be supported as a first-class citizen via DeltaStreamer, 
> WriteClient, and datasources. Currently there are two ways to delete, soft 
> deletes and hard deletes - https://hudi.apache.org/writing_data.html#deletes. 
> We need to ensure that for hard deletes we are able to leverage 
> EmptyHoodieRecordPayload with just the HoodieKey and an empty record value for 
> deleting.
> [https://github.com/uber/hudi/issues/531]
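
A hedged sketch of hard deletion via the payload named above; the import paths 
and the no-arg payload constructor are assumptions, not the finalized API:

{code:java}
import org.apache.hudi.EmptyHoodieRecordPayload;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.spark.api.java.JavaRDD;

public class HardDeleteSketch {

  // Map each key to a record whose payload is empty; on upsert the merge
  // keeps nothing for that key, which tombstones the existing row.
  static JavaRDD<HoodieRecord> toDeleteRecords(JavaRDD<HoodieKey> keys) {
    return keys.map(key -> new HoodieRecord<>(key, new EmptyHoodieRecordPayload()));
  }
}
{code}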



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-550) Add to Release Notes : Configuration Value change for Kafka Reset Offset Strategies

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-550:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Add to Release Notes : Configuration Value change for Kafka Reset Offset 
> Strategies
> ---
>
> Key: HUDI-550
> URL: https://issues.apache.org/jira/browse/HUDI-550
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.1
>
>
> Enum values have changed for configuring Kafka reset-offset strategies in 
> DeltaStreamer:
>    LARGEST -> LATEST
>   SMALLEST -> EARLIEST
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-547) Call out changes in package names due to scala cross compiling support

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-547:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Call out changes in package names due to scala cross compiling support
> --
>
> Key: HUDI-547
> URL: https://issues.apache.org/jira/browse/HUDI-547
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.1
>
>
> Two versions of each of the below packages need to be built. 
> hudi-spark is hudi-spark_2.11 and hudi-spark_2.12
> hudi-utilities is hudi-utilities_2.11 and hudi-utilities_2.12
> hudi-spark-bundle is hudi-spark-bundle_2.11 and hudi-spark-bundle_2.12
> hudi-utilities-bundle is hudi-utilities-bundle_2.11 and 
> hudi-utilities-bundle_2.12
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-12) Upgrade Hudi to Spark 2.4

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-12:
--
Fix Version/s: (was: 0.5.2)
   0.5.1

> Upgrade Hudi to Spark 2.4
> -
>
> Key: HUDI-12
> URL: https://issues.apache.org/jira/browse/HUDI-12
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://github.com/uber/hudi/issues/549



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-238) Make separate release for hudi spark/scala based packages for scala 2.12

2020-02-02 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028623#comment-17028623
 ] 

leesf commented on HUDI-238:


Fixed via master: 292c1e2ff436a711cbbb53ad9b1f6232121d53ec

> Make separate release for hudi spark/scala based packages for scala 2.12 
> -
>
> Key: HUDI-238
> URL: https://issues.apache.org/jira/browse/HUDI-238
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Release & Administrative, Usability
>Reporter: Balaji Varadarajan
>Assignee: Tadas Sugintas
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/881#issuecomment-528700749]
> Suspects: 
> h3. Hudi utilities package 
> bringing in spark-streaming-kafka-0.8* 
> {code:java}
> [INFO] Scanning for projects...
> [INFO] 
> [INFO] ---< org.apache.hudi:hudi-utilities 
> >---
> [INFO] Building hudi-utilities 0.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [INFO] 
> [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-utilities 
> ---
> [INFO] org.apache.hudi:hudi-utilities:jar:0.5.0-SNAPSHOT
> [INFO] ...
> [INFO] +- org.apache.hudi:hudi-client:jar:0.5.0-SNAPSHOT:compile
>...
> [INFO] 
> [INFO] +- org.apache.hudi:hudi-spark:jar:0.5.0-SNAPSHOT:compile
> [INFO] |  \- org.scala-lang:scala-library:jar:2.11.8:compile
> [INFO] +- log4j:log4j:jar:1.2.17:compile
>...
> [INFO] +- org.apache.spark:spark-core_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  +- com.twitter:chill_2.11:jar:0.8.0:provided
> [INFO] |  +- com.twitter:chill-java:jar:0.8.0:provided
> [INFO] |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  +- org.apache.spark:spark-launcher_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-common_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-shuffle_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-unsafe_2.11:jar:2.1.0:provided
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:provided
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:provided
> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.5:provided
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:provided
> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:provided
> [INFO] |  +- org.slf4j:slf4j-api:jar:1.7.16:compile
> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.16:compile
> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:provided
> [INFO] |  +- org.xerial.snappy:snappy-java:jar:1.1.2.6:compile
> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.3.0:compile
> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:provided
> [INFO] |  +- commons-net:commons-net:jar:2.2:provided
>
> [INFO] +- org.apache.spark:spark-sql_2.11:jar:2.1.0:provided
> [INFO] |  +- com.univocity:univocity-parsers:jar:2.2.1:provided
> [INFO] |  +- org.apache.spark:spark-sketch_2.11:jar:2.1.0:provided
> [INFO] |  \- org.apache.spark:spark-catalyst_2.11:jar:2.1.0:provided
> [INFO] | +- org.codehaus.janino:janino:jar:3.0.0:provided
> [INFO] | +- org.codehaus.janino:commons-compiler:jar:3.0.0:provided
> [INFO] | \- org.antlr:antlr4-runtime:jar:4.5.3:provided
> [INFO] +- com.databricks:spark-avro_2.11:jar:4.0.0:provided
> [INFO] +- org.apache.spark:spark-streaming_2.11:jar:2.1.0:compile
> [INFO] +- org.apache.spark:spark-streaming-kafka-0-8_2.11:jar:2.1.0:compile
> [INFO] |  \- org.apache.kafka:kafka_2.11:jar:0.8.2.1:compile
> [INFO] | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.2:compile
> [INFO] | +- 
> org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.2:compile
> [INFO] | \- org.apache.kafka:kafka-clients:jar:0.8.2.1:compile
> [INFO] +- io.dropwizard.metrics:metrics-core:jar:4.0.2:compile
> [INFO] +- org.antlr:stringtemplate:jar:4.0.2:compile
> [INFO] |  \- org.antlr:antlr-runtime:jar:3.3:compile
> [INFO] +- com.beust:jcommander:jar:1.72:compile
> [INFO] +- com.twitter:bijection-avro_2.11:jar:0.9.2:compile
> [INFO] |  \- com.twitter:bijection-core_2.11:jar:0.9.2:compile
> [INFO] +- io.confluent:kafka-avro-serializer:jar:3.0.0:compile
> [INFO] +- io.confluent:common-config:jar:3.0.0:compile
> [INFO] +- io.confluent:common-utils:jar:3.0.0:compile
> [INFO] |  \- com.101tec:zkclient:jar:0.5:compile
> [INFO] +- 

[jira] [Updated] (HUDI-238) Make separate release for hudi spark/scala based packages for scala 2.12

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-238:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Make separate release for hudi spark/scala based packages for scala 2.12 
> -
>
> Key: HUDI-238
> URL: https://issues.apache.org/jira/browse/HUDI-238
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Release & Administrative, Usability
>Reporter: Balaji Varadarajan
>Assignee: Tadas Sugintas
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/881#issuecomment-528700749]
> Suspects: 
> h3. Hudi utilities package 
> bringing in spark-streaming-kafka-0.8* 
> {code:java}
> [INFO] Scanning for projects...
> [INFO] 
> [INFO] ---< org.apache.hudi:hudi-utilities 
> >---
> [INFO] Building hudi-utilities 0.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [INFO] 
> [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-utilities 
> ---
> [INFO] org.apache.hudi:hudi-utilities:jar:0.5.0-SNAPSHOT
> [INFO] ...
> [INFO] +- org.apache.hudi:hudi-client:jar:0.5.0-SNAPSHOT:compile
>...
> [INFO] 
> [INFO] +- org.apache.hudi:hudi-spark:jar:0.5.0-SNAPSHOT:compile
> [INFO] |  \- org.scala-lang:scala-library:jar:2.11.8:compile
> [INFO] +- log4j:log4j:jar:1.2.17:compile
>...
> [INFO] +- org.apache.spark:spark-core_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  +- com.twitter:chill_2.11:jar:0.8.0:provided
> [INFO] |  +- com.twitter:chill-java:jar:0.8.0:provided
> [INFO] |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  +- org.apache.spark:spark-launcher_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-common_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-shuffle_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-unsafe_2.11:jar:2.1.0:provided
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:provided
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:provided
> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.5:provided
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:provided
> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:provided
> [INFO] |  +- org.slf4j:slf4j-api:jar:1.7.16:compile
> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.16:compile
> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:provided
> [INFO] |  +- org.xerial.snappy:snappy-java:jar:1.1.2.6:compile
> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.3.0:compile
> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:provided
> [INFO] |  +- commons-net:commons-net:jar:2.2:provided
>
> [INFO] +- org.apache.spark:spark-sql_2.11:jar:2.1.0:provided
> [INFO] |  +- com.univocity:univocity-parsers:jar:2.2.1:provided
> [INFO] |  +- org.apache.spark:spark-sketch_2.11:jar:2.1.0:provided
> [INFO] |  \- org.apache.spark:spark-catalyst_2.11:jar:2.1.0:provided
> [INFO] | +- org.codehaus.janino:janino:jar:3.0.0:provided
> [INFO] | +- org.codehaus.janino:commons-compiler:jar:3.0.0:provided
> [INFO] | \- org.antlr:antlr4-runtime:jar:4.5.3:provided
> [INFO] +- com.databricks:spark-avro_2.11:jar:4.0.0:provided
> [INFO] +- org.apache.spark:spark-streaming_2.11:jar:2.1.0:compile
> [INFO] +- org.apache.spark:spark-streaming-kafka-0-8_2.11:jar:2.1.0:compile
> [INFO] |  \- org.apache.kafka:kafka_2.11:jar:0.8.2.1:compile
> [INFO] | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.2:compile
> [INFO] | +- 
> org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.2:compile
> [INFO] | \- org.apache.kafka:kafka-clients:jar:0.8.2.1:compile
> [INFO] +- io.dropwizard.metrics:metrics-core:jar:4.0.2:compile
> [INFO] +- org.antlr:stringtemplate:jar:4.0.2:compile
> [INFO] |  \- org.antlr:antlr-runtime:jar:3.3:compile
> [INFO] +- com.beust:jcommander:jar:1.72:compile
> [INFO] +- com.twitter:bijection-avro_2.11:jar:0.9.2:compile
> [INFO] |  \- com.twitter:bijection-core_2.11:jar:0.9.2:compile
> [INFO] +- io.confluent:kafka-avro-serializer:jar:3.0.0:compile
> [INFO] +- io.confluent:common-config:jar:3.0.0:compile
> [INFO] +- io.confluent:common-utils:jar:3.0.0:compile
> [INFO] |  \- com.101tec:zkclient:jar:0.5:compile
> [INFO] +- io.confluent:kafka-schema-registry-client:jar:3.0.0:compile

[jira] [Updated] (HUDI-106) Dynamically tune bloom filter entries

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-106:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Dynamically tune bloom filter entries
> -
>
> Key: HUDI-106
> URL: https://issues.apache.org/jira/browse/HUDI-106
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available, realtime-data-lakes
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Tuning bloom filters is currently based on a configuration that can be 
> cumbersome to tune per dataset to obtain good indexing performance. Let's add 
> support for Dynamic Bloom Filters, which can automatically achieve a 
> configured false positive ratio depending on the number of entries. 
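> A minimal sketch of the sizing math such a dynamic filter would apply per 
> batch of entries (illustrative only - the class below is not Hudi's API, and 
> the entry count and target ratio shown are assumptions):
> {code:java}
> // Standard bloom filter sizing: bits m = -n*ln(p)/(ln 2)^2, hashes k = (m/n)*ln 2.
> public class BloomFilterSizing {
> 
>   static long optimalNumBits(long numEntries, double falsePositiveRatio) {
>     return (long) Math.ceil(-numEntries * Math.log(falsePositiveRatio)
>         / (Math.log(2) * Math.log(2)));
>   }
> 
>   static int optimalNumHashFunctions(long numEntries, long numBits) {
>     return Math.max(1, (int) Math.round((double) numBits / numEntries * Math.log(2)));
>   }
> 
>   public static void main(String[] args) {
>     long n = 60_000;       // expected entries per file (assumption)
>     double p = 1.0e-9;     // configured false positive ratio (assumption)
>     long m = optimalNumBits(n, p);
>     System.out.printf("entries=%d fpp=%.0e -> bits=%d, hashes=%d%n",
>         n, p, m, optimalNumHashFunctions(n, m));
>   }
> }
> {code}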



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-25) Faster Incremental queries on Hoodie #492

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-25:
--
Fix Version/s: (was: 0.5.1)
   0.5.2

> Faster Incremental queries on Hoodie #492
> -
>
> Key: HUDI-25
> URL: https://issues.apache.org/jira/browse/HUDI-25
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hive incremental queries on Hoodie currently suffer a limitation of listing 
> all partitions when a datestr is not present (listing .hoodie and the 
> partitions) and then throwing away a lot of the files (since the 
> `_hoodie_commit_time` column values filter out those files). This can be 
> very expensive, can impact query planning time, and sometimes causes 
> timeouts as well if the table is large. The original issue is tracked here - 
> [https://github.com/uber/hudi/issues/492]
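> A rough reader-side sketch of what faster incremental consumption looks like 
> through the Spark datasource (the option keys below are assumptions based on 
> the 0.5.x datasource and may differ by release; the table path is hypothetical):
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> 
> public class IncrementalReadSketch {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder().appName("hudi-incremental").getOrCreate();
>     // Pull only records written after the given instant, instead of listing
>     // all partitions and filtering on the commit-time meta column.
>     Dataset<Row> changed = spark.read().format("org.apache.hudi")
>         .option("hoodie.datasource.view.type", "incremental")                 // assumed 0.5.x key
>         .option("hoodie.datasource.read.begin.instanttime", "20190101000000") // assumed key
>         .load("/path/to/hudi/table");
>     changed.show();
>   }
> }
> {code}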



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-118) Hudi CLI : Provide options for passing properties to Compactor, Cleaner and ParquetImporter

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-118:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Hudi CLI : Provide options for passing properties to Compactor, Cleaner and 
> ParquetImporter 
> 
>
> Key: HUDI-118
> URL: https://issues.apache.org/jira/browse/HUDI-118
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI, Common Core, newbie
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For non-trivial CLI operations, we have standalone scripts in hudi-utilities 
> that users can call directly using spark-submit (usually). We also have 
> commands in hudi-cli to invoke them directly from the hudi-cli shell.
> There was an earlier effort to allow users to pass properties directly to the 
> scripts in hudi-utilities, but we still need to provide the same functionality 
> in the corresponding hudi-cli commands.
> In hudi-cli, the Compaction (schedule/compact), Cleaner and HDFSParquetImporter 
> commands do not have an option to pass a DFS properties file. This is a 
> followup to PR [https://github.com/apache/incubator-hudi/pull/691]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-547) Call out changes in package names due to scala cross compiling support

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-547:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Call out changes in package names due to scala cross compiling support
> --
>
> Key: HUDI-547
> URL: https://issues.apache.org/jira/browse/HUDI-547
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.2
>
>
> Two versions of each of the below packages need to be built: 
> hudi-spark as hudi-spark_2.11 and hudi-spark_2.12
> hudi-utilities as hudi-utilities_2.11 and hudi-utilities_2.12
> hudi-spark-bundle as hudi-spark-bundle_2.11 and hudi-spark-bundle_2.12
> hudi-utilities-bundle as hudi-utilities-bundle_2.11 and 
> hudi-utilities-bundle_2.12
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-469) HoodieCommitMetadata only shows first commit insert rows.

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-469:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> HoodieCommitMetadata only shows first commit insert rows. 
> -
>
> Key: HUDI-469
> URL: https://issues.apache.org/jira/browse/HUDI-469
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI
>Reporter: cdmikechen
>Assignee: cdmikechen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I run the Hudi CLI to get insert rows, I found that it cannot get 
> insert rows that are not in the first commit. The 
> {{HoodieCommitMetadata.fetchTotalInsertRecordsWritten()}} method uses 
> {{stat.getPrevCommit().equalsIgnoreCase("null")}} to filter for the first 
> commit. This check should be removed, as sketched below.
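> A minimal sketch of the proposed fix with the first-commit filter dropped 
> (assumes hudi-common's HoodieWriteStat on the classpath; the package and 
> getter names are assumptions):
> {code:java}
> import java.util.List;
> import java.util.Map;
> import org.apache.hudi.common.model.HoodieWriteStat;  // package assumed
> 
> class CommitMetadataSketch {
>   // Sum inserts across all write stats instead of only those whose
>   // previous commit is "null" (i.e. only the first commit).
>   long fetchTotalInsertRecordsWritten(Map<String, List<HoodieWriteStat>> partitionToWriteStats) {
>     long total = 0;
>     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
>       for (HoodieWriteStat stat : stats) {
>         // previously skipped unless stat.getPrevCommit().equalsIgnoreCase("null")
>         total += stat.getNumInserts();
>       }
>     }
>     return total;
>   }
> }
> {code}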



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-443) Add slides for Hadoop summit 2019, Bangalore to powered-by page

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-443:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Add slides for Hadoop summit 2019, Bangalore to powered-by page
> ---
>
> Key: HUDI-443
> URL: https://issues.apache.org/jira/browse/HUDI-443
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs, newbie
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add slides for the talk on Apache Hudi and Debezium at Hadoop Summit 2019, 
> Bangalore to the powered-by page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-311) Support AWS DMS source on DeltaStreamer

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-311:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Support AWS DMS source on DeltaStreamer
> ---
>
> Key: HUDI-311
> URL: https://issues.apache.org/jira/browse/HUDI-311
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://aws.amazon.com/dms/ seems like a one-stop shop for database change 
> logs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-12) Upgrade Hudi to Spark 2.4

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-12:
--
Fix Version/s: (was: 0.5.1)
   0.5.2

> Upgrade Hudi to Spark 2.4
> -
>
> Key: HUDI-12
> URL: https://issues.apache.org/jira/browse/HUDI-12
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://github.com/uber/hudi/issues/549



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-248) CLI doesn't allow rolling back a Delta commit

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-248:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> CLI doesn't allow rolling back a Delta commit
> -
>
> Key: HUDI-248
> URL: https://issues.apache.org/jira/browse/HUDI-248
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI, Usability
>Reporter: Rahul Bhartia
>Assignee: leesf
>Priority: Minor
>  Labels: aws-emr, pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L128]
>  
> When trying to find a match for the passed-in commit value, the "commit 
> rollback" command always defaults to using HoodieTimeline.COMMIT_ACTION - and 
> hence doesn't allow rolling back delta commits.
> Note: delta commits can be rolled back using a HoodieWriteClient, so it seems 
> it's just a matter of matching against both COMMIT_ACTION and 
> DELTA_COMMIT_ACTION in the CLI, as sketched below.
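> A sketch of the matching change (COMMIT_ACTION and DELTA_COMMIT_ACTION are 
> HoodieTimeline constants; the packages and the surrounding predicate are 
> illustrative assumptions, not the actual CommitsCommand code):
> {code:java}
> import org.apache.hudi.common.table.timeline.HoodieInstant;   // package assumed
> import org.apache.hudi.common.table.timeline.HoodieTimeline;  // package assumed
> 
> class RollbackMatchSketch {
>   // Accept both commit and delta-commit instants when matching the
>   // user-supplied commit time to roll back.
>   static boolean isRollbackCandidate(HoodieInstant instant, String commitTime) {
>     String action = instant.getAction();
>     return instant.getTimestamp().equals(commitTime)
>         && (HoodieTimeline.COMMIT_ACTION.equals(action)
>             || HoodieTimeline.DELTA_COMMIT_ACTION.equals(action));
>   }
> }
> {code}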



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-238) Make separate release for hudi spark/scala based packages for scala 2.12

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-238:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Make separate release for hudi spark/scala based packages for scala 2.12 
> -
>
> Key: HUDI-238
> URL: https://issues.apache.org/jira/browse/HUDI-238
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Release & Administrative, Usability
>Reporter: Balaji Varadarajan
>Assignee: Tadas Sugintas
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/881#issuecomment-528700749]
> Suspects: 
> h3. Hudi utilities package 
> bringing in spark-streaming-kafka-0.8* 
> {code:java}
> [INFO] Scanning for projects...
> [INFO] 
> [INFO] ---< org.apache.hudi:hudi-utilities 
> >---
> [INFO] Building hudi-utilities 0.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [INFO] 
> [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-utilities 
> ---
> [INFO] org.apache.hudi:hudi-utilities:jar:0.5.0-SNAPSHOT
> [INFO] ...
> [INFO] +- org.apache.hudi:hudi-client:jar:0.5.0-SNAPSHOT:compile
>...
> [INFO] 
> [INFO] +- org.apache.hudi:hudi-spark:jar:0.5.0-SNAPSHOT:compile
> [INFO] |  \- org.scala-lang:scala-library:jar:2.11.8:compile
> [INFO] +- log4j:log4j:jar:1.2.17:compile
>...
> [INFO] +- org.apache.spark:spark-core_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  +- com.twitter:chill_2.11:jar:0.8.0:provided
> [INFO] |  +- com.twitter:chill-java:jar:0.8.0:provided
> [INFO] |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  +- org.apache.spark:spark-launcher_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-common_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-shuffle_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-unsafe_2.11:jar:2.1.0:provided
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:provided
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:provided
> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.5:provided
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:provided
> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:provided
> [INFO] |  +- org.slf4j:slf4j-api:jar:1.7.16:compile
> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.16:compile
> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:provided
> [INFO] |  +- org.xerial.snappy:snappy-java:jar:1.1.2.6:compile
> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.3.0:compile
> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:provided
> [INFO] |  +- commons-net:commons-net:jar:2.2:provided
>
> [INFO] +- org.apache.spark:spark-sql_2.11:jar:2.1.0:provided
> [INFO] |  +- com.univocity:univocity-parsers:jar:2.2.1:provided
> [INFO] |  +- org.apache.spark:spark-sketch_2.11:jar:2.1.0:provided
> [INFO] |  \- org.apache.spark:spark-catalyst_2.11:jar:2.1.0:provided
> [INFO] | +- org.codehaus.janino:janino:jar:3.0.0:provided
> [INFO] | +- org.codehaus.janino:commons-compiler:jar:3.0.0:provided
> [INFO] | \- org.antlr:antlr4-runtime:jar:4.5.3:provided
> [INFO] +- com.databricks:spark-avro_2.11:jar:4.0.0:provided
> [INFO] +- org.apache.spark:spark-streaming_2.11:jar:2.1.0:compile
> [INFO] +- org.apache.spark:spark-streaming-kafka-0-8_2.11:jar:2.1.0:compile
> [INFO] |  \- org.apache.kafka:kafka_2.11:jar:0.8.2.1:compile
> [INFO] | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.2:compile
> [INFO] | +- 
> org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.2:compile
> [INFO] | \- org.apache.kafka:kafka-clients:jar:0.8.2.1:compile
> [INFO] +- io.dropwizard.metrics:metrics-core:jar:4.0.2:compile
> [INFO] +- org.antlr:stringtemplate:jar:4.0.2:compile
> [INFO] |  \- org.antlr:antlr-runtime:jar:3.3:compile
> [INFO] +- com.beust:jcommander:jar:1.72:compile
> [INFO] +- com.twitter:bijection-avro_2.11:jar:0.9.2:compile
> [INFO] |  \- com.twitter:bijection-core_2.11:jar:0.9.2:compile
> [INFO] +- io.confluent:kafka-avro-serializer:jar:3.0.0:compile
> [INFO] +- io.confluent:common-config:jar:3.0.0:compile
> [INFO] +- io.confluent:common-utils:jar:3.0.0:compile
> [INFO] |  \- com.101tec:zkclient:jar:0.5:compile
> [INFO] +- 

[jira] [Updated] (HUDI-550) Add to Release Notes : Configuration Value change for Kafka Reset Offset Strategies

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-550:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Add to Release Notes : Configuration Value change for Kafka Reset Offset 
> Strategies
> ---
>
> Key: HUDI-550
> URL: https://issues.apache.org/jira/browse/HUDI-550
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.2
>
>
> Enum values are changed for configuring Kafka reset offset strategies in 
> DeltaStreamer:
>    LARGEST -> LATEST
>   SMALLEST -> EARLIEST
>  
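> A tiny compatibility sketch showing how legacy values could be normalized to 
> the new names (illustrative helper, not Hudi code; LATEST/EARLIEST mirror 
> Kafka's auto.offset.reset values):
> {code:java}
> import java.util.Locale;
> 
> public class KafkaResetOffsetCompat {
>   static String normalize(String configured) {
>     switch (configured.toUpperCase(Locale.ROOT)) {
>       case "LARGEST":  return "LATEST";    // old enum name -> new enum name
>       case "SMALLEST": return "EARLIEST";
>       default:         return configured.toUpperCase(Locale.ROOT);
>     }
>   }
> 
>   public static void main(String[] args) {
>     System.out.println(normalize("largest"));   // LATEST
>     System.out.println(normalize("smallest"));  // EARLIEST
>   }
> }
> {code}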



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-377) Add Delete() support to HoodieDeltaStreamer

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-377:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Add Delete() support to HoodieDeltaStreamer
> ---
>
> Key: HUDI-377
> URL: https://issues.apache.org/jira/browse/HUDI-377
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>   Original Estimate: 72h
>  Time Spent: 20m
>  Remaining Estimate: 71h 40m
>
> Add Delete() support to HoodieDeltaStreamer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-519) Document the need for Avro dependency shading/relocation for custom payloads, need for spark-avro

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-519:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Document the need for Avro dependency shading/relocation for custom payloads, 
> need for spark-avro
> -
>
> Key: HUDI-519
> URL: https://issues.apache.org/jira/browse/HUDI-519
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, Usability
>Reporter: Udit Mehrotra
>Priority: Major
> Fix For: 0.5.2
>
>
> In [https://github.com/apache/incubator-hudi/pull/1005] we are migrating Hudi 
> to Spark 2.4.4. As part of this migration, we also had to migrate Hudi to use 
> Avro 1.8.2 (required by Spark), while Hive still uses an older version of Avro.
> This has resulted in the need to shade Avro in *hadoop-mr-bundle*, which has 
> implications for users of Hudi who implement custom record payloads: they 
> would have to start shading Avro in their custom jars, similar to how it is 
> shaded in *hadoop-mr-bundle*.
> This Jira is to track documenting this caveat in the release notes and, if 
> needed, in other places like the website.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-15) Add a delete() API to HoodieWriteClient as well as Spark datasource #531

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-15:
--
Fix Version/s: (was: 0.5.1)
   0.5.2

> Add a delete() API to HoodieWriteClient as well as Spark datasource #531
> 
>
> Key: HUDI-15
> URL: https://issues.apache.org/jira/browse/HUDI-15
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration, Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.5.2
>
>
> The delete API needs to be supported as a first-class citizen via 
> DeltaStreamer, WriteClient and datasources. Currently there are two ways to 
> delete, soft deletes and hard deletes - 
> https://hudi.apache.org/writing_data.html#deletes. We need to ensure that for 
> hard deletes we are able to leverage EmptyHoodieRecordPayload with just the 
> HoodieKey and an empty record value, as sketched below.
> [https://github.com/uber/hudi/issues/531]
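> A minimal sketch of the hard-delete path (assumes hudi-common's model 
> classes; the exact packages and the no-arg EmptyHoodieRecordPayload 
> constructor are assumptions):
> {code:java}
> import org.apache.hudi.common.model.EmptyHoodieRecordPayload;  // package assumed
> import org.apache.hudi.common.model.HoodieKey;
> import org.apache.hudi.common.model.HoodieRecord;
> 
> public class HardDeleteSketch {
>   // A delete record carries only the key; the payload resolves to empty,
>   // which causes the record to be dropped on merge.
>   static HoodieRecord<EmptyHoodieRecordPayload> deleteRecordFor(String recordKey, String partitionPath) {
>     return new HoodieRecord<>(new HoodieKey(recordKey, partitionPath), new EmptyHoodieRecordPayload());
>   }
> 
>   public static void main(String[] args) {
>     System.out.println(deleteRecordFor("uuid-123", "2020/02/02").getKey());
>     // With a write client (sketch):
>     // writeClient.upsert(jsc.parallelize(singletonList(record)), writeClient.startCommit());
>   }
> }
> {code}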



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-493) Add docs for delete support in Hudi client apis

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-493:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Add docs for delete support in Hudi client apis
> ---
>
> Key: HUDI-493
> URL: https://issues.apache.org/jira/browse/HUDI-493
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.5.2
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-80) Incrementalize cleaning based on timeline metadata

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-80?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-80:
--
Fix Version/s: (was: 0.5.1)
   0.5.2

> Incrementalize cleaning based on timeline metadata
> --
>
> Key: HUDI-80
> URL: https://issues.apache.org/jira/browse/HUDI-80
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, cleaning lists all partitions once and then picks the file groups 
> to clean from DFS. This is partly due to support for retaining the last x 
> versions of a file group as well (in addition to the default mode of retaining 
> the last x commits). This can be expensive in some cases. See 
> [https://github.com/apache/incubator-hudi/issues/613] for an issue reported. 
>  
> This task tracks work to (a sketch follows the list): 
>  * Determine if we can get rid of the last-X-versions cleaning mode 
>  * Implement cleaning based on file metadata in the Hudi timeline itself 
>  * Reduce the resulting RPC calls to DFS to O(number of file groups 
> cleaned)/O(number of partitions touched in the last X commits) 
>  
> HUDI-1 implements a timeline service for writing that promotes caching of 
> file system metadata. This can be implemented on top of that. 
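> A sketch of the incremental derivation (hypothetical types - CommitMeta is a 
> stand-in for parsed commit metadata, not a Hudi class):
> {code:java}
> import java.util.HashSet;
> import java.util.List;
> import java.util.Set;
> 
> class IncrementalCleanSketch {
>   interface CommitMeta { Set<String> partitionsWritten(); }
> 
>   // Only partitions touched since the last clean need inspecting:
>   // O(partitions touched in last X commits) instead of O(all partitions).
>   static Set<String> partitionsToClean(List<CommitMeta> commitsSinceLastClean) {
>     Set<String> partitions = new HashSet<>();
>     for (CommitMeta commit : commitsSinceLastClean) {
>       partitions.addAll(commit.partitionsWritten());
>     }
>     return partitions;
>   }
> }
> {code}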



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-415:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -
>
> Key: HUDI-415
> URL: https://issues.apache.org/jira/browse/HUDI-415
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi records the commit time after the first action completes. If there is a 
> heavy transformation before isEmpty(), then the commit time can be 
> inaccurate.
> {code:java}
> if (hoodieRecords.isEmpty()) { 
> log.info("new batch has no new records, skipping...") 
> return (true, common.util.Option.empty()) 
> } 
> commitTime = client.startCommit() 
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, 
> commitTime, operation)
> {code}
> For example, if I start the Spark job at 20190101 but *isEmpty()* runs for 2 
> hours, then the commit time in the .hoodie folder will be 201901010200. If I 
> then use that commit time to ingest data starting from 201901010200 (from 
> HDFS, not using DeltaStreamer), I will miss 2 hours of data.
> Is this setup intended? Can we move the commit time before isEmpty()? One 
> possible reordering is sketched below.
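> A minimal sketch of that reordering with stub types (illustrative only - the 
> real writer is Scala, and whether rolling back the unused instant is 
> acceptable is an open design question):
> {code:java}
> class CommitOrderingSketch {
>   interface WriteClient { String startCommit(); void rollback(String instant); }
> 
>   static String writeBatch(WriteClient client, boolean batchIsEmpty) {
>     String commitTime = client.startCommit();   // taken at job start, before isEmpty()
>     if (batchIsEmpty) {
>       client.rollback(commitTime);              // discard the now-unused instant
>       return null;
>     }
>     // ... perform the actual write against commitTime ...
>     return commitTime;
>   }
> }
> {code}
> The tradeoff is that an empty batch now costs an instant creation plus a 
> rollback, in exchange for a commit time that reflects the job start.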



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-389) Updates sent to diff partition for a given key with Global Index

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-389:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Updates sent to diff partition for a given key with Global Index 
> -
>
> Key: HUDI-389
> URL: https://issues.apache.org/jira/browse/HUDI-389
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Index
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>   Original Estimate: 48h
>  Time Spent: 20m
>  Remaining Estimate: 47h 40m
>
> Updates sent to a different partition for a given key with the Global Index 
> should succeed by updating the record under the original partition. As of 
> now, it throws an exception. 
> [https://github.com/apache/incubator-hudi/issues/1021] 
>  
>  
> error log:
> {code:java}
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.timeline.HoodieActiveTimeline - Loaded instants 
> java.util.stream.ReferencePipeline$Head@d02b1c7
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Building file 
> system view for partition (2016/04/15)
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - #files found 
> in partition (2016/04/15) =0, Time taken =0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - 
> addFilesToView: NumFiles=0, FileGroupsCreationTime=0, StoreTimeTaken=0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.HoodieTableFileSystemView - Adding 
> file-groups for partition :2016/04/15, #FileGroups=0
>  14738 [Executor task launch worker-0] INFO 
> com.uber.hoodie.common.table.view.AbstractTableFileSystemView - Time to load 
> partition (2016/04/15) =0
>  14754 [Executor task launch worker-0] ERROR 
> com.uber.hoodie.table.HoodieCopyOnWriteTable - Error upserting bucketType 
> UPDATE for partition :0
>  java.util.NoSuchElementException: No value present
>  at com.uber.hoodie.common.util.Option.get(Option.java:112)
>  at com.uber.hoodie.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:71)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
>  at 
> com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
>  at 
> com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:442)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:843)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

[jira] [Updated] (HUDI-308) Avoid Renames for tracking state transitions of all actions on dataset

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-308:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Avoid Renames for tracking state transitions of all actions on dataset
> --
>
> Key: HUDI-308
> URL: https://issues.apache.org/jira/browse/HUDI-308
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
> Attachments: IMG_0118.jpg
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, we employ renames when transitioning states (REQUESTED, INFLIGHT, 
> COMPLETED) of all actions in Hudi. 
> The idea is to always create new files pertaining to each state of an action 
> (commit, compaction, clean, ...) being performed, and have the timeline 
> management resolve conflicts when loading them from the .hoodie folder. The 
> archiving logic will clean up transient state files and archive terminal 
> state files. 
> This handling will be done consistently for all kinds of actions on datasets. 
> As part of this project, we will clean up unnecessary fields in metadata, 
> version them, and standardize on avro/json.
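> A sketch of the no-rename idea (the file naming below is hypothetical; 
> actual Hudi timeline file names differ per action):
> {code:java}
> class TimelineFileSketch {
>   // One immutable file per (instant, action, state) instead of renaming a
>   // single file through its states, e.g. "20200202153000.commit.requested"
>   // then "20200202153000.commit.inflight" then "20200202153000.commit".
>   static String stateFile(String instantTime, String action, String state) {
>     return "COMPLETED".equals(state)
>         ? instantTime + "." + action
>         : instantTime + "." + action + "." + state.toLowerCase(java.util.Locale.ROOT);
>   }
> }
> {code}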



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-343) Create a DOAP File for Hudi

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-343:
---
Fix Version/s: (was: 0.5.1)
   0.5.2

> Create a DOAP File for Hudi
> ---
>
> Key: HUDI-343
> URL: https://issues.apache.org/jira/browse/HUDI-343
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> But please create a DOAP file for Hudi, where you can also list the
> release: https://projects.apache.org/create.html
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-91) Replace Databricks spark-avro with native spark-avro #628

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-91?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-91:
--
Fix Version/s: (was: 0.5.1)
   0.5.2

> Replace Databricks spark-avro with native spark-avro #628
> -
>
> Key: HUDI-91
> URL: https://issues.apache.org/jira/browse/HUDI-91
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration, Usability
>Reporter: Vinoth Chandar
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/628] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-581) NOTICE needs more work as it is missing content from included 3rd party ALv2 licensed NOTICE files

2020-02-02 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned HUDI-581:
--

Assignee: Suneel Marthi

> NOTICE needs more work as it is missing content from included 3rd party ALv2 
> licensed NOTICE files
> --
>
> Key: HUDI-581
> URL: https://issues.apache.org/jira/browse/HUDI-581
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: Suneel Marthi
>Priority: Major
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> We should get it fixed before the next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] leesf commented on issue #1293: [HUDI-585] Optimize the steps of building with scala-2.12

2020-02-02 Thread GitBox
leesf commented on issue #1293: [HUDI-585] Optimize the steps of building with 
scala-2.12
URL: https://github.com/apache/incubator-hudi/pull/1293#issuecomment-581142414
 
 
   Thanks for the update, will check it again.




[incubator-hudi] branch master updated: [MINOR] Updated DOAP with 0.5.1 release (#1301)

2020-02-02 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new fcf9e4a  [MINOR] Updated DOAP with 0.5.1 release (#1301)
fcf9e4a is described below

commit fcf9e4aded13a1a6306d6cbbc26d7f71ecbf08a9
Author: Suneel Marthi 
AuthorDate: Sun Feb 2 15:41:30 2020 +0100

[MINOR] Updated DOAP with 0.5.1 release (#1301)
---
 doap_HUDI.rdf | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 29baa24..c33d201 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -42,7 +42,7 @@
 0.5.0
   
   
-Apache Hudi-incubating 0.5.0
+Apache Hudi-incubating 0.5.1
 2020-01-31
 0.5.1
   



[GitHub] [incubator-hudi] leesf merged pull request #1301: [MINOR] Updated DOAP with 0.5.1 release

2020-02-02 Thread GitBox
leesf merged pull request #1301: [MINOR] Updated DOAP with 0.5.1 release
URL: https://github.com/apache/incubator-hudi/pull/1301
 
 
   




[GitHub] [incubator-hudi] smarthi commented on issue #1301: [MINOR] Updated DOAP with 0.5.1 release

2020-02-02 Thread GitBox
smarthi commented on issue #1301: [MINOR] Updated DOAP with 0.5.1 release
URL: https://github.com/apache/incubator-hudi/pull/1301#issuecomment-581141535
 
 
   @lamber-ken please review and merge this - this change is HUGE :-)




[GitHub] [incubator-hudi] smarthi opened a new pull request #1301: [MINOR] Updated DOAP with 0.5.1 release

2020-02-02 Thread GitBox
smarthi opened a new pull request #1301: [MINOR] Updated DOAP with 0.5.1 release
URL: https://github.com/apache/incubator-hudi/pull/1301
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] yanghua commented on issue #1297: [HUDI-591] Support Spark version upgrade

2020-02-02 Thread GitBox
yanghua commented on issue #1297: [HUDI-591] Support Spark version upgrade
URL: https://github.com/apache/incubator-hudi/pull/1297#issuecomment-581140896
 
 
   This PR surfaced an issue about a resource leak when using `HiveTestService`; 
for more details please see: https://api.travis-ci.org/v3/job/644375428/log.txt




[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1300: [MINOR] Updated DOAP with 0.5.1 release

2020-02-02 Thread GitBox
smarthi commented on a change in pull request #1300: [MINOR] Updated DOAP with 
0.5.1 release
URL: https://github.com/apache/incubator-hudi/pull/1300#discussion_r373850103
 
 

 ##
 File path: doap_HUDI.rdf
 ##
 @@ -41,6 +41,11 @@
 2019-10-24
 0.5.0
   
+  
+Apache Hudi-incubating 0.5.0
 
 Review comment:
   Damn!!1 yeah u r right - PR again - my bad.




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1300: [MINOR] Updated DOAP with 0.5.1 release

2020-02-02 Thread GitBox
lamber-ken commented on a change in pull request #1300: [MINOR] Updated DOAP 
with 0.5.1 release
URL: https://github.com/apache/incubator-hudi/pull/1300#discussion_r373849293
 
 

 ##
 File path: doap_HUDI.rdf
 ##
 @@ -41,6 +41,11 @@
 2019-10-24
 0.5.0
   
+  
+Apache Hudi-incubating 0.5.0
 
 Review comment:
   hello @smarthi, it is `Apache Hudi-incubating 0.5.1`.




[GitHub] [incubator-hudi] smarthi merged pull request #1300: [MINOR] Updated DOAP with 0.5.1 release

2020-02-02 Thread GitBox
smarthi merged pull request #1300: [MINOR] Updated DOAP with 0.5.1 release
URL: https://github.com/apache/incubator-hudi/pull/1300
 
 
   




[GitHub] [incubator-hudi] smarthi commented on issue #1300: [MINOR] Updated DOAP with 0.5.1 release

2020-02-02 Thread GitBox
smarthi commented on issue #1300: [MINOR] Updated DOAP with 0.5.1 release
URL: https://github.com/apache/incubator-hudi/pull/1300#issuecomment-581139511
 
 
   Merging this without review - very trivial PR




[incubator-hudi] branch master updated: [MINOR] Updated DOAP with 0.5.1 release (#1300)

2020-02-02 Thread smarthi
This is an automated email from the ASF dual-hosted git repository.

smarthi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0026234  [MINOR] Updated DOAP with 0.5.1 release (#1300)
0026234 is described below

commit 00262340115986676fef8bbda0b8f08000c06442
Author: Suneel Marthi 
AuthorDate: Sun Feb 2 15:13:24 2020 +0100

[MINOR] Updated DOAP with 0.5.1 release (#1300)
---
 doap_HUDI.rdf | 5 +
 1 file changed, 5 insertions(+)

diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 7df689b..29baa24 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -41,6 +41,11 @@
 2019-10-24
 0.5.0
   
+  
+Apache Hudi-incubating 0.5.0
+2020-01-31
+0.5.1
+  
 
 
   



[GitHub] [incubator-hudi] smarthi opened a new pull request #1300: [MINOR] Updated DOAP with 0.5.1 release

2020-02-02 Thread GitBox
smarthi opened a new pull request #1300: [MINOR] Updated DOAP with 0.5.1 release
URL: https://github.com/apache/incubator-hudi/pull/1300
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[incubator-hudi] branch asf-site updated: [MINOR] Add padding to code area in release page (#1296)

2020-02-02 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 61fc206  [MINOR] Add padding to code area in release page (#1296)
61fc206 is described below

commit 61fc206cddaab63efc8e04355f7dfaf88cca4724
Author: lamber-ken 
AuthorDate: Sun Feb 2 21:51:25 2020 +0800

[MINOR] Add padding to code area in release page (#1296)
---
 docs/_docs/1_1_quick_start_guide.md   |  3 +-
 docs/_pages/releases.md   | 54 +--
 docs/_sass/hudi_style/_variables.scss |  2 +-
 3 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/docs/_docs/1_1_quick_start_guide.md 
b/docs/_docs/1_1_quick_start_guide.md
index 4a9c1b3..256e560 100644
--- a/docs/_docs/1_1_quick_start_guide.md
+++ b/docs/_docs/1_1_quick_start_guide.md
@@ -16,7 +16,8 @@ Hudi works with Spark-2.x versions. You can follow 
instructions [here](https://s
 From the extracted directory run spark-shell with Hudi as:
 
 ```scala
-spark-2.4.4-bin-hadoop2.7/bin/spark-shell --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
+spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
+--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 ```
 
diff --git a/docs/_pages/releases.md b/docs/_pages/releases.md
index 8797e84..a27f555 100644
--- a/docs/_pages/releases.md
+++ b/docs/_pages/releases.md
@@ -13,36 +13,36 @@ last_modified_at: 2019-12-30T15:59:57-04:00
  * Apache Hudi (incubating) jars corresponding to this release is available 
[here](https://repository.apache.org/#nexus-search;quick~hudi)
 
 ### Release Highlights
-* Dependency Version Upgrades
-* Upgrade from Spark 2.1.0 to Spark 2.4.4
-* Upgrade from Avro 1.7.7 to Avro 1.8.2
-* Upgrade from Parquet 1.8.1 to Parquet 1.10.1
-* Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating 
spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
-* **IMPORTANT** This version requires your runtime spark version to be 
upgraded to 2.4+.
-* Hudi now supports both Scala 2.11 and Scala 2.12, please refer to [Build 
with Scala 2.12](https://github.com/apache/incubator-hudi#build-with-scala-212) 
to build with Scala 2.12.
-Also, the packages hudi-spark, hudi-utilities, hudi-spark-bundle and 
hudi-utilities-bundle are changed correspondingly to 
hudi-spark_{scala_version}, hudi-spark_{scala_version}, 
hudi-utilities_{scala_version}, hudi-spark-bundle_{scala_version} and 
hudi-utilities-bundle_{scala_version}.
-Note that scala_version here is one of (2.11, 2.12).
-* With 0.5.1, we added functionality to stop using renames for Hudi timeline 
metadata operations. This feature is automatically enabled for newly created 
Hudi tables. For existing tables, this feature is turned off by default. Please 
read this [section](https://hudi.apache.org/docs/deployment.html#upgrading), 
before enabling this feature for existing hudi tables.
-To enable the new hudi timeline layout which avoids renames, use the write 
config "hoodie.timeline.layout.version=1". Alternatively, you can use "repair 
overwrite-hoodie-props" to append the line "hoodie.timeline.layout.version=1" 
to hoodie.properties. Note that in any case, upgrade hudi readers (query 
engines) first with 0.5.1-incubating release before upgrading writer.
-* CLI supports `repair overwrite-hoodie-props` to overwrite the table's 
hoodie.properties with specified file, for one-time updates to table name or 
even enabling the new timeline layout above. Note that few queries may 
temporarily fail while the overwrite happens (few milliseconds).
-* DeltaStreamer CLI parameter for capturing table type is changed from 
--storage-type to --table-type. Refer to 
[wiki](https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture)
 with more latest terminologies.
-* Configuration Value change for Kafka Reset Offset Strategies. Enum values 
are changed from LARGEST to LATEST, SMALLEST to EARLIEST for configuring Kafka 
reset offset strategies with configuration(auto.offset.reset) in deltastreamer.
-* When using spark-shell to give a quick peek at Hudi, please provide 
`--packages org.apache.spark:spark-avro_2.11:2.4.4`, more details would refer 
to [latest quickstart docs](https://hudi.apache.org/docs/quick-start-guide.html)
-* Key generator moved to separate package under org.apache.hudi.keygen. If you 
are using overridden key generator classes (configuration 
("hoodie.datasource.write.keygenerator.class")) that comes with hudi package, 
please ensure the fully qualified class name is changed accordingly.
-* Hive Sync tool will register RO tables for MOR with a _ro suffix, so query 
with _ro suffix. You would use 

[GitHub] [incubator-hudi] leesf merged pull request #1296: [MINOR] Add padding to code area in release page

2020-02-02 Thread GitBox
leesf merged pull request #1296: [MINOR] Add padding to code area in release 
page
URL: https://github.com/apache/incubator-hudi/pull/1296
 
 
   




[GitHub] [incubator-hudi] lamber-ken commented on issue #1296: [MINOR] Add padding to code area in release page

2020-02-02 Thread GitBox
lamber-ken commented on issue #1296: [MINOR] Add padding to code area in 
release page
URL: https://github.com/apache/incubator-hudi/pull/1296#issuecomment-581136577
 
 
   Welcome @leesf, all reviewed comments are addressed and fixed.




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1296: [MINOR] Add padding to code area in release page

2020-02-02 Thread GitBox
lamber-ken commented on a change in pull request #1296: [MINOR] Add padding to 
code area in release page
URL: https://github.com/apache/incubator-hudi/pull/1296#discussion_r373846646
 
 

 ##
 File path: docs/_pages/releases.md
 ##
 @@ -13,36 +13,36 @@ last_modified_at: 2019-12-30T15:59:57-04:00
  * Apache Hudi (incubating) jars corresponding to this release is available 
[here](https://repository.apache.org/#nexus-search;quick~hudi)
 
 ### Release Highlights
-* Dependency Version Upgrades
-* Upgrade from Spark 2.1.0 to Spark 2.4.4
-* Upgrade from Avro 1.7.7 to Avro 1.8.2
-* Upgrade from Parquet 1.8.1 to Parquet 1.10.1
-* Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating 
spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
-* **IMPORTANT** This version requires your runtime spark version to be 
upgraded to 2.4+.
-* Hudi now supports both Scala 2.11 and Scala 2.12, please refer to [Build 
with Scala 2.12](https://github.com/apache/incubator-hudi#build-with-scala-212) 
to build with Scala 2.12.
-Also, the packages hudi-spark, hudi-utilities, hudi-spark-bundle and 
hudi-utilities-bundle are changed correspondingly to 
hudi-spark_{scala_version}, hudi-spark_{scala_version}, 
hudi-utilities_{scala_version}, hudi-spark-bundle_{scala_version} and 
hudi-utilities-bundle_{scala_version}.
-Note that scala_version here is one of (2.11, 2.12).
-* With 0.5.1, we added functionality to stop using renames for Hudi timeline 
metadata operations. This feature is automatically enabled for newly created 
Hudi tables. For existing tables, this feature is turned off by default. Please 
read this [section](https://hudi.apache.org/docs/deployment.html#upgrading), 
before enabling this feature for existing hudi tables.
-To enable the new hudi timeline layout which avoids renames, use the write 
config "hoodie.timeline.layout.version=1". Alternatively, you can use "repair 
overwrite-hoodie-props" to append the line "hoodie.timeline.layout.version=1" 
to hoodie.properties. Note that in any case, upgrade hudi readers (query 
engines) first with 0.5.1-incubating release before upgrading writer.
-* CLI supports `repair overwrite-hoodie-props` to overwrite the table's 
hoodie.properties with specified file, for one-time updates to table name or 
even enabling the new timeline layout above. Note that few queries may 
temporarily fail while the overwrite happens (few milliseconds).
-* DeltaStreamer CLI parameter for capturing table type is changed from 
--storage-type to --table-type. Refer to 
[wiki](https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture)
 with more latest terminologies.
-* Configuration Value change for Kafka Reset Offset Strategies. Enum values 
are changed from LARGEST to LATEST, SMALLEST to EARLIEST for configuring Kafka 
reset offset strategies with configuration(auto.offset.reset) in deltastreamer.
-* When using spark-shell to give a quick peek at Hudi, please provide 
`--packages org.apache.spark:spark-avro_2.11:2.4.4`, more details would refer 
to [latest quickstart docs](https://hudi.apache.org/docs/quick-start-guide.html)
-* Key generator moved to separate package under org.apache.hudi.keygen. If you 
are using overridden key generator classes (configuration 
("hoodie.datasource.write.keygenerator.class")) that comes with hudi package, 
please ensure the fully qualified class name is changed accordingly.
-* Hive Sync tool will register RO tables for MOR with a _ro suffix, so query 
with _ro suffix. You would use `--skip-ro-suffix` in sync config in sync config 
to retain the old naming without the _ro suffix.
-* With 0.5.1, hudi-hadoop-mr-bundle which is used by query engines such as 
presto and hive includes shaded avro package to support hudi real time queries 
through these engines. Hudi supports pluggable logic for merging of records. 
Users provide their own implementation of 
[HoodieRecordPayload](https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java).
-If you are using this feature, you need to relocate the avro dependencies in 
your custom record payload class to be consistent with internal hudi shading. 
You need to add the following relocation when shading the package containing 
the record payload implementation.
-
- ```xml
-
-org.apache.avro.
-org.apache.hudi.org.apache.avro.
-
- ```
+ * Dependency Version Upgrades
+   - Upgrade from Spark 2.1.0 to Spark 2.4.4
+   - Upgrade from Avro 1.7.7 to Avro 1.8.2
+   - Upgrade from Parquet 1.8.1 to Parquet 1.10.1
+   - Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating 
spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
+ * **IMPORTANT** This version requires your runtime spark version to be 
upgraded to 2.4+.
+ * Hudi now supports both Scala 2.11 and Scala 2.12, please refer to [Build 
with Scala 

[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1296: [MINOR] Add padding to code area in release page

2020-02-02 Thread GitBox
lamber-ken commented on a change in pull request #1296: [MINOR] Add padding to 
code area in release page
URL: https://github.com/apache/incubator-hudi/pull/1296#discussion_r373846179
 
 

 ##
 File path: docs/_sass/hudi_style/_variables.scss
 ##
 @@ -54,7 +54,7 @@ $light-gray: mix(#fff, $gray, 50%) !default;
 $lighter-gray: mix(#fff, $gray, 90%) !default;
 
 $background-color: #fff !default;
-$code-background-color: #fafafa !default;
+$code-background-color: #f3f3f3 !default;
 
 Review comment:
   Darken background color.
   
   
![image](https://user-images.githubusercontent.com/20113411/73608913-484be100-4603-11ea-8aed-7cb2cbf88e22.png)
   




[GitHub] [incubator-hudi] lamber-ken commented on issue #1293: [HUDI-585] Optimize the steps of building with scala-2.12

2020-02-02 Thread GitBox
lamber-ken commented on issue #1293: [HUDI-585] Optimize the steps of building 
with scala-2.12
URL: https://github.com/apache/incubator-hudi/pull/1293#issuecomment-581135530
 
 
   Thanks very much for testing this PR @leesf @zhedoubushishi. I updated the 
PR; please use the command below:
   ```
   mvn clean package -DskipTests -DskipITs -Dscala-2.12
   ```
   
![image](https://user-images.githubusercontent.com/20113411/73608841-b0e68e00-4602-11ea-80a1-f54e5987dadf.png)
   
   
![image](https://user-images.githubusercontent.com/20113411/73608845-b5ab4200-4602-11ea-87dc-f07d3f0bc397.png)
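
   As an aside for readers following along: a sketch of the kind of 
property-activated Maven profile that would make a bare `-Dscala-2.12` flag 
work; this is an assumption about the approach, not quoted from the PR.

   ```xml
   <!-- Hypothetical pom.xml profile: activated when -Dscala-2.12 is passed. -->
   <profile>
     <id>scala-2.12</id>
     <activation>
       <property>
         <name>scala-2.12</name>
       </property>
     </activation>
     <properties>
       <scala.version>2.12.10</scala.version>
       <scala.binary.version>2.12</scala.binary.version>
     </properties>
   </profile>
   ```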
   
   
   
   
   




[jira] [Updated] (HUDI-583) cleanup legacy code

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-583:
---
Status: Open  (was: New)

> cleanup legacy code 
> 
>
> Key: HUDI-583
> URL: https://issues.apache.org/jira/browse/HUDI-583
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/incubator-hudi/pull/1237]
>  
>  





[jira] [Resolved] (HUDI-583) cleanup legacy code

2020-02-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-583.

Resolution: Fixed

Fixed via master: 5b7bb142dc6712c41fd8ada208ab3186369431f9

> cleanup legacy code 
> 
>
> Key: HUDI-583
> URL: https://issues.apache.org/jira/browse/HUDI-583
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/incubator-hudi/pull/1237]
>  
>  





[GitHub] [incubator-hudi] leesf merged pull request #1237: [HUDI-583] Code Cleanup, remove redundant code, and other changes

2020-02-02 Thread GitBox
leesf merged pull request #1237: [HUDI-583] Code Cleanup, remove redundant 
code, and other changes
URL: https://github.com/apache/incubator-hudi/pull/1237
 
 
   




[GitHub] [incubator-hudi] leesf commented on issue #1293: [HUDI-585] Optimize the steps of building with scala-2.12

2020-02-02 Thread GitBox
leesf commented on issue #1293: [HUDI-585] Optimize the steps of building with 
scala-2.12
URL: https://github.com/apache/incubator-hudi/pull/1293#issuecomment-581116499
 
 
   Yes @zhedoubushishi, I also got the error after deleting the folder. CC 
@lamber-ken

