[jira] [Commented] (HUDI-1088) hive version 1.1.0 integrated with hudi,select * from hudi_table error in HUE

2020-07-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/HUDI-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157928#comment-17157928
 ] 

海南中剑 commented on HUDI-1088:


[~Trevorzhang], I don't agree with your idea. Hue is a tool to display data. I 
think you can use the Spark or MapReduce engine to query from Hue. 

> hive version 1.1.0 integrated with hudi,select * from hudi_table error in HUE
> -
>
> Key: HUDI-1088
> URL: https://issues.apache.org/jira/browse/HUDI-1088
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
> Environment: Hive version 1.1.0、hudi-0.5.3、Cloudera manager 5.14.4
>Reporter: wangmeng
>Priority: Major
>
> * Statement executed in Hue: select * from hudi_table where
>  * Input format: set 
> hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
>  * Exception information:
> Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1457)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1445)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1444)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1444)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>  at scala.Option.foreach(Option.scala:236)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1668)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1627)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1616)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>  Caused by: java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:154)
>  at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>  at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>  at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
>  at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
>  at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
>  at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2022)
>  at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2022)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>  at org.apache.spark.scheduler.Task.run(Task.scala:89)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.lang.NullPointerException
>  at 
> org.apache.hadoop.hive.ql.exec.MapOperator.getNominalPath(MapOperator.java:392)
>  at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:446)
>  at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1051)
>  at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:490)
>  at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:141)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] Raghvendradubey commented on issue #1694: Slow Write into Hudi Dataset(MOR)

2020-07-14 Thread GitBox


Raghvendradubey commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-658571381


   Thanks @vinothchandar for clarifications, will try GLOBAL_SIMPLE.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1098) Marker file finalizing may block on a data file that was never written

2020-07-14 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-1098:


 Summary: Marker file finalizing may block on a data file that was 
never written
 Key: HUDI-1098
 URL: https://issues.apache.org/jira/browse/HUDI-1098
 Project: Apache Hudi
  Issue Type: Bug
  Components: Writer Core
Reporter: Vinoth Chandar
 Fix For: 0.6.0


{code:java}
// Ensure all files in delete list is actually present. This is mandatory for an
// eventually consistent FS. Otherwise, we may miss deleting such files.
// If files are not found even after retries, fail the commit.
if (consistencyCheckEnabled) {
  // This will either ensure all files to be deleted are present.
  waitForAllFiles(jsc, groupByPartition, FileVisibility.APPEAR);
}
{code}

We need to handle the case where the marker file was created, but the writer crashed 
before the data file was created. 
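
One possible direction, sketched below with illustrative helper and parameter names (this is not the actual Hudi patch): bound the visibility check per candidate path, and treat a file that never appears as a marker whose data file was never written, leaving it out of the delete list instead of blocking or failing the whole commit.

{code:java}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MarkerReconcileSketch {

  /**
   * Keep only candidate paths (derived from marker files) whose data file can actually be
   * observed within a bounded number of checks. Paths that never show up are assumed to be
   * "marker written, data file never created" and are excluded from the delete list.
   */
  static List<Path> filterExistingDataFiles(FileSystem fs, List<Path> candidates,
                                            int maxChecks, long waitMs) throws IOException {
    List<Path> existing = new ArrayList<>();
    for (Path candidate : candidates) {
      for (int attempt = 0; attempt < maxChecks; attempt++) {
        if (fs.exists(candidate)) {
          existing.add(candidate);   // visible: safe to include in the delete list
          break;
        }
        try {
          Thread.sleep(waitMs);      // give an eventually consistent FS time to catch up
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while waiting for file visibility", ie);
        }
      }
      // If the file never appeared, treat it as never written and do not block the commit on it.
    }
    return existing;
  }
}
{code}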



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1097) Integration test for prestosql queries

2020-07-14 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-1097:
---

 Summary: Integration test for prestosql queries
 Key: HUDI-1097
 URL: https://issues.apache.org/jira/browse/HUDI-1097
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Presto Integration
Reporter: Bhavani Sudha
Assignee: Bhavani Sudha
 Fix For: 0.6.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1094) Docker demo integration of Prestosql queries

2020-07-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1094:

Status: Open  (was: New)

> Docker demo integration of Prestosql queries
> 
>
> Key: HUDI-1094
> URL: https://issues.apache.org/jira/browse/HUDI-1094
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1095) Add documentation for prestosql support

2020-07-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1095:

Component/s: Presto Integration

> Add documentation for prestosql support
> ---
>
> Key: HUDI-1095
> URL: https://issues.apache.org/jira/browse/HUDI-1095
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1093) Add support for COW tables from Prestosql

2020-07-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1093:

Status: Open  (was: New)

> Add support for COW tables from Prestosql
> -
>
> Key: HUDI-1093
> URL: https://issues.apache.org/jira/browse/HUDI-1093
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.1
>
>
> This would be a change required in Prestosql. Once available, the end goal is:
> [ ] Hudi Copy-on-Write tables are queryable from Prestosql



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1096) MOR queries support from Prestosql

2020-07-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1096:

Status: Open  (was: New)

> MOR queries support from Prestosql
> --
>
> Key: HUDI-1096
> URL: https://issues.apache.org/jira/browse/HUDI-1096
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1095) Add documentation for prestosql support

2020-07-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1095:

Status: Open  (was: New)

> Add documentation for prestosql support
> ---
>
> Key: HUDI-1095
> URL: https://issues.apache.org/jira/browse/HUDI-1095
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1094) Docker demo integration of Prestosql queries

2020-07-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1094:

Component/s: Presto Integration

> Docker demo integration of Prestosql queries
> 
>
> Key: HUDI-1094
> URL: https://issues.apache.org/jira/browse/HUDI-1094
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1096) MOR queries support from Prestosql

2020-07-14 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-1096:
---

 Summary: MOR queries support from Prestosql
 Key: HUDI-1096
 URL: https://issues.apache.org/jira/browse/HUDI-1096
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Presto Integration
Reporter: Bhavani Sudha
Assignee: Bhavani Sudha
 Fix For: 0.6.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1095) Add documentation for prestosql support

2020-07-14 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-1095:
---

 Summary: Add documentation for prestosql support
 Key: HUDI-1095
 URL: https://issues.apache.org/jira/browse/HUDI-1095
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Bhavani Sudha
Assignee: Bhavani Sudha
 Fix For: 0.6.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1094) Docker demo integration of Prestosql queries

2020-07-14 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-1094:
---

 Summary: Docker demo integration of Prestosql queries
 Key: HUDI-1094
 URL: https://issues.apache.org/jira/browse/HUDI-1094
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Bhavani Sudha
Assignee: Bhavani Sudha
 Fix For: 0.6.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1093) Add support for COW tables from Prestosql

2020-07-14 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-1093:
---

 Summary: Add support for COW tables from Prestosql
 Key: HUDI-1093
 URL: https://issues.apache.org/jira/browse/HUDI-1093
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Presto Integration
Reporter: Bhavani Sudha
Assignee: Bhavani Sudha
 Fix For: 0.6.1


This would be a change required in Prestosql. Once available, the end goal is:

[ ] Hudi Copy-on-Write tables are queryable from Prestosql



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1092) Hudi support from prestosql

2020-07-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1092:

Status: Open  (was: New)

> Hudi support from prestosql
> ---
>
> Key: HUDI-1092
> URL: https://issues.apache.org/jira/browse/HUDI-1092
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Presto Integration
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: features
> Fix For: 0.6.1
>
>
> This is an umbrella ticket to add support for querying Hudi tables from 
> Prestosql



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1092) Hudi support from prestosql

2020-07-14 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-1092:
---

 Summary: Hudi support from prestosql
 Key: HUDI-1092
 URL: https://issues.apache.org/jira/browse/HUDI-1092
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Presto Integration
Reporter: Bhavani Sudha
Assignee: Bhavani Sudha
 Fix For: 0.6.1


This is an umbrella ticket to add support for querying Hudi tables from 
Prestosql



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1092) Hudi support from prestosql

2020-07-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1092:

Issue Type: New Feature  (was: Improvement)

> Hudi support from prestosql
> ---
>
> Key: HUDI-1092
> URL: https://issues.apache.org/jira/browse/HUDI-1092
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Presto Integration
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: features
> Fix For: 0.6.1
>
>
> This is an umbrella ticket to add support for querying Hudi tables from 
> Prestosql



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] Mathieu1124 commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-14 Thread GitBox


Mathieu1124 commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-658535590


   > > Hi, @vinothchandar @yanghua @leesf as the refactor is finished, I have 
filed a Jira ticket to track this work,
   > > please review the refactor work on this pr :)
   > 
   > ack. @Mathieu1124 pls check travis failure.
   
   Copy that, I have resolved the CI failure and the conflicts with master, and 
will push it after work.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-14 Thread GitBox


Mathieu1124 commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-658535113


   > @leesf @Mathieu1124 @lw309637554 so this replaces #1727 right?
   
   yes, https://github.com/apache/hudi/pull/1727  can be closed now



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1831: [HUDI-1087] Handle decimal type for realtime record reader with SparkSQL

2020-07-14 Thread GitBox


vinothchandar commented on pull request #1831:
URL: https://github.com/apache/hudi/pull/1831#issuecomment-658532736


   actually nvm.. its a small PR.. LGTM.. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1831: [HUDI-1087] Handle decimal type for realtime record reader with SparkSQL

2020-07-14 Thread GitBox


vinothchandar commented on pull request #1831:
URL: https://github.com/apache/hudi/pull/1831#issuecomment-658532404


   cc @umehrot2 want to review this one?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #339

2020-07-14 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.33 KB...]

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${s

[GitHub] [hudi] garyli1019 commented on a change in pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-14 Thread GitBox


garyli1019 commented on a change in pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#discussion_r454762535



##
File path: 
hudi-spark/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieParquetRealtimeFileFormat.scala
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hudi.hadoop.realtime.HoodieRealtimeFileSplit
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.{FileSplit, JobConf}
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
+import org.apache.parquet.filter2.compat.FilterCompat
+import org.apache.parquet.filter2.predicate.FilterApi
+import 
org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
+import org.apache.parquet.hadoop.{ParquetFileReader, ParquetInputFormat, 
ParquetRecordReader}
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeRow}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+import java.net.URI
+import scala.collection.JavaConverters._
+
+/**
+ * This class is an extension of ParquetFileFormat from Spark SQL.
+ * The file split, record reader, record reader iterator are customized to 
read Hudi MOR table.
+ */
+class HoodieParquetRealtimeFileFormat extends ParquetFileFormat {

Review comment:
   I think this could come with a schema issue: `super.buildReader..()` 
could use a vectorized reader that only reads a few columns, and right now I 
haven't implemented merging `Avro` with `InternalRow` when the schemas differ. Right 
now we can only merge row by row with the same schema.
   We will get rid of this `FileFormat` approach; the RDD approach is cleaner. 
https://github.com/apache/hudi/pull/1702/files#diff-809772c649e85ffb321055d9871e37e0R75





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-14 Thread GitBox


bvaradar commented on pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#issuecomment-658506263


   @garyli1019 @vinothchandar : Yes, I am planning to address the bootstrap PR 
comments and also give review comments for @umehrot2's changes by this weekend. 
@umehrot2 : I know this is not ideal, and apologies for not being able to review 
quickly. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1824: [HUDI-996] Add functional test suite in hudi-client

2020-07-14 Thread GitBox


vinothchandar commented on pull request #1824:
URL: https://github.com/apache/hudi/pull/1824#issuecomment-658506081


   No, all good.. was waiting for you actually :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-14 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-658505880


   Good to get @n3nash 's review here as well to make sure we are not breaking 
anything for the RDD client users.. 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-14 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-658505687


   @leesf @Mathieu1124 @lw309637554 so this replaces #1727 right? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-14 Thread GitBox


xushiyan commented on pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#issuecomment-658505404


   @shenh062326 @nsivabalan got it. yup, making it through the constructor 
looks good. thanks for clarifying.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-14 Thread GitBox


vinothchandar commented on a change in pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#discussion_r454749227



##
File path: 
hudi-spark/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieParquetRealtimeFileFormat.scala
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hudi.hadoop.realtime.HoodieRealtimeFileSplit
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.{FileSplit, JobConf}
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
+import org.apache.parquet.filter2.compat.FilterCompat
+import org.apache.parquet.filter2.predicate.FilterApi
+import 
org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
+import org.apache.parquet.hadoop.{ParquetFileReader, ParquetInputFormat, 
ParquetRecordReader}
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeRow}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+import java.net.URI
+import scala.collection.JavaConverters._
+
+/**
+ * This class is an extension of ParquetFileFormat from Spark SQL.
+ * The file split, record reader, record reader iterator are customized to 
read Hudi MOR table.
+ */
+class HoodieParquetRealtimeFileFormat extends ParquetFileFormat {

Review comment:
   @garyli1019 it should be possible to call `super.buildReader..()` to get 
the `Iterator[InternalRow]` for the parquet base file alone and then use the 
class above to merge, right? 
   That would avoid a ton of code in this file.. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1090) Presto reads Hudi data with misaligned columns

2020-07-14 Thread Hong Shen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157818#comment-17157818
 ] 

Hong Shen commented on HUDI-1090:
-

[~710514878] Please describe the issue in English.

> Presto reads Hudi data with misaligned columns
> 
>
> Key: HUDI-1090
> URL: https://issues.apache.org/jira/browse/HUDI-1090
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: obar
>Priority: Major
> Attachments: 15947189106642.png
>
>
> Presto version: 333
> Hudi version: 0.5.3
> Hive reads the Hudi data correctly
> Presto reads the Hudi data with misaligned columns



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-14 Thread GitBox


nsivabalan commented on pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#issuecomment-658502484


   @xushiyan : nope. As I have mentioned above, we need it in getInsertValue() 
as well, which is called from a lot of classes. Hence I suggested adding it as 
part of the constructor. 
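
   As a rough illustration of the constructor-based approach being discussed (the class and field names below are hypothetical, not the code in this PR), the payload could carry the delete-marker field itself so that getInsertValue() (and, analogously, combineAndGetUpdateValue()) can honor it; only getInsertValue() is sketched here:

   ```java
   import java.io.IOException;

   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
   import org.apache.hudi.common.util.Option;

   // Hypothetical sketch: the delete-marker column name is passed in through the
   // constructor, so the payload can check it wherever the record is materialized.
   public class ConfigurableDeleteMarkerPayload extends OverwriteWithLatestAvroPayload {

     private final String deleteMarkerField;

     public ConfigurableDeleteMarkerPayload(GenericRecord record, Comparable orderingVal,
                                            String deleteMarkerField) {
       super(record, orderingVal);
       this.deleteMarkerField = deleteMarkerField;
     }

     private boolean isDeleted(IndexedRecord record) {
       Object flag = ((GenericRecord) record).get(deleteMarkerField);
       return flag != null && Boolean.parseBoolean(flag.toString());
     }

     @Override
     public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
       Option<IndexedRecord> value = super.getInsertValue(schema);
       if (value.isPresent() && isDeleted(value.get())) {
         return Option.empty();   // marked for deletion: emit nothing
       }
       return value;
     }
   }
   ```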
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-14 Thread GitBox


vinothchandar commented on pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#issuecomment-658501011


   I am fine with doing that.. not sure if that's more work for @umehrot2 .. wdyt?
   
   @bvaradar in general, can we get more of the bootstrap work landed and handle 
the follow-ups separately, vs having these large PRs pending out there this long.. 
We kind of have a traffic gridlock here. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-14 Thread GitBox


shenh062326 commented on pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#issuecomment-658495234


   @xushiyan  It seems good for OverwriteWithLatestAvroPayload. But for 
AWSDmsAvroPayload, users need to define not only the delete column, but also 
how that column is processed. Where should the new method live? If it stays in 
combineAndGetUpdateValue, then combineAndGetUpdateValue contains the 
delete-handling logic and the calling method also has delete logic; wouldn't 
that be duplicated?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-14 Thread GitBox


leesf commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-658490359


   > Hi, @vinothchandar @yanghua @leesf as the refactor is finished, I have 
filed a Jira ticket to track this work,
   > please review the refactor work on this pr :)
   
   ack. @Mathieu1124 pls check travis failure.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread Hong Shen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157341#comment-17157341
 ] 

Hong Shen edited comment on HUDI-1082 at 7/15/20, 1:12 AM:
---

It seems this does not matter. We just need to ensure that records are allocated to 
the file groups in proportion to their weights. The reason is that the hash value 
of the key is large and effectively randomly distributed, so even though a global 
insert is used, the records end up distributed in proportion to the weights.
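
As a very rough illustration of that idea (simplified, hypothetical code rather than the actual UpsertPartitioner logic), mapping a well-distributed key hash onto cumulative bucket weights yields the same proportional split:

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch of weight-proportional bucket assignment.
public class WeightedBucketSketch {

  static int chooseBucket(String recordKey, List<Double> bucketWeights) {
    // Map the (well-distributed) key hash onto a fraction in [0, 1).
    double fraction = Math.floorMod(recordKey.hashCode(), 10_000) / 10_000.0;
    double cumulative = 0.0;
    for (int i = 0; i < bucketWeights.size(); i++) {
      cumulative += bucketWeights.get(i);   // weights are assumed to sum to 1.0
      if (fraction < cumulative) {
        return i;                           // the key falls into this bucket's weight range
      }
    }
    return bucketWeights.size() - 1;        // guard against floating point round-off
  }

  public static void main(String[] args) {
    // Two buckets with weights 0.5 / 0.5, mirroring the two file groups of "2016/09/26".
    List<Double> weights = Arrays.asList(0.5, 0.5);
    int[] counts = new int[2];
    for (int i = 0; i < 200; i++) {
      counts[chooseBucket(UUID.randomUUID().toString(), weights)]++;
    }
    System.out.println(Arrays.toString(counts));   // roughly a 100/100 split
  }
}
{code}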

Add a test case, testGetPartitioner2, with two partitions "2016/09/26" and 
"2016/09/27". Generate 200 insert records for partition "2016/09/26" and 2000 
for "2016/09/27". Since the fileSize limit is 1000 * 1024, partition "2016/09/26" 
will generate 2 file groups and partition "2016/09/27" will generate 20 file 
groups. Of the 200 insert records for partition "2016/09/26", approximately 100 
go to fileGroup1 and the rest to fileGroup2.

 
{code:java|title=TestUpsertPartitioner.java|borderStyle=solid}
@Test
public void testGetPartitioner2() throws Exception {
  String testPartitionPath1 = "2016/09/26";
  String testPartitionPath2 = "2016/09/27";

  HoodieWriteConfig config = makeHoodieClientConfigBuilder()
      .withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(0)
          .insertSplitSize(100).autoTuneInsertSplits(false).build())
      .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1000 * 1024).build())
      .build();

  HoodieClientTestUtils.fakeCommitFile(basePath, "001");

  metaClient = HoodieTableMetaClient.reload(metaClient);
  HoodieCopyOnWriteTable table = (HoodieCopyOnWriteTable) HoodieTable.create(metaClient, config, hadoopConf);
  HoodieTestDataGenerator dataGenerator1 = new HoodieTestDataGenerator(new String[] {testPartitionPath1});
  List<HoodieRecord> insertRecords1 = dataGenerator1.generateInserts("001", 200);
  List<HoodieRecord> records1 = new ArrayList<>();
  records1.addAll(insertRecords1);

  HoodieTestDataGenerator dataGenerator2 = new HoodieTestDataGenerator(new String[] {testPartitionPath2});
  List<HoodieRecord> insertRecords2 = dataGenerator2.generateInserts("001", 2000);
  records1.addAll(insertRecords2);

  WorkloadProfile profile = new WorkloadProfile(jsc.parallelize(records1));
  UpsertPartitioner partitioner = new UpsertPartitioner(profile, jsc, table, config);
  Map<Integer, Integer> partition2numRecords = new HashMap<>();
  for (HoodieRecord hoodieRecord : insertRecords1) {
    int partition = partitioner.getPartition(new Tuple2<>(
        hoodieRecord.getKey(), Option.ofNullable(hoodieRecord.getCurrentLocation())));
    if (!partition2numRecords.containsKey(partition)) {
      partition2numRecords.put(partition, 0);
    }
    int num = partition2numRecords.get(partition);
    partition2numRecords.put(partition, num + 1);
  }
  System.out.println(partition2numRecords);
}

{code}
{{Run 5 times, the outputs are:}}
{code:java}
{20=106, 21=94}

{20=97, 21=103}

{20=95, 21=105}

{20=107, 21=93}

{20=113, 21=87}
 {code}


was (Author: shenhong):
It seems does not matter. We just need to ensure that records are allocated to 
the filegroup in proportion to the weight.

Add a testcase testGetPartitioner2, it has two partitions "2016/09/26" and 
"2016/09/27".  Generate 200 insert records to partition "2016/09/26" and 2000 
insert records to  "2016/09/27", since the fileSize limit is 1000 * 1024, 
partition "2016/09/26" will generate 2 fileGroup, partition "2016/09/27" will 
generate 20 fileGroup. For the 200 insert records to partition "2016/09/26", it 
will has approximately 100 records to fileGroup1, and the others to fileGroup2.

 
{code:java|title=TestUpsertPartitioner.java|borderStyle=solid}
@Test

public void testGetPartitioner2() throws Exception {
 String testPartitionPath1 = "2016/09/26";
 String testPartitionPath2 = "2016/09/27";

 HoodieWriteConfig config = makeHoodieClientConfigBuilder()
 
.withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(0)
 .insertSplitSize(100).autoTuneInsertSplits(false).build())
 .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1000 * 
1024).build()).build();

 HoodieClientTestUtils.fakeCommitFile(basePath, "001");

 metaClient = HoodieTableMetaClient.reload(metaClient);
 HoodieCopyOnWriteTable table = (HoodieCopyOnWriteTable) 
HoodieTable.create(metaClient, config, hadoopConf);
 HoodieTestDataGenerator dataGenerator1 = new HoodieTestDataGenerator(new 
String[] \{testPartitionPath1});
 List insertRecords1 = dataGenerator1.generateInserts("001", 200);
 List records1 = new ArrayList<>();
 records1.addAll(insertRecords1);

 HoodieTestDataGenerator dataGenerator2 = new HoodieTestDataGenerator(new 
String[] \{testPartitionPath2});
 List insertRecords2 = dataGenerator2.generateInserts("001", 
2000);
 records1.addAll(insertRecords2);

 WorkloadProfile profile = new WorkloadProfile(jsc.parallelize(records1));
 UpsertPartitioner partitioner = new UpsertPartit

[jira] [Comment Edited] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread Hong Shen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157341#comment-17157341
 ] 

Hong Shen edited comment on HUDI-1082 at 7/15/20, 1:08 AM:
---

It seems this does not matter. We just need to ensure that records are allocated to 
the file groups in proportion to their weights.

Add a test case, testGetPartitioner2, with two partitions "2016/09/26" and 
"2016/09/27". Generate 200 insert records for partition "2016/09/26" and 2000 
for "2016/09/27". Since the fileSize limit is 1000 * 1024, partition "2016/09/26" 
will generate 2 file groups and partition "2016/09/27" will generate 20 file 
groups. Of the 200 insert records for partition "2016/09/26", approximately 100 
go to fileGroup1 and the rest to fileGroup2.

 
{code:java|title=TestUpsertPartitioner.java|borderStyle=solid}
@Test
public void testGetPartitioner2() throws Exception {
  String testPartitionPath1 = "2016/09/26";
  String testPartitionPath2 = "2016/09/27";

  HoodieWriteConfig config = makeHoodieClientConfigBuilder()
      .withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(0)
          .insertSplitSize(100).autoTuneInsertSplits(false).build())
      .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1000 * 1024).build())
      .build();

  HoodieClientTestUtils.fakeCommitFile(basePath, "001");

  metaClient = HoodieTableMetaClient.reload(metaClient);
  HoodieCopyOnWriteTable table = (HoodieCopyOnWriteTable) HoodieTable.create(metaClient, config, hadoopConf);
  HoodieTestDataGenerator dataGenerator1 = new HoodieTestDataGenerator(new String[] {testPartitionPath1});
  List<HoodieRecord> insertRecords1 = dataGenerator1.generateInserts("001", 200);
  List<HoodieRecord> records1 = new ArrayList<>();
  records1.addAll(insertRecords1);

  HoodieTestDataGenerator dataGenerator2 = new HoodieTestDataGenerator(new String[] {testPartitionPath2});
  List<HoodieRecord> insertRecords2 = dataGenerator2.generateInserts("001", 2000);
  records1.addAll(insertRecords2);

  WorkloadProfile profile = new WorkloadProfile(jsc.parallelize(records1));
  UpsertPartitioner partitioner = new UpsertPartitioner(profile, jsc, table, config);
  Map<Integer, Integer> partition2numRecords = new HashMap<>();
  for (HoodieRecord hoodieRecord : insertRecords1) {
    int partition = partitioner.getPartition(new Tuple2<>(
        hoodieRecord.getKey(), Option.ofNullable(hoodieRecord.getCurrentLocation())));
    if (!partition2numRecords.containsKey(partition)) {
      partition2numRecords.put(partition, 0);
    }
    int num = partition2numRecords.get(partition);
    partition2numRecords.put(partition, num + 1);
  }
  System.out.println(partition2numRecords);
}

{code}
{{Run 5 times, the outputs are:}}
{code:java}
{20=106, 21=94}

{20=97, 21=103}

{20=95, 21=105}

{20=107, 21=93}

{20=113, 21=87}
 {code}


was (Author: shenhong):
It seems does not matter, we just need to ensure the distribution of records to 
fileId.

Add a testcase testGetPartitioner2, it has two partitions "2016/09/26" and 
"2016/09/27".  Generate 200 insert records to partition "2016/09/26" and 2000 
insert records to  "2016/09/27", since the fileSize limit is 1000 * 1024, 
partition "2016/09/26" will generate 2 fileGroup, partition "2016/09/27" will 
generate 20 fileGroup. For the 200 insert records to partition "2016/09/26", it 
will has approximately 100 records to fileGroup1, and the others to fileGroup2.

 
{code:java|title=TestUpsertPartitioner.java|borderStyle=solid}
@Test

public void testGetPartitioner2() throws Exception {
 String testPartitionPath1 = "2016/09/26";
 String testPartitionPath2 = "2016/09/27";

 HoodieWriteConfig config = makeHoodieClientConfigBuilder()
 
.withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(0)
 .insertSplitSize(100).autoTuneInsertSplits(false).build())
 .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1000 * 
1024).build()).build();

 HoodieClientTestUtils.fakeCommitFile(basePath, "001");

 metaClient = HoodieTableMetaClient.reload(metaClient);
 HoodieCopyOnWriteTable table = (HoodieCopyOnWriteTable) 
HoodieTable.create(metaClient, config, hadoopConf);
 HoodieTestDataGenerator dataGenerator1 = new HoodieTestDataGenerator(new 
String[] \{testPartitionPath1});
 List insertRecords1 = dataGenerator1.generateInserts("001", 200);
 List records1 = new ArrayList<>();
 records1.addAll(insertRecords1);

 HoodieTestDataGenerator dataGenerator2 = new HoodieTestDataGenerator(new 
String[] \{testPartitionPath2});
 List insertRecords2 = dataGenerator2.generateInserts("001", 
2000);
 records1.addAll(insertRecords2);

 WorkloadProfile profile = new WorkloadProfile(jsc.parallelize(records1));
 UpsertPartitioner partitioner = new UpsertPartitioner(profile, jsc, table, 
config);
 Map partition2numRecords = new HashMap();
 for (HoodieRecord hoodieRecord: insertRecords1) {
 int partition = partitioner.getPartition(new Tuple2<>(
 hoodieRecord.getKey(), Option.ofN

[hudi] branch master updated: [HUDI-996] Add functional test in hudi-client (#1824)

2020-07-14 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new b399b4a  [HUDI-996] Add functional test in hudi-client (#1824)
b399b4a is described below

commit b399b4ad43b8caa69f129e756e51ad4bc8c81de2
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Tue Jul 14 17:28:50 2020 -0700

[HUDI-996] Add functional test in hudi-client (#1824)

- Add functional test suite in hudi-client
- Tag TestHBaseIndex as functional
---
 hudi-client/pom.xml| 18 +
 ...rovider.java => ClientFunctionalTestSuite.java} | 20 +++---
 .../apache/hudi/index/hbase/TestHBaseIndex.java| 80 ++
 .../hudi/testutils/FunctionalTestHarness.java  | 52 --
 .../testutils/{ => providers}/DFSProvider.java |  2 +-
 .../HoodieMetaClientProvider.java} | 22 +++---
 .../HoodieWriteClientProvider.java}| 16 ++---
 .../testutils/{ => providers}/SparkProvider.java   |  2 +-
 8 files changed, 131 insertions(+), 81 deletions(-)

diff --git a/hudi-client/pom.xml b/hudi-client/pom.xml
index 326cf83..a390923 100644
--- a/hudi-client/pom.xml
+++ b/hudi-client/pom.xml
@@ -268,5 +268,23 @@
       <artifactId>mockito-junit-jupiter</artifactId>
       <scope>test</scope>
     </dependency>
+
+    <dependency>
+      <groupId>org.junit.platform</groupId>
+      <artifactId>junit-platform-runner</artifactId>
+      <scope>test</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.junit.platform</groupId>
+      <artifactId>junit-platform-suite-api</artifactId>
+      <scope>test</scope>
+    </dependency>
+
+    <dependency>
+      <groupId>org.junit.platform</groupId>
+      <artifactId>junit-platform-commons</artifactId>
+      <scope>test</scope>
+    </dependency>
   </dependencies>
 
diff --git 
a/hudi-client/src/test/java/org/apache/hudi/testutils/DFSProvider.java 
b/hudi-client/src/test/java/org/apache/hudi/ClientFunctionalTestSuite.java
similarity index 70%
copy from hudi-client/src/test/java/org/apache/hudi/testutils/DFSProvider.java
copy to hudi-client/src/test/java/org/apache/hudi/ClientFunctionalTestSuite.java
index 16cc471..4e62618 100644
--- a/hudi-client/src/test/java/org/apache/hudi/testutils/DFSProvider.java
+++ b/hudi-client/src/test/java/org/apache/hudi/ClientFunctionalTestSuite.java
@@ -17,18 +17,16 @@
  * under the License.
  */
 
-package org.apache.hudi.testutils;
+package org.apache.hudi;
 
-import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.hdfs.DistributedFileSystem;
-import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.junit.platform.runner.JUnitPlatform;
+import org.junit.platform.suite.api.IncludeTags;
+import org.junit.platform.suite.api.SelectPackages;
+import org.junit.runner.RunWith;
 
-public interface DFSProvider {
-
-  MiniDFSCluster dfsCluster();
-
-  DistributedFileSystem dfs();
-
-  Path dfsBasePath();
+@RunWith(JUnitPlatform.class)
+@SelectPackages("org.apache.hudi.index")
+@IncludeTags("functional")
+public class ClientFunctionalTestSuite {
 
 }
diff --git 
a/hudi-client/src/test/java/org/apache/hudi/index/hbase/TestHBaseIndex.java 
b/hudi-client/src/test/java/org/apache/hudi/index/hbase/TestHBaseIndex.java
index 6b6f11f..b3d2f5a 100644
--- a/hudi-client/src/test/java/org/apache/hudi/index/hbase/TestHBaseIndex.java
+++ b/hudi-client/src/test/java/org/apache/hudi/index/hbase/TestHBaseIndex.java
@@ -31,7 +31,7 @@ import org.apache.hudi.config.HoodieStorageConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.table.HoodieTable;
-import org.apache.hudi.testutils.HoodieClientTestHarness;
+import org.apache.hudi.testutils.FunctionalTestHarness;
 import org.apache.hudi.testutils.HoodieTestDataGenerator;
 
 import org.apache.hadoop.conf.Configuration;
@@ -47,10 +47,10 @@ import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.util.Bytes;
 import org.apache.spark.api.java.JavaRDD;
 import org.junit.jupiter.api.AfterAll;
-import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.MethodOrderer;
+import org.junit.jupiter.api.Tag;
 import org.junit.jupiter.api.Test;
 import org.junit.jupiter.api.TestMethodOrder;
 
@@ -76,12 +76,17 @@ import static org.mockito.Mockito.when;
  * {@link MethodOrderer.Alphanumeric} to make sure the tests run in order. 
Please alter the order of tests running carefully.
  */
 @TestMethodOrder(MethodOrderer.Alphanumeric.class)
-public class TestHBaseIndex extends HoodieClientTestHarness {
+@Tag("functional")
+public class TestHBaseIndex extends FunctionalTestHarness {
 
   private static final String TABLE_NAME = "test_table";
   private static HBaseTestingUtility utility;
   private static Configuration hbaseConfig;
 
+  private Configuration hadoopConf;
+  private HoodieTestDataGenerator dataGen;
+  private HoodieTableMetaClient metaClient;
+
   @AfterAll
   public static void clean() throws Exception {
 if (utility != null) {
@@ -104,21 +109,10 @@ public class TestHBaseIndex 

[GitHub] [hudi] yanghua merged pull request #1824: [HUDI-996] Add functional test suite in hudi-client

2020-07-14 Thread GitBox


yanghua merged pull request #1824:
URL: https://github.com/apache/hudi/pull/1824


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-14 Thread GitBox


garyli1019 commented on a change in pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#discussion_r454694800



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -58,26 +60,28 @@ class DefaultSource extends RelationProvider
   throw new HoodieException("'path' must be specified.")
 }
 
+// Try to create hoodie table meta client from the give path
+// TODO: Smarter path handling
+val metaClient = try {
+  val conf = sqlContext.sparkContext.hadoopConfiguration
+  Option(new HoodieTableMetaClient(conf, path.get, true))

Review comment:
   Udit's PR has this path handling. Should we merge part of his PR first? 
https://github.com/apache/hudi/pull/1702/files#diff-9a21766ebf794414f94b302bcb968f41R31
   With this, we can let users call `.load(basePath)` or `.load(basePath + 
"/*/*")` for COW, MOR and incremental queries.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #1704: [HUDI-115] Enhance OverwriteWithLatestAvroPayload to also respect ordering value of record in storage

2020-07-14 Thread GitBox


n3nash commented on pull request #1704:
URL: https://github.com/apache/hudi/pull/1704#issuecomment-658453437


   @bhasudha let me know if you need any clarification



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-14 Thread GitBox


vinothchandar commented on a change in pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#discussion_r454670467



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -58,26 +60,28 @@ class DefaultSource extends RelationProvider
   throw new HoodieException("'path' must be specified.")
 }
 
+// Try to create hoodie table meta client from the give path
+// TODO: Smarter path handling
+val metaClient = try {
+  val conf = sqlContext.sparkContext.hadoopConfiguration
+  Option(new HoodieTableMetaClient(conf, path.get, true))

Review comment:
   >Snapshot for MOR: only support basePath
   
   Let me think about this more. We need to support some form of globbing for 
MOR/Snapshot query. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-14 Thread GitBox


vinothchandar commented on a change in pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#discussion_r454670023



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -65,7 +66,7 @@ object DataSourceReadOptions {
 * This eases migration from old configs to new configs.
 */
   def translateViewTypesToQueryTypes(optParams: Map[String, String]) : 
Map[String, String] = {
-val translation = Map(VIEW_TYPE_READ_OPTIMIZED_OPT_VAL -> 
QUERY_TYPE_SNAPSHOT_OPT_VAL,
+val translation = Map(VIEW_TYPE_READ_OPTIMIZED_OPT_VAL -> 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL,

Review comment:
   > If they are using VIEW_TYPE_READ_OPTIMIZED_OPT_VAL (deprecated) on MOR 
in their code, after upgrading to the next release, the code will run a snapshot 
query instead of an RO query.
   
   we have been logging a warning for some time on the use of the deprecated 
configs, so I think it's fair to do the right thing here moving forward and 
call this out in the release notes. Let me push some changes.. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] michetti edited a comment on issue #1789: [SUPPORT] What jars are needed to run on AWS Glue 1.0 ?

2020-07-14 Thread GitBox


michetti edited a comment on issue #1789:
URL: https://github.com/apache/hudi/issues/1789#issuecomment-658432371


   Thanks for clarifying @umehrot2. I actually just found out that shading 
org.eclipse.jetty was recently merged in Hudi, so we should be good without 
changes from the next release: https://github.com/apache/hudi/pull/1781



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] michetti commented on issue #1789: [SUPPORT] What jars are needed to run on AWS Glue 1.0 ?

2020-07-14 Thread GitBox


michetti commented on issue #1789:
URL: https://github.com/apache/hudi/issues/1789#issuecomment-658432371


   Thanks for clarifying @umehrot2. I actually just found out that shading of 
org.eclipse.jetty was recently merged in Hudi, so starting with the next release 
we should be good without these changes: https://github.com/apache/hudi/pull/1781



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-14 Thread GitBox


asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658422907


   @bvaradar you are right, we are looking for clustering. Do you have any 
timeline in mind for when this will be available, or any branch we could look at?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] michetti edited a comment on issue #1789: [SUPPORT] What jars are needed to run on AWS Glue 1.0 ?

2020-07-14 Thread GitBox


michetti edited a comment on issue #1789:
URL: https://github.com/apache/hudi/issues/1789#issuecomment-658373503


   Hey @GrigorievNick, I saw the issue was closed but if I understood 
correctly, the link you posted is about AWS Athena and how it can work with 
Hudi tables registered in the AWS Glue catalog, while the issue is about 
getting Hudi to work on AWS Glue Jobs (AWS serverless Spark service). Not sure 
if I missed something?
   
   I was having the same error as @WilliamWhispell, and from what I could find, 
it seems to be related to a version 
   mismatch between the org.eclipse.jetty jars required by Hudi and the AWS 
Glue Jobs runtime.
   
   For example, Timeline service depends on Javalin 2.8.0, which in turn 
requires Jetty version 9.4.15.v20190215:
   - 
https://github.com/apache/hudi/blob/release-0.5.3/hudi-timeline-service/pom.xml#L111
   - https://github.com/tipsy/javalin/blob/javalin-2.8.0/pom.xml#L43
   
   While Spark 2.4.3 (this is the version Glue Jobs 1.0 runtime uses) depends 
on Jetty version 9.3.24.v20180605:
   - https://github.com/apache/spark/blob/v2.4.3/pom.xml#L137
   
   I got it working by shading _org.eclipse.jetty._  in the spark-bundle, by 
adding the following 
[here](https://github.com/apache/hudi/blob/release-0.5.3/packaging/hudi-spark-bundle/pom.xml#L99):
   ```xml
   <relocation>
     <pattern>org.eclipse.jetty.</pattern>
     <shadedPattern>org.apache.hudi.org.eclipse.jetty.</shadedPattern>
   </relocation>
   ```
   
   @WilliamWhispell, I'm not sure there is a better way, but with Hudi 0.5.3 on 
AWS Glue Jobs 1.0, I needed the following jars:
   - httpclient-4.5.12.jar (due to 
[this](https://forums.aws.amazon.com/thread.jspa?messageID=930176) other error)
   - spark-avro_2.11-2.4.3.jar
   - hudi-spark-bundle_2.11-0.5.3.jar (your own, with the changes above)
   
   And remember that you also need to configure spark the way it is described 
in Hudi documentation:
   ```scala
   val sparkConf: SparkConf = new SparkConf();
   sparkConf.set("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer");
   sparkConf.set("spark.sql.hive.convertMetastoreParquet", "false");
 
   val sparkContext: SparkContext = new SparkContext(sparkConf)
   val glueContext: GlueContext = new GlueContext(sparkContext)
   val spark: SparkSession = glueContext.getSparkSession
   ```
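   
   As a sanity check after wiring this up, here is a minimal hedged write sketch (the `df` DataFrame, field names, table name and S3 path are hypothetical; option constants mirror the ones used in the Hudi quickstart):
   ```scala
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   
   df.write
     .format("org.apache.hudi")
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")    // hypothetical record key field
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")   // hypothetical precombine field
     .option(HoodieWriteConfig.TABLE_NAME, "glue_test_table")
     .mode("append")
     .save("s3://your-bucket/hudi/glue_test_table")
   ```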



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #1789: [SUPPORT] What jars are needed to run on AWS Glue 1.0 ?

2020-07-14 Thread GitBox


umehrot2 commented on issue #1789:
URL: https://github.com/apache/hudi/issues/1789#issuecomment-658387970


   @michetti Thanks for the pointers here. You are right that the link posted 
is about AWS Athena, whereas what you folks are trying to do is run Hudi on AWS Glue 
jobs. Since AWS Glue does not officially support Hudi, you may have to rely on 
workarounds like the dependency shading you are doing currently to get around 
it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

2020-07-14 Thread GitBox


umehrot2 commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-658385297


   Like @bvaradar mentioned, in the first query the glob pattern matches 
950 folders, which are then listed in parallel across the cluster using the spark 
context. In the second query the glob pattern matches 4750 files because of the 
extra *, so spark now has to list 4750 paths in parallel using the spark context. 
This most likely is the cause of the performance difference. On top of this, I 
think the time taken by **HoodieROTablePathFilter**, which is applied per 
file, might be amplifying it further.
   
   Can you run similar test queries on a plain parquet table (non-hudi 
table) and observe the performance difference in listing? I think you may see 
somewhat similar behavior.
   
   ```
   spark.read.parquet(globPath)
   ```
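   
   To make that comparison concrete, a small hedged timing sketch (paths are hypothetical, a SparkSession `spark` is assumed in scope, and the table is a plain non-Hudi parquet table with the same layout):
   ```scala
   // Returns (row count, elapsed milliseconds) for a given glob path.
   def timedCount(globPath: String): (Long, Long) = {
     val start = System.nanoTime()
     val cnt = spark.read.parquet(globPath).count()
     (cnt, (System.nanoTime() - start) / 1000000L)
   }
   
   println(timedCount("s3://bucket/plain_table/*/*/*"))    // glob matches partition folders
   println(timedCount("s3://bucket/plain_table/*/*/*/*"))  // extra * matches individual files
   ```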



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] michetti commented on issue #1789: [SUPPORT] What jars are needed to run on AWS Glue 1.0 ?

2020-07-14 Thread GitBox


michetti commented on issue #1789:
URL: https://github.com/apache/hudi/issues/1789#issuecomment-658373503


   Hey @GrigorievNick, I saw the issue was closed but if I understood 
correctly, the link you posted is about AWS Athena and how it can work with 
Hudi tables registered in the AWS Glue catalog, while the issue is about 
getting Hudi to work on AWS Glue Jobs (AWS serverless Spark service). Not sure 
if I missed something?
   
   I was having the same error as @WilliamWhispell, and from what I could 
find, it seems to be related to a version 
   mismatch between the org.eclipse.jetty jars required by Hudi and the AWS 
Glue Jobs runtime.
   
   For example, Timeline service depends on Javalin 2.8.0, which in turn 
requires Jetty version 9.4.15.v20190215:
   - 
https://github.com/apache/hudi/blob/release-0.5.3/hudi-timeline-service/pom.xml#L111
   - https://github.com/tipsy/javalin/blob/javalin-2.8.0/pom.xml#L43
   
   While Spark 2.4.3 (this is the version Glue Jobs 1.0 runtime uses) depends 
on Jetty version 9.3.24.v20180605:
   - https://github.com/apache/spark/blob/v2.4.3/pom.xml#L137
   
   I got it working by shading _org.eclipse.jetty._  in the spark-bundle, by 
adding the following 
[here](https://github.com/apache/hudi/blob/release-0.5.3/packaging/hudi-spark-bundle/pom.xml#L99):
   ```xml
   <relocation>
     <pattern>org.eclipse.jetty.</pattern>
     <shadedPattern>org.apache.hudi.org.eclipse.jetty.</shadedPattern>
   </relocation>
   ```
   
   @WilliamWhispell, I'm not sure there is a better way, but with Hudi 0.5.3 on 
AWS Glue Jobs 1.0, I needed the following jars:
   - httpclient-4.5.12.jar (due to 
[this](https://forums.aws.amazon.com/thread.jspa?messageID=930176) other error)
   - spark-avro_2.11-2.4.3.jar
   - hudi-spark-bundle_2.11-0.5.3.jar (your own, with the changes above)
   
   And remember that you also need to configure spark the way it is described 
in Hudi documentation:
   ```scala
   val sparkConf: SparkConf = new SparkConf();
   sparkConf.set("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer");
   sparkConf.set("spark.sql.hive.convertMetastoreParquet", "false");
 
   val sparkContext: SparkContext = new SparkContext(sparkConf)
   val glueContext: GlueContext = new GlueContext(sparkContext)
   val spark: SparkSession = glueContext.getSparkSession
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] srsteinmetz edited a comment on issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-14 Thread GitBox


srsteinmetz edited a comment on issue #1830:
URL: https://github.com/apache/hudi/issues/1830#issuecomment-658363753


   Accidentally posted early. Closed and reopened after editing post.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] srsteinmetz commented on issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-14 Thread GitBox


srsteinmetz commented on issue #1830:
URL: https://github.com/apache/hudi/issues/1830#issuecomment-658363753


   Accidentally posted 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1087) Realtime Record Reader needs to handle decimal types

2020-07-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1087:
-
Labels: pull-request-available  (was: )

> Realtime Record Reader needs to handle decimal types
> 
>
> Key: HUDI-1087
> URL: https://issues.apache.org/jira/browse/HUDI-1087
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> For MOR, Realtime queries, decimal types are not getting handled correctly 
> resulting in the following exception:
>  
>  
> {{scala> spark.sql("select * from testTable_rt").show
> java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be 
> cast to org.apache.hadoop.hive.serde2.io.HiveDecimalWritable
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableHiveDecimalObjectInspector.getPrimitiveWritableObject(WritableHiveDecimalObjectInspector.java:41)
>   at 
> org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:107)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:291)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:283)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}
> {{}}
> {{Issue : [https://github.com/apache/hudi/issues/1790]}}
> {{}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] zhedoubushishi opened a new pull request #1831: [HUDI-1087] Handle decimal type for realtime record reader with SparkSQL

2020-07-14 Thread GitBox


zhedoubushishi opened a new pull request #1831:
URL: https://github.com/apache/hudi/pull/1831


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   JIRA https://issues.apache.org/jira/browse/HUDI-1087  
   ISSUE https://github.com/apache/hudi/issues/1790
   
   This [pr](https://github.com/apache/hudi/pull/1677/files) calls 
```HiveDecimalWritable.enforcePrecisionScale``` to handle decimal type for Hive 
realtime querying. But ```HiveDecimalWritable.enforcePrecisionScale``` is only 
available for Hive 2.x, so when using SparkSQL (which uses ```1.2.1.spark2``` 
version), it will throw a ```methodNotFound``` error.
   
   This PR replaces ```HiveDecimalWritable.enforcePrecisionScale``` with 
```HiveDecimalUtils.enforcePrecisionScale``` which is compatible with both Hive 
2.x and Hive 1.2.1.spark2.
   
   ## Brief change log
   
   - Replaced ```HiveDecimalWritable.enforcePrecisionScale``` with 
```HiveDecimalUtils.enforcePrecisionScale``` which is compatible with both Hive 
2.x and Hive 1.2.1.spark2.
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] srsteinmetz closed issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-14 Thread GitBox


srsteinmetz closed issue #1830:
URL: https://github.com/apache/hudi/issues/1830


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] srsteinmetz opened a new issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-14 Thread GitBox


srsteinmetz opened a new issue #1830:
URL: https://github.com/apache/hudi/issues/1830


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   This issue appears to be similar to: 
https://github.com/apache/hudi/issues/1728
   While using Spark Streaming to read a Kinesis stream and upsert records to a 
MoR table, we are seeing the processing time increase over time.
   This processing time increase occurs both on new tables and large existing 
tables.
   
   Increasing processing time on new table:
   ![Empty Table Processing Time 
Increase](https://user-images.githubusercontent.com/3799859/87455926-0dd76e80-c5bb-11ea-90fc-7013c018af07.JPG)
   
   Increasing processing time on existing table with 1.4 billion records:
   ![Existing Table Processing Time 
Increase](https://user-images.githubusercontent.com/3799859/87455939-116af580-c5bb-11ea-84f2-47b97bcf11a5.JPG)
   
   From looking at the Spark UI it seems like the job that is increasing in 
duration is countByKey at WorkloadProfile.java:67:
   ![WorkloadProfile Execution 
Time](https://user-images.githubusercontent.com/3799859/87459080-c30c2580-c5bf-11ea-8f94-773aa8cd1c4f.JPG)
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Set up a Kinesis stream. We are using a stream with 200 shards which 
allows us to stream > 10K records/sec
   It's unlikely the source used matters for this issue. This can most 
likely be replicated with Kafka or any other source.
   2. Create a Spark Streaming application to read from the source and upsert 
to a MoR Hudi table.
   ```scala
   val spark = SparkSession
     .builder()
     .appName("SparkStreamingTest")
     .master(args.lift(0).getOrElse("local[*]"))
     // Hudi config settings
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .config("spark.sql.hive.convertMetastoreParquet", "false")
     // Spark Streaming config settings
     .config("spark.streaming.blockInterval", SPARK_STREAMING_BLOCK_INTERVAL_MILLIS.toInt.toString)
     // Spark config settings
     .config("spark.driver.cores", CORES_PER_EXECUTOR.toString)
     .config("spark.driver.memory", (MEMORY_PER_EXECUTOR.toInt - 1).toString + "g")
     .config("spark.executor.cores", CORES_PER_EXECUTOR.toString)
     .config("spark.executor.memory", (MEMORY_PER_EXECUTOR.toInt - 1).toString + "g")
     .config("spark.yarn.executor.memoryOverhead", (MEMORY_OVERHEAD_PER_EXECUTOR.toInt + 1).toString + "g")
     .config("spark.executor.instances", TOTAL_EXECUTORS.toString)
     // Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize
     .config("spark.default.parallelism", PARALLELISM.toString)
     // Sets the number of partitions for joins and aggregations
     .config("spark.sql.shuffle.partitions", PARALLELISM.toString)
     // Dynamically increase/decrease number of executors
     .config("spark.dynamicAllocation.enabled", "false")
     .config("spark.executor.extraJavaOptions", "-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof")
     .config("spark.sql.parquet.writeLegacyFormat", "true")
     .getOrCreate()
   ```
   ```scala
   val test = println("this sucks")
   ```
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-07-14 Thread GitBox


prashantwason commented on pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#issuecomment-658325538


   @vinothchandar I am not currently blocked on this. Please take a fine-toothed-comb look so 
we can make this a model for adding new file formats to HUDI. 
   
   Please suggest better testing approaches / extra documentation / code 
comments.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zuyanton opened a new issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2020-07-14 Thread GitBox


zuyanton opened a new issue #1829:
URL: https://github.com/apache/hudi/issues/1829


   Hudi MoR read performance gets slower on tables with many (1000+) 
partitions stored in S3. When running a simple ```spark.sql("select * from 
table_ro").count``` command, we observe in the Spark UI that for the first 2.5 minutes no 
spark jobs get scheduled and the load on the cluster during that period is almost 
non-existent.   
   ![select star 
ro](https://user-images.githubusercontent.com/67354813/87452475-1e391a80-c5b6-11ea-9f63-6e6aa877c20f.PNG)
 

   When looking into the logs to figure out what is going on during that period, we 
observe that for the first two and a half minutes Hudi is busy running 
```HoodieParquetInputFormat.listStatus``` ([code 
link](https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java#L68)).
 I placed timer log lines around various parts of that function and was able 
to narrow it down to this line: 
https://github.com/apache/hudi/blob/f5dc8ca733014d15a6d7966a5b6ae4308868adfa/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java#L103
This line's execution takes over 2/3 of the total time.
   If I understand correctly, this line lists all files in a single 
partition.   
   Looks like this "overhead" depends linearly on the number of partitions: 
increasing the number of partitions to 2000 almost doubles the overhead, and the cluster 
just runs ```HoodieParquetInputFormat.listStatus``` before starting to execute 
any spark jobs. 
   
   **To Reproduce**
   see code snippet bellow 
   
   * Hudi version : master branch
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.6
   
   * Hadoop version : 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   ```
   import org.apache.spark.sql.functions._
   import org.apache.hudi.hive.MultiPartKeysValueExtractor
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.hudi.keygen.ComplexKeyGenerator
   import org.apache.hadoop.hive.conf.HiveConf
   val hiveConf = new HiveConf()
   val hiveMetastoreURI = 
hiveConf.get("hive.metastore.uris").replaceAll("thrift://", "")
   val hiveServer2URI = hiveMetastoreURI.substring(0, 
hiveMetastoreURI.lastIndexOf(":"))
   var hudiOptions = Map[String,String](
 HoodieWriteConfig.TABLE_NAME → "testTable1",
 "hoodie.consistency.check.enabled"->"true",
 "hoodie.compact.inline.max.delta.commits"->"100",
 "hoodie.compact.inline"->"true",
 DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "MERGE_ON_READ",
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "pk",
 DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> 
classOf[ComplexKeyGenerator].getName,
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY ->"partition",
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "sort_key",
 DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY → "true",
 DataSourceWriteOptions.HIVE_TABLE_OPT_KEY → "testTable1",
 DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY → "partition",
 DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY → 
classOf[MultiPartKeysValueExtractor].getName,
 DataSourceWriteOptions.HIVE_URL_OPT_KEY 
->s"jdbc:hive2://$hiveServer2URI:1"
   )
   
   spark.sql("drop table if exists testTable1_ro")
   spark.sql("drop table if exists testTable1_rt")
   var seq = Seq((1, 2, 3))
   for (i<- 2 to 1000) {
 seq = seq :+ (i, i , 1)
   }
   var df = seq.toDF("pk", "partition", "sort_key")
   //create table
   
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://testBucket/test/hudi/zuyanton/1/testTable1")
   //update table couple times
   
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://testBucket/test/hudi/zuyanton/1/testTable1")
   
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://testBucket/test/hudi/zuyanton/1/testTable1")
   
   //read table 
   spark.sql("select * from testTable_ro").count
   ```
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

2020-07-14 Thread GitBox


bhasudha commented on issue #1828:
URL: https://github.com/apache/hudi/issues/1828#issuecomment-658282320


   > Hi Guys,
   > 
   > Is it possible to retain only last commit? When I put 
'hoodie.cleaner.commits.retained': 1 in hudi_options I still have two last 
commits. One that is being written and the previous one. What I want to achieve 
is to have only last change and last parquet file.
   
   @kirkuz  providing some context. Cleaning and compaction happen in the 
background (asynchronous to ingestion itself). When the cleaner kicks in it 
would get rid of the older commit. If there is an ongoing write, generally 
there could be two possibilities - 
   1. the write would succeed. in which case based on 
`hoodie.cleaner.commits.retained` the cleaner would get rid of the old version 
when it triggers.
   2. the write would fail for some reason - in this case the cleaner would 
later get rid of the failed commit and retain the other  version (which is the 
last succeeded one)
   
   This is why you are seeing two commits. This should not affect the queries. 
Can you please elaborate on what you were looking for in terms of use case / 
performance concerns, etc., to help us understand better?
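   
   For reference, a hedged sketch of how the retention config is typically passed on the write path (the `df` DataFrame, field names, table name and path are hypothetical; only the cleaner-related option is the point here):
   ```scala
   df.write
     .format("org.apache.hudi")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.table.name", "my_table")
     // The cleaner still keeps the last completed commit plus any in-flight write,
     // which is why two commits can be visible even with a value of 1.
     .option("hoodie.cleaner.commits.retained", "1")
     .mode("append")
     .save("s3://bucket/hudi/my_table")
   ```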



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] kirkuz opened a new issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

2020-07-14 Thread GitBox


kirkuz opened a new issue #1828:
URL: https://github.com/apache/hudi/issues/1828


   Hi Guys, 
   
   Is it possible to retain only last commit? When I put 
'hoodie.cleaner.commits.retained': 1 in hudi_options I still have two last 
commits. One that is being written and the previous one. What I want to achieve 
is to have only last change and last parquet file.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1787: Exception During Insert

2020-07-14 Thread GitBox


bvaradar commented on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-658248015


   @asheeshgarg : If the table is represented as a simple parquet table, Presto 
queries will start showing duplicates when there are multiple file versions 
present, or could fail while writes are happening (no snapshot isolation). 
Creating the table using hive sync would ensure that only valid, single file 
versions are read. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1087) Realtime Record Reader needs to handle decimal types

2020-07-14 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1087:
-
Priority: Blocker  (was: Major)

> Realtime Record Reader needs to handle decimal types
> 
>
> Key: HUDI-1087
> URL: https://issues.apache.org/jira/browse/HUDI-1087
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.0
>
>
> For MOR, Realtime queries, decimal types are not getting handled correctly 
> resulting in the following exception:
>  
>  
> {{scala> spark.sql("select * from testTable_rt").show
> java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be 
> cast to org.apache.hadoop.hive.serde2.io.HiveDecimalWritable
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableHiveDecimalObjectInspector.getPrimitiveWritableObject(WritableHiveDecimalObjectInspector.java:41)
>   at 
> org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:107)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:291)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:283)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}
> {{}}
> {{Issue : [https://github.com/apache/hudi/issues/1790]}}
> {{}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar closed issue #1789: [SUPPORT] What jars are needed to run on AWS Glue 1.0 ?

2020-07-14 Thread GitBox


bvaradar closed issue #1789:
URL: https://github.com/apache/hudi/issues/1789


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1790: [SUPPORT] Querying MoR tables with DecimalType columns via Spark SQL fails

2020-07-14 Thread GitBox


bvaradar commented on issue #1790:
URL: https://github.com/apache/hudi/issues/1790#issuecomment-658245237


   Closing this ticket as we have a jira to track this issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

2020-07-14 Thread GitBox


bvaradar commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-658244220


   The related code (HoodieROTablePathFilter) does not seem to have any 
relevant recent changes. 
   
   @zherenyu831 From the control flow, since Spark deciphers the glob path, it 
first performs the listing of all matching entities, and this is where I 
think it is slower, trying to list files under .aux. One option to try (for 
experimentation) is to skip the ".hoodie" folder in the glob pattern and see if it is 
faster. 
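   
   A hedged sketch of that experiment (paths are hypothetical; a SparkSession `spark` is assumed in scope): root the glob one level below the base path so the listing never descends into the `.hoodie` metadata folder.
   ```scala
   // Partition-only glob that excludes the .hoodie folder at the base path
   val df = spark.read
     .format("org.apache.hudi")
     .load("s3://bucket/my_table/2020/*/*")
   ```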



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar closed issue #1806: [SUPPORT] Deltastreamer can`t validate rewritten record that is valid

2020-07-14 Thread GitBox


bvaradar closed issue #1806:
URL: https://github.com/apache/hudi/issues/1806


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

2020-07-14 Thread GitBox


bvaradar commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-658237088


   @jcunhafonte :  This could happen when there are no more files to be 
ingested while running in non-continuous mode. I have opened a jira to get it 
fixed in 0.6.0:  https://issues.apache.org/jira/browse/HUDI-1091. With no 
input data, automatic schema resolution won't be possible. In continuous mode, 
we do cache the previous schema registry instance to handle this case. Can you 
try with that? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1091) Handle empty input batch gracefully in ParquetDFSSource

2020-07-14 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1091:


Assignee: Balaji Varadarajan

> Handle empty input batch gracefully in ParquetDFSSource
> ---
>
> Key: HUDI-1091
> URL: https://issues.apache.org/jira/browse/HUDI-1091
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1813]
>  Looking at 0.5.3, it is possible the below exception can happen when running 
> in standalone mode and the next batch to write is empty.
> ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down 
> org.apache.hudi.exception.HoodieException: Please provide a valid schema 
> provider class! at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1091) Handle empty input batch gracefully in ParquetDFSSource

2020-07-14 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1091:
-
Status: Open  (was: New)

> Handle empty input batch gracefully in ParquetDFSSource
> ---
>
> Key: HUDI-1091
> URL: https://issues.apache.org/jira/browse/HUDI-1091
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1813]
>  Looking at 0.5.3, it is possible the below exception can happen when running 
> in standalone mode and the next batch to write is empty.
> ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down 
> org.apache.hudi.exception.HoodieException: Please provide a valid schema 
> provider class! at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1091) Handle empty input batch gracefully in ParquetDFSSource

2020-07-14 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1091:


 Summary: Handle empty input batch gracefully in ParquetDFSSource
 Key: HUDI-1091
 URL: https://issues.apache.org/jira/browse/HUDI-1091
 Project: Apache Hudi
  Issue Type: Bug
  Components: DeltaStreamer
Reporter: Balaji Varadarajan
 Fix For: 0.6.0


[https://github.com/apache/hudi/issues/1813]

 Looking at 0.5.3, it is possible the below exception can happen when running 
in standalone mode and the next batch to write is empty.

ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down 
org.apache.hudi.exception.HoodieException: Please provide a valid schema 
provider class! at 
org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
 at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
 at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226) 
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
 at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928) at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937) at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-14 Thread GitBox


bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658221173


   @asheeshgarg :  I think what you are looking for is clustering (not 
compaction) of files, which is under development (please see 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance).
 For your setup, a good strategy would be to have a single hudi writer read one 
or more of these datasets and ingest to hudi. Hudi supports file sizing - 
please see 
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles
 for more details. 
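   
   A hedged example of the file-sizing knobs that FAQ refers to (option keys as I recall them from the Hudi config docs, so please double-check there; values are illustrative, and the other required write options such as record key and precombine field are omitted for brevity):
   ```scala
   df.write
     .format("org.apache.hudi")
     .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString)    // target base file size ~120MB
     .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // bin-pack inserts into files below ~100MB
     .option("hoodie.table.name", "my_table")
     .mode("append")
     .save("s3://bucket/hudi/my_table")
   ```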



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] aditanase commented on pull request #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema

2020-07-14 Thread GitBox


aditanase commented on pull request #1406:
URL: https://github.com/apache/hudi/pull/1406#issuecomment-658221068


   @umehrot2 agreed and it's a valid point for when you control the schema, 
which in this case is defined by a customer upstream from us.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 commented on issue #506: explodeRecordRDDWithFileComparisons is costly with HoodieBloomIndex/range pruning=on

2020-07-14 Thread GitBox


tooptoop4 commented on issue #506:
URL: https://github.com/apache/hudi/issues/506#issuecomment-658220450


   I seem to have a similar issue, running an upsert of a 700MB csv (twice, i.e. repeating 
the same csv upsert the next day) with 16gb executor memory and a shuffle parallelism 
of 16: 
   
   
![image](https://user-images.githubusercontent.com/33283496/87439033-2f028400-c5e8-11ea-953d-fdf5dd23a91c.png)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1079) Cannot upsert on schema with Array of Record with single field

2020-07-14 Thread Adrian Tanase (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157429#comment-17157429
 ] 

Adrian Tanase commented on HUDI-1079:
-

[~vinoth] - thanks for the pointer, I'll take a look around the 
AvroConversionHelper. It is not a blocker now, as we're still evaluating hudi 
against delta and iceberg, but it could be down the line. For what it's worth, 
both delta and iceberg support the above case without issues.

[~uditme] - I agree that if I fall back to an Array of primitives it works; it was 
one of the first tests I tried. However, I don't always control the upstream 
schema. As long as parquet, avro and spark all support structs with 1 field, I 
think this is a valid issue.

> Cannot upsert on schema with Array of Record with single field
> --
>
> Key: HUDI-1079
> URL: https://issues.apache.org/jira/browse/HUDI-1079
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.5.3
> Environment: spark 2.4.4, local 
>Reporter: Adrian Tanase
>Priority: Major
> Fix For: 0.6.0
>
>
> I am trying to trigger upserts on a table that has an array field with 
> records of just one field.
>  Here is the code to reproduce:
> {code:scala}
>   val spark = SparkSession.builder()
>   .master("local[1]")
>   .appName("SparkByExamples.com")
>   .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>   .getOrCreate();
>   // https://sparkbyexamples.com/spark/spark-dataframe-array-of-struct/
>   val arrayStructData = Seq(
> Row("James",List(Row("Java","XX",120),Row("Scala","XA",300))),
> Row("Michael",List(Row("Java","XY",200),Row("Scala","XB",500))),
> Row("Robert",List(Row("Java","XZ",400),Row("Scala","XC",250))),
> Row("Washington",null)
>   )
>   val arrayStructSchema = new StructType()
>   .add("name",StringType)
>   .add("booksIntersted",ArrayType(
> new StructType()
>   .add("bookName",StringType)
> //  .add("author",StringType)
> //  .add("pages",IntegerType)
>   ))
> val df = 
> spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
> {code}
> Running insert following by upsert will fail:
> {code:scala}
>   df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Overwrite)
>   .save(basePath)
>   df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Append)
>   .save(basePath)
> {code}
> If I create the books record with all the fields (at least 2), it works as 
> expected.
> The relevant part of the exception is this:
> {noformat}
> Caused by: java.lang.ClassCastException: required binary bookName (UTF8) is 
> not a groupCaused by: java.lang.ClassCastException: required binary bookName 
> (UTF8) is not a group at 
> org.apache.parquet.schema.Type.asGroupType(Type.java:207) at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
>  at 
> org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:536)
>  at 
> org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:486)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:141)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:95)
>  at 
> org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
>  at 
> org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
>  at 
> org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156) at 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) at 
> org.apache.hudi.client.utils.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
>  at 
> org.apache.hudi.common.util.queue.

[GitHub] [hudi] xushiyan commented on pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-14 Thread GitBox


xushiyan commented on pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#issuecomment-658217251


   @nsivabalan actually what I commented here is the 3rd option
   
   > @shenh062326 Maybe we shouldn't do the check in the payload class itself. 
Maybe `org.apache.hudi.io.HoodieMergeHandle#write` is better for this job. After
   > 
https://github.com/apache/hudi/blob/2603cfb33e272632d7f36a53e1b13fe86dbb8627/hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java#L222-L223
   > 
   > 
   > we check `combinedAvroRecord` against the delete field defined in the 
configs and convert it to Option.empty() if appropriate. As this feature is 
config-related, whoever owns the configs should do it. Need more inputs from 
@nsivabalan
   
   Basically it is about shifting the responsibility of converting records to 
`Option.empty()` to `HoodieMergeHandle`. WDYT?
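   
   A hedged Scala sketch of the idea being proposed (not the actual `HoodieMergeHandle` code; the field name, value type and helper are purely illustrative): after the payload merge, drop the record when the configured delete-marker field evaluates to true, so the payload class itself stays unaware of the config.
   ```scala
   import org.apache.avro.generic.GenericRecord
   
   // Treat the merged record as a delete when the configured marker field is set to "true".
   def applyDeleteMarker(merged: Option[GenericRecord], deleteField: String): Option[GenericRecord] =
     merged.filterNot { rec =>
       Option(rec.get(deleteField)).exists(_.toString.toBoolean)
     }
   ```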



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157418#comment-17157418
 ] 

sivabalan narayanan edited comment on HUDI-1082 at 7/14/20, 2:26 PM:
-

I ran some simulations. By and large, the distribution seems to be fine, within about a 
5% difference.

weights for partition1 [0.2,0.3, 0.5]

Testing for entry. Partition1 total inserts 924. Total inserts 10354
 Actual dist [0=995, 1=1488, 2=2517]
 Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10378
 Actual dist [0=1008, 1=1481, 2=2511]
 Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10393
 Actual dist [0=979, 1=1502, 2=2519]
 Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10401
 Actual dist [0=956, 1=1560, 2=2484]
 Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 990. Total inserts 10350
 Actual dist [0=1030, 1=1518, 2=2452]
 Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 12350
 Actual dist [0=983, 1=1524, 2=2493]
 Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 14350
 Actual dist [0=977, 1=1505, 2=2518]
 Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 16800
 Actual dist [0=1065, 1=1479, 2=2456]
 Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 21000
 Actual dist [0=1015, 1=1503, 2=2482]
 Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 1210. Total inserts 25170
 Actual dist [0=974, 1=1489, 2=2537]
 Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41368
 Actual dist [0=981, 1=1583, 2=2436]
 Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41246
 Actual dist [0=1027, 1=1464, 2=2509]
 Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41930
 Actual dist [0=1024, 1=1449, 2=2527]
 Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 94320
 Actual dist [0=1033, 1=1459, 2=2508]
 Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 9
 Actual dist [0=996, 1=1510, 2=2494]
 Expected dist [0=1036, 1=1489, 2=2475]


was (Author: shivnarayan):
I ran some simulations. By and large, distribution seems to be fine, with 10% 
variance.

weights for partition1 [0.2,0.3, 0.5]

Testing for entry. Partition1 total inserts 924. Total inserts 10354
Actual dist [0=995, 1=1488, 2=2517]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10378
Actual dist [0=1008, 1=1481, 2=2511]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10393
Actual dist [0=979, 1=1502, 2=2519]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10401
Actual dist [0=956, 1=1560, 2=2484]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 990. Total inserts 10350
Actual dist [0=1030, 1=1518, 2=2452]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 12350
Actual dist [0=983, 1=1524, 2=2493]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 14350
Actual dist [0=977, 1=1505, 2=2518]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 16800
Actual dist [0=1065, 1=1479, 2=2456]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 21000
Actual dist [0=1015, 1=1503, 2=2482]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 1210. Total inserts 25170
Actual dist [0=974, 1=1489, 2=2537]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41368
Actual dist [0=981, 1=1583, 2=2436]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41246
Actual dist [0=1027, 1=1464, 2=2509]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41930
Actual dist [0=1024, 1=1449, 2=2527]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 94320
Actual dist [0=1033, 1=1459, 2=2508]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 9
Actual dist [0=996, 1=1510, 2=2494]
Expected dist [0=1036, 1=1489, 2=2475]


[jira] [Commented] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157418#comment-17157418
 ] 

sivabalan narayanan commented on HUDI-1082:
---

I ran some simulations. By and large, the distribution seems to be fine, within 
about 10% variance.

Weights for partition1: [0.2, 0.3, 0.5]

Testing for entry. Partition1 total inserts 924. Total inserts 10354
Actual dist [0=995, 1=1488, 2=2517]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10378
Actual dist [0=1008, 1=1481, 2=2511]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10393
Actual dist [0=979, 1=1502, 2=2519]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10401
Actual dist [0=956, 1=1560, 2=2484]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 990. Total inserts 10350
Actual dist [0=1030, 1=1518, 2=2452]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 12350
Actual dist [0=983, 1=1524, 2=2493]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 14350
Actual dist [0=977, 1=1505, 2=2518]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 16800
Actual dist [0=1065, 1=1479, 2=2456]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 21000
Actual dist [0=1015, 1=1503, 2=2482]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 1210. Total inserts 25170
Actual dist [0=974, 1=1489, 2=2537]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41368
Actual dist [0=981, 1=1583, 2=2436]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41246
Actual dist [0=1027, 1=1464, 2=2509]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41930
Actual dist [0=1024, 1=1449, 2=2527]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 94320
Actual dist [0=1033, 1=1459, 2=2508]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 9
Actual dist [0=996, 1=1510, 2=2494]
Expected dist [0=1036, 1=1489, 2=2475]
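
For reference, a rough standalone sketch of the kind of check behind these numbers 
(harness and names assumed, not the actual simulation code): draw random record 
hashes, map each to a bucket via the per-partition insert count and the cumulative 
weights, then compare the counts against the expected split.
{code:java}
import java.util.Arrays;
import java.util.Random;

public class BucketDistributionSim {
  // Cumulative weights for partition1's three insert buckets (0.2, 0.3, 0.5).
  static final double[] CUM_WEIGHTS = {0.2, 0.5, 1.0};

  static int bucketFor(long hash, long inserts) {
    double r = (double) (hash % inserts) / inserts;
    for (int b = 0; b < CUM_WEIGHTS.length; b++) {
      if (r <= CUM_WEIGHTS[b]) {
        return b;
      }
    }
    return CUM_WEIGHTS.length - 1;
  }

  public static void main(String[] args) {
    long partition1Inserts = 990; // inserts targeted at partition1
    int samples = 5000;           // number of simulated record hashes
    int[] actual = new int[CUM_WEIGHTS.length];
    Random rnd = new Random();
    for (int i = 0; i < samples; i++) {
      actual[bucketFor(rnd.nextInt(Integer.MAX_VALUE), partition1Inserts)]++;
    }
    System.out.println("Actual dist   " + Arrays.toString(actual));
    System.out.println("Expected dist [" + Math.round(samples * 0.2) + ", "
        + Math.round(samples * 0.3) + ", " + Math.round(samples * 0.5) + "]");
  }
}
{code}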

> Bug in deciding the upsert/insert buckets
> -
>
> Key: HUDI-1082
> URL: https://issues.apache.org/jira/browse/HUDI-1082
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Affects Versions: 0.6.0
>Reporter: sivabalan narayanan
>Assignee: Hong Shen
>Priority: Major
> Fix For: 0.6.1
>
>
> In 
> [UpsertPartitioner|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java],
>  when getPartition(Object key) is called, the logic that determines where the 
> record is to be inserted relies on globalInsertCounts, whereas it should use 
> perPartitionInsertCount.
>  
> The weights for all target insert buckets are determined based on the total 
> inserts going into the partition of interest (check around line 200), whereas 
> when getPartition(key) is called, we use the global insert count to determine 
> the right bucket. 
>  
> For instance,
> P1: 3 insert buckets with weights 0.2, 0.5 and 0.3, with the total records to 
> be inserted being 100.
> P2: 4 buckets with weights 0.1, 0.8, 0.05, 0.05, with the total records to be 
> inserted being 10025. 
> So, the ideal allocation is
> P1: B0 -> 20, B1 -> 50, B2 -> 30
> P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503
>  
> getPartition() for a key is determined as follows:
> mod(hash value, inserts) / inserts.
> Instead of considering the inserts for the partition of interest, we currently 
> take the global insert count.
> Let's say these are the hash values for insert records in P1:
> 5, 14, 20, 25, 90, 500, 1001, 5180.
> record hash | expected bucket in P1 | actual bucket in P1 |
> 5     | B0 | B0
> 14   | B0 | B0
> 21   | B1  | B0
> 30  | B1 | B0
> 90 | B2 | B0
> 500 | B0 | B0
> 1490 | B2 | B1
> 10019 | B0 | B3
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
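
To make the reported mismatch concrete, here is a small self-contained 
illustration (not Hudi code; numbers taken from the example quoted above) of how 
the chosen bucket changes when the modulo ratio uses the global insert count 
instead of the per-partition one:
{code:java}
public class BucketChoiceExample {
  // P1's bucket weights 0.2, 0.5, 0.3, expressed cumulatively.
  static final double[] P1_CUM_WEIGHTS = {0.2, 0.7, 1.0};

  static int bucketFor(long hash, long inserts) {
    double r = (double) (hash % inserts) / inserts;
    for (int b = 0; b < P1_CUM_WEIGHTS.length; b++) {
      if (r <= P1_CUM_WEIGHTS[b]) {
        return b;
      }
    }
    return P1_CUM_WEIGHTS.length - 1;
  }

  public static void main(String[] args) {
    long p1Inserts = 100;       // inserts targeted at P1
    long globalInserts = 10125; // P1 + P2 inserts (100 + 10025)
    long hash = 90;
    // Per-partition count: 90 % 100 / 100.0 = 0.9   -> B2 (expected).
    System.out.println("per-partition bucket = B" + bucketFor(hash, p1Inserts));
    // Global count: 90 % 10125 / 10125.0 ~= 0.009   -> B0 (the reported bug).
    System.out.println("global bucket        = B" + bucketFor(hash, globalInserts));
  }
}
{code}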



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] asheeshgarg commented on issue #1787: Exception During Insert

2020-07-14 Thread GitBox


asheeshgarg commented on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-658191364


   @bhasudha it works with Presto and I am able to query the data fine; the data 
seems to be correct based on my queries. The only concern I have is whether it is 
missing anything that might hurt in the long run?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-14 Thread GitBox


asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658188686


   @bvaradar Balaji, please let me know if I need to set additional 
properties to achieve the behavior.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157346#comment-17157346
 ] 

sivabalan narayanan edited comment on HUDI-1082 at 7/14/20, 1:21 PM:
-

I guess you are missing out on partially filled buckets. So, can you try this: 
Partition1 has 3 insert buckets with weights 0.2, 0.3 and 0.5. 

Partition1's total inserts are 990; Partition2's total inserts are 10350. 

When you insert 1000 records into partition1, 20% should go to the first 
bucket, 30% to the 2nd and 50% to the 3rd.
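
As a quick sanity check on the expected split (a throwaway sketch, not project code):
{code:java}
public class ExpectedSplit {
  public static void main(String[] args) {
    double[] weights = {0.2, 0.3, 0.5}; // partition1's three insert buckets
    int inserts = 990;                  // total inserts targeted at partition1
    for (int b = 0; b < weights.length; b++) {
      // 990 * 0.2 = 198, 990 * 0.3 = 297, 990 * 0.5 = 495
      System.out.printf("bucket %d -> ~%d records%n", b, Math.round(inserts * weights[b]));
    }
  }
}
{code}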

 

 

 


was (Author: shivnarayan):
I guess you are missing out on partially filled buckets. So, can you try this: 
Partition1 has 3 insert buckets with weights 0.2, 0.3 and 0.5. Partition3 has 5 
insert buckets.

Partition1's total inserts are 990; Partition2's total inserts are 10350. 

When you insert 1000 records into partition1, 20% should go to the first 
bucket, 30% to the 2nd and 50% to the 3rd.

 

 

 

> Bug in deciding the upsert/insert buckets
> -
>
> Key: HUDI-1082
> URL: https://issues.apache.org/jira/browse/HUDI-1082
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Affects Versions: 0.6.0
>Reporter: sivabalan narayanan
>Assignee: Hong Shen
>Priority: Major
> Fix For: 0.6.1
>
>
> In 
> [UpsertPartitioner|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java],
>  when getPartition(Object key) is called, the logic that determines where the 
> record is to be inserted relies on globalInsertCounts, whereas it should use 
> perPartitionInsertCount.
>  
> The weights for all target insert buckets are determined based on the total 
> inserts going into the partition of interest (check around line 200), whereas 
> when getPartition(key) is called, we use the global insert count to determine 
> the right bucket. 
>  
> For instance,
> P1: 3 insert buckets with weights 0.2, 0.5 and 0.3, with the total records to 
> be inserted being 100.
> P2: 4 buckets with weights 0.1, 0.8, 0.05, 0.05, with the total records to be 
> inserted being 10025. 
> So, the ideal allocation is
> P1: B0 -> 20, B1 -> 50, B2 -> 30
> P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503
>  
> getPartition() for a key is determined as follows:
> mod(hash value, inserts) / inserts.
> Instead of considering the inserts for the partition of interest, we currently 
> take the global insert count.
> Let's say these are the hash values for insert records in P1:
> 5, 14, 20, 25, 90, 500, 1001, 5180.
> record hash | expected bucket in P1 | actual bucket in P1 |
> 5     | B0 | B0
> 14   | B0 | B0
> 21   | B1  | B0
> 30  | B1 | B0
> 90 | B2 | B0
> 500 | B0 | B0
> 1490 | B2 | B1
> 10019 | B0 | B3
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157346#comment-17157346
 ] 

sivabalan narayanan commented on HUDI-1082:
---

I guess you are missing out on partially filled buckets. So, can you try this: 
Partition1 has 3 insert buckets with weights 0.2, 0.3 and 0.5. Partition3 has 5 
insert buckets.

Partition1's total inserts are 990; Partition2's total inserts are 10350. 

When you insert 1000 records into partition1, 20% should go to the first 
bucket, 30% to the 2nd and 50% to the 3rd.

 

 

 

> Bug in deciding the upsert/insert buckets
> -
>
> Key: HUDI-1082
> URL: https://issues.apache.org/jira/browse/HUDI-1082
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Affects Versions: 0.6.0
>Reporter: sivabalan narayanan
>Assignee: Hong Shen
>Priority: Major
> Fix For: 0.6.1
>
>
> In 
> [UpsertPartitioner|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java],
>  when getPartition(Object key) is called, the logic that determines where the 
> record is to be inserted relies on globalInsertCounts, whereas it should use 
> perPartitionInsertCount.
>  
> The weights for all target insert buckets are determined based on the total 
> inserts going into the partition of interest (check around line 200), whereas 
> when getPartition(key) is called, we use the global insert count to determine 
> the right bucket. 
>  
> For instance,
> P1: 3 insert buckets with weights 0.2, 0.5 and 0.3, with the total records to 
> be inserted being 100.
> P2: 4 buckets with weights 0.1, 0.8, 0.05, 0.05, with the total records to be 
> inserted being 10025. 
> So, the ideal allocation is
> P1: B0 -> 20, B1 -> 50, B2 -> 30
> P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503
>  
> getPartition() for a key is determined as follows:
> mod(hash value, inserts) / inserts.
> Instead of considering the inserts for the partition of interest, we currently 
> take the global insert count.
> Let's say these are the hash values for insert records in P1:
> 5, 14, 20, 25, 90, 500, 1001, 5180.
> record hash | expected bucket in P1 | actual bucket in P1 |
> 5     | B0 | B0
> 14   | B0 | B0
> 21   | B1  | B0
> 30  | B1 | B0
> 90 | B2 | B0
> 500 | B0 | B0
> 1490 | B2 | B1
> 10019 | B0 | B3
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread Hong Shen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157341#comment-17157341
 ] 

Hong Shen edited comment on HUDI-1082 at 7/14/20, 12:53 PM:


It seems it does not matter; we just need to ensure the distribution of records 
to fileIds.

I added a testcase, testGetPartitioner2, which has two partitions "2016/09/26" 
and "2016/09/27". It generates 200 insert records for partition "2016/09/26" and 
2000 insert records for "2016/09/27". Since the fileSize limit is 1000 * 1024, 
partition "2016/09/26" will generate 2 fileGroups and partition "2016/09/27" 
will generate 20 fileGroups. Of the 200 insert records for partition 
"2016/09/26", approximately 100 should go to fileGroup1 and the rest to 
fileGroup2.

 
{code:java|title=TestUpsertPartitioner.java|borderStyle=solid}
@Test
public void testGetPartitioner2() throws Exception {
  String testPartitionPath1 = "2016/09/26";
  String testPartitionPath2 = "2016/09/27";

  HoodieWriteConfig config = makeHoodieClientConfigBuilder()
      .withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(0)
          .insertSplitSize(100).autoTuneInsertSplits(false).build())
      .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1000 * 1024).build())
      .build();

  HoodieClientTestUtils.fakeCommitFile(basePath, "001");

  metaClient = HoodieTableMetaClient.reload(metaClient);
  HoodieCopyOnWriteTable table = (HoodieCopyOnWriteTable) HoodieTable.create(metaClient, config, hadoopConf);

  // 200 inserts targeted at partition "2016/09/26" and 2000 at "2016/09/27".
  HoodieTestDataGenerator dataGenerator1 = new HoodieTestDataGenerator(new String[] {testPartitionPath1});
  List<HoodieRecord> insertRecords1 = dataGenerator1.generateInserts("001", 200);
  List<HoodieRecord> records1 = new ArrayList<>();
  records1.addAll(insertRecords1);

  HoodieTestDataGenerator dataGenerator2 = new HoodieTestDataGenerator(new String[] {testPartitionPath2});
  List<HoodieRecord> insertRecords2 = dataGenerator2.generateInserts("001", 2000);
  records1.addAll(insertRecords2);

  WorkloadProfile profile = new WorkloadProfile(jsc.parallelize(records1));
  UpsertPartitioner partitioner = new UpsertPartitioner(profile, jsc, table, config);

  // Count how many of partition1's inserts land in each bucket returned by getPartition().
  Map<Integer, Integer> partition2numRecords = new HashMap<>();
  for (HoodieRecord hoodieRecord : insertRecords1) {
    int partition = partitioner.getPartition(new Tuple2<>(
        hoodieRecord.getKey(), Option.ofNullable(hoodieRecord.getCurrentLocation())));
    if (!partition2numRecords.containsKey(partition)) {
      partition2numRecords.put(partition, 0);
    }
    int num = partition2numRecords.get(partition);
    partition2numRecords.put(partition, num + 1);
  }
  System.out.println(partition2numRecords);
}
{code}
Running it 5 times, the outputs are:
{code:java}
{20=106, 21=94}

{20=97, 21=103}

{20=95, 21=105}

{20=107, 21=93}

{20=113, 21=87}
 {code}


was (Author: shenhong):
It seems it does not matter; we just need to ensure the distribution of records 
to fileIds.

I added a testcase, testGetPartitioner2, which has two partitions "2016/09/26" 
and "2016/09/27". It generates 200 insert records for partition "2016/09/26" and 
2000 insert records for "2016/09/27". Since the fileSize limit is 1000 * 1024, 
partition "2016/09/26" will generate 2 fileGroups and partition "2016/09/27" 
will generate 20 fileGroups. Of the 200 insert records for partition 
"2016/09/26", approximately 100 should go to fileGroup1 and the rest to 
fileGroup2.

 
{code:java|title=TestUpsertPartitioner.java|borderStyle=solid}
}}

@Test

public void testGetPartitioner2() throws Exception {
 String testPartitionPath1 = "2016/09/26";
 String testPartitionPath2 = "2016/09/27";

 HoodieWriteConfig config = makeHoodieClientConfigBuilder()
 
.withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(0)
 .insertSplitSize(100).autoTuneInsertSplits(false).build())
 .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1000 * 
1024).build()).build();

 HoodieClientTestUtils.fakeCommitFile(basePath, "001");

 metaClient = HoodieTableMetaClient.reload(metaClient);
 HoodieCopyOnWriteTable table = (HoodieCopyOnWriteTable) 
HoodieTable.create(metaClient, config, hadoopConf);
 HoodieTestDataGenerator dataGenerator1 = new HoodieTestDataGenerator(new 
String[] \{testPartitionPath1});
 List insertRecords1 = dataGenerator1.generateInserts("001", 200);
 List records1 = new ArrayList<>();
 records1.addAll(insertRecords1);

 HoodieTestDataGenerator dataGenerator2 = new HoodieTestDataGenerator(new 
String[] \{testPartitionPath2});
 List insertRecords2 = dataGenerator2.generateInserts("001", 
2000);
 records1.addAll(insertRecords2);

 WorkloadProfile profile = new WorkloadProfile(jsc.parallelize(records1));
 UpsertPartitioner partitioner = new UpsertPartitioner(profile, jsc, table, 
config);
 Map partition2numRecords = new HashMap();
 for (HoodieRecord hoodieRecord: insertRecords1) {
 int partition = partitioner.getPartition(new Tuple2<>(
 hoodieRecord.getKey(), Option.ofNullable(hoodieRecord.getCurr

[jira] [Comment Edited] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread Hong Shen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157341#comment-17157341
 ] 

Hong Shen edited comment on HUDI-1082 at 7/14/20, 12:52 PM:


It seems it does not matter; we just need to ensure the distribution of records 
to fileIds.

I added a testcase, testGetPartitioner2, which has two partitions "2016/09/26" 
and "2016/09/27". It generates 200 insert records for partition "2016/09/26" and 
2000 insert records for "2016/09/27". Since the fileSize limit is 1000 * 1024, 
partition "2016/09/26" will generate 2 fileGroups and partition "2016/09/27" 
will generate 20 fileGroups. Of the 200 insert records for partition 
"2016/09/26", approximately 100 should go to fileGroup1 and the rest to 
fileGroup2.

 
{code:java|title=TestUpsertPartitioner.java|borderStyle=solid}
@Test
public void testGetPartitioner2() throws Exception {
  String testPartitionPath1 = "2016/09/26";
  String testPartitionPath2 = "2016/09/27";

  HoodieWriteConfig config = makeHoodieClientConfigBuilder()
      .withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(0)
          .insertSplitSize(100).autoTuneInsertSplits(false).build())
      .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1000 * 1024).build())
      .build();

  HoodieClientTestUtils.fakeCommitFile(basePath, "001");

  metaClient = HoodieTableMetaClient.reload(metaClient);
  HoodieCopyOnWriteTable table = (HoodieCopyOnWriteTable) HoodieTable.create(metaClient, config, hadoopConf);

  // 200 inserts targeted at partition "2016/09/26" and 2000 at "2016/09/27".
  HoodieTestDataGenerator dataGenerator1 = new HoodieTestDataGenerator(new String[] {testPartitionPath1});
  List<HoodieRecord> insertRecords1 = dataGenerator1.generateInserts("001", 200);
  List<HoodieRecord> records1 = new ArrayList<>();
  records1.addAll(insertRecords1);

  HoodieTestDataGenerator dataGenerator2 = new HoodieTestDataGenerator(new String[] {testPartitionPath2});
  List<HoodieRecord> insertRecords2 = dataGenerator2.generateInserts("001", 2000);
  records1.addAll(insertRecords2);

  WorkloadProfile profile = new WorkloadProfile(jsc.parallelize(records1));
  UpsertPartitioner partitioner = new UpsertPartitioner(profile, jsc, table, config);

  // Count how many of partition1's inserts land in each bucket returned by getPartition().
  Map<Integer, Integer> partition2numRecords = new HashMap<>();
  for (HoodieRecord hoodieRecord : insertRecords1) {
    int partition = partitioner.getPartition(new Tuple2<>(
        hoodieRecord.getKey(), Option.ofNullable(hoodieRecord.getCurrentLocation())));
    if (!partition2numRecords.containsKey(partition)) {
      partition2numRecords.put(partition, 0);
    }
    int num = partition2numRecords.get(partition);
    partition2numRecords.put(partition, num + 1);
  }
  System.out.println(partition2numRecords);
}
{code}
Running it 5 times, the outputs are:
{code:java}
{20=106, 21=94}

{20=97, 21=103}

{20=95, 21=105}

{20=107, 21=93}

{20=113, 21=87}
 {code}


was (Author: shenhong):
It seems it does not matter; we just need to ensure the distribution of records 
to fileIds.

I added a testcase, testGetPartitioner2, which has two partitions "2016/09/26" 
and "2016/09/27". It generates 200 insert records for partition "2016/09/26" and 
2000 insert records for "2016/09/27". Since the fileSize limit is 1000 * 1024, 
partition "2016/09/26" will generate 2 fileGroups and partition "2016/09/27" 
will generate 20 fileGroups. Of the 200 insert records for partition 
"2016/09/26", approximately 100 should go to fileGroup1 and the rest to 
fileGroup2.

 

{{{code:title=TestUpsertPartitioner.java|borderStyle=solid}}}

@Test

public void testGetPartitioner2() throws Exception {
 String testPartitionPath1 = "2016/09/26";
 String testPartitionPath2 = "2016/09/27";

 HoodieWriteConfig config = makeHoodieClientConfigBuilder()
 
.withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(0)
 .insertSplitSize(100).autoTuneInsertSplits(false).build())
 .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1000 * 
1024).build()).build();

 HoodieClientTestUtils.fakeCommitFile(basePath, "001");

 metaClient = HoodieTableMetaClient.reload(metaClient);
 HoodieCopyOnWriteTable table = (HoodieCopyOnWriteTable) 
HoodieTable.create(metaClient, config, hadoopConf);
 HoodieTestDataGenerator dataGenerator1 = new HoodieTestDataGenerator(new 
String[] \{testPartitionPath1});
 List insertRecords1 = dataGenerator1.generateInserts("001", 200);
 List records1 = new ArrayList<>();
 records1.addAll(insertRecords1);

 HoodieTestDataGenerator dataGenerator2 = new HoodieTestDataGenerator(new 
String[] \{testPartitionPath2});
 List insertRecords2 = dataGenerator2.generateInserts("001", 
2000);
 records1.addAll(insertRecords2);

 WorkloadProfile profile = new WorkloadProfile(jsc.parallelize(records1));
 UpsertPartitioner partitioner = new UpsertPartitioner(profile, jsc, table, 
config);
 Map partition2numRecords = new HashMap();
 for (HoodieRecord hoodieRecord: insertRecords1) {
 int partition = partitioner.getPartition(new Tuple2<>(
 hoodieRecord.getKey(), Option.ofNullable(hoodieRecord.getCur

[jira] [Commented] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread Hong Shen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157341#comment-17157341
 ] 

Hong Shen commented on HUDI-1082:
-

It seems it does not matter; we just need to ensure the distribution of records 
to fileIds.

I added a testcase, testGetPartitioner2, which has two partitions "2016/09/26" 
and "2016/09/27". It generates 200 insert records for partition "2016/09/26" and 
2000 insert records for "2016/09/27". Since the fileSize limit is 1000 * 1024, 
partition "2016/09/26" will generate 2 fileGroups and partition "2016/09/27" 
will generate 20 fileGroups. Of the 200 insert records for partition 
"2016/09/26", approximately 100 should go to fileGroup1 and the rest to 
fileGroup2.

 

{code:java|title=TestUpsertPartitioner.java|borderStyle=solid}
@Test
public void testGetPartitioner2() throws Exception {
  String testPartitionPath1 = "2016/09/26";
  String testPartitionPath2 = "2016/09/27";

  HoodieWriteConfig config = makeHoodieClientConfigBuilder()
      .withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(0)
          .insertSplitSize(100).autoTuneInsertSplits(false).build())
      .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1000 * 1024).build())
      .build();

  HoodieClientTestUtils.fakeCommitFile(basePath, "001");

  metaClient = HoodieTableMetaClient.reload(metaClient);
  HoodieCopyOnWriteTable table = (HoodieCopyOnWriteTable) HoodieTable.create(metaClient, config, hadoopConf);

  // 200 inserts targeted at partition "2016/09/26" and 2000 at "2016/09/27".
  HoodieTestDataGenerator dataGenerator1 = new HoodieTestDataGenerator(new String[] {testPartitionPath1});
  List<HoodieRecord> insertRecords1 = dataGenerator1.generateInserts("001", 200);
  List<HoodieRecord> records1 = new ArrayList<>();
  records1.addAll(insertRecords1);

  HoodieTestDataGenerator dataGenerator2 = new HoodieTestDataGenerator(new String[] {testPartitionPath2});
  List<HoodieRecord> insertRecords2 = dataGenerator2.generateInserts("001", 2000);
  records1.addAll(insertRecords2);

  WorkloadProfile profile = new WorkloadProfile(jsc.parallelize(records1));
  UpsertPartitioner partitioner = new UpsertPartitioner(profile, jsc, table, config);

  // Count how many of partition1's inserts land in each bucket returned by getPartition().
  Map<Integer, Integer> partition2numRecords = new HashMap<>();
  for (HoodieRecord hoodieRecord : insertRecords1) {
    int partition = partitioner.getPartition(new Tuple2<>(
        hoodieRecord.getKey(), Option.ofNullable(hoodieRecord.getCurrentLocation())));
    if (!partition2numRecords.containsKey(partition)) {
      partition2numRecords.put(partition, 0);
    }
    int num = partition2numRecords.get(partition);
    partition2numRecords.put(partition, num + 1);
  }
  System.out.println(partition2numRecords);
}
{code}

Running it 5 times, the outputs are:
{code:java}
{20=106, 21=94}

{20=97, 21=103}

{20=95, 21=105}

{20=107, 21=93}

{20=113, 21=87}
{code}

> Bug in deciding the upsert/insert buckets
> -
>
> Key: HUDI-1082
> URL: https://issues.apache.org/jira/browse/HUDI-1082
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Affects Versions: 0.6.0
>Reporter: sivabalan narayanan
>Assignee: Hong Shen
>Priority: Major
> Fix For: 0.6.1
>
>
> In 
> [UpsertPartitioner|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java],
>  when getPartition(Object key) is called, the logic that determines where the 
> record is to be inserted relies on globalInsertCounts, whereas it should use 
> perPartitionInsertCount.
>  
> The weights for all target insert buckets are determined based on the total 
> inserts going into the partition of interest (check around line 200), whereas 
> when getPartition(key) is called, we use the global insert count to determine 
> the right bucket. 
>  
> For instance,
> P1: 3 insert buckets with weights 0.2, 0.5 and 0.3, with the total records to 
> be inserted being 100.
> P2: 4 buckets with weights 0.1, 0.8, 0.05, 0.05, with the total records to be 
> inserted being 10025. 
> So, the ideal allocation is
> P1: B0 -> 20, B1 -> 50, B2 -> 30
> P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503
>  
> getPartition() for a key is determined as follows:
> mod(hash value, inserts) / inserts.
> Instead of considering the inserts for the partition of interest, we currently 
> take the global insert count.
> Let's say these are the hash values for insert records in P1:
> 5, 14, 20, 25, 90, 500, 1001, 5180.
> record hash | expected bucket in P1 | actual bucket in P1 |
> 5     | B0 | B0
> 14   | B0 | B0
> 21   | B1  | B0
> 30  | B1 | B0
> 90 | B2 | B0
> 500 | B0 | B0
> 1490 | B2 | B1
> 10019 | B0 | B3
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] Mathieu1124 edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-14 Thread GitBox


Mathieu1124 edited a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-658151869


   Hi @vinothchandar @yanghua @leesf, as the refactor is finished, I have 
filed a Jira ticket to track this work.
   Please review the refactor work on this PR :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-14 Thread GitBox


Mathieu1124 commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-658151869


   Hi @vinothchandar @yanghua @leesf, as the refactor is finished, I have 
filed a Jira ticket to track this work.
   Please review it on this PR :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on pull request #1727: [Review] refactor hudi-client

2020-07-14 Thread GitBox


Mathieu1124 commented on pull request #1727:
URL: https://github.com/apache/hudi/pull/1727#issuecomment-658150685


   The refactor is finished; review moves to https://github.com/apache/hudi/pull/1827.
   Closing this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 edited a comment on pull request #1727: [Review] refactor hudi-client

2020-07-14 Thread GitBox


Mathieu1124 edited a comment on pull request #1727:
URL: https://github.com/apache/hudi/pull/1727#issuecomment-658150685


   The refactor is finished; review moves to https://github.com/apache/hudi/pull/1827.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 opened a new pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-14 Thread GitBox


Mathieu1124 opened a new pull request #1827:
URL: https://github.com/apache/hudi/pull/1827


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Refactor hudi-client to support multi-engine*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1089) Refactor hudi-client to support multi-engine

2020-07-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1089:
-
Labels: pull-request-available  (was: )

> Refactor hudi-client to support multi-engine
> 
>
> Key: HUDI-1089
> URL: https://issues.apache.org/jira/browse/HUDI-1089
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Usability
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> To make hudi support more engines, we should abstract the current hudi-client 
> module.
> This Jira aims to abstract the hudi-client module and implement the spark 
> engine code.
> The structure looks like this:
> hudi-client
>  ├── hudi-client-common
>  ├── hudi-client-spark
>  ├── hudi-client-flink
>  └── hudi-client-java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-14 Thread GitBox


nsivabalan commented on pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#issuecomment-658130555


   Looks like there are 2 options here.
   Option 1: Change the interface of combineAndGetUpdateValue and getInsertValue to 
take in the delete field.
   Check https://github.com/apache/hudi/pull/1792 on why we need to fix 
getInsertValue as well.
   Option 2: The approach taken in the patch, wherein we expose a setDeleteField 
method in OverwriteWithLatestAvroPayload, and this is set while generating 
HoodieRecords from the GenericRecord.
   
   Since making changes to getInsertValue will touch a lot of files, I guess we can 
go with Option 2 with some changes.
   Instead of exposing a new method, why can't we add an overloaded 
constructor? When HoodieSparkSqlWriter calls into
   ```
   DataSourceUtils.createHoodieRecord(gr,
   orderingVal, keyGenerator.getKey(gr), 
parameters(PAYLOAD_CLASS_OPT_KEY))
   ```
   we could also optionally pass in a delete field, which in turn would be passed 
into the constructor of OverwriteWithLatestAvroPayload. If the delete field is set 
by the user, we take that; otherwise we fall back to "_hoodie_is_deleted".
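
   For illustration, a rough toy sketch of the overloaded-constructor idea 
(standalone code with assumed names, not the actual Hudi payload classes):
   ```java
   import java.util.Map;
   import java.util.Optional;

   // Toy payload: carries an optional delete-marker field name, defaulting to
   // "_hoodie_is_deleted" when the user has not configured one.
   public class DeleteMarkerPayloadSketch {
     static final String DEFAULT_DELETE_FIELD = "_hoodie_is_deleted";

     private final Map<String, Object> record; // stand-in for the Avro GenericRecord
     private final String deleteField;

     // Existing-style constructor: keeps today's behavior.
     public DeleteMarkerPayloadSketch(Map<String, Object> record) {
       this(record, DEFAULT_DELETE_FIELD);
     }

     // Overload: the writer passes the user-configured delete field, if any.
     public DeleteMarkerPayloadSketch(Map<String, Object> record, String deleteField) {
       this.record = record;
       this.deleteField = deleteField;
     }

     // Mirrors getInsertValue: an empty Optional means the record is a delete.
     public Optional<Map<String, Object>> getInsertValue() {
       boolean deleted = Boolean.TRUE.equals(record.get(deleteField));
       return deleted ? Optional.empty() : Optional.of(record);
     }
   }
   ```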
   
   Let me know your thoughts.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #1824: [HUDI-996] Add functional test suite in hudi-client

2020-07-14 Thread GitBox


yanghua commented on pull request #1824:
URL: https://github.com/apache/hudi/pull/1824#issuecomment-658128510


   @vinothchandar Do you still have any concerns?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1082) Bug in deciding the upsert/insert buckets

2020-07-14 Thread Hong Shen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157284#comment-17157284
 ] 

Hong Shen commented on HUDI-1082:
-

OK, I will add a testcase and fix it.

> Bug in deciding the upsert/insert buckets
> -
>
> Key: HUDI-1082
> URL: https://issues.apache.org/jira/browse/HUDI-1082
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Affects Versions: 0.6.0
>Reporter: sivabalan narayanan
>Assignee: Hong Shen
>Priority: Major
> Fix For: 0.6.1
>
>
> In 
> [UpsertPartitioner|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java],
>  when getPartition(Object key) is called, the logic that determines where the 
> record is to be inserted relies on globalInsertCounts, whereas it should use 
> perPartitionInsertCount.
>  
> The weights for all target insert buckets are determined based on the total 
> inserts going into the partition of interest (check around line 200), whereas 
> when getPartition(key) is called, we use the global insert count to determine 
> the right bucket. 
>  
> For instance,
> P1: 3 insert buckets with weights 0.2, 0.5 and 0.3, with the total records to 
> be inserted being 100.
> P2: 4 buckets with weights 0.1, 0.8, 0.05, 0.05, with the total records to be 
> inserted being 10025. 
> So, the ideal allocation is
> P1: B0 -> 20, B1 -> 50, B2 -> 30
> P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503
>  
> getPartition() for a key is determined as follows:
> mod(hash value, inserts) / inserts.
> Instead of considering the inserts for the partition of interest, we currently 
> take the global insert count.
> Let's say these are the hash values for insert records in P1:
> 5, 14, 20, 25, 90, 500, 1001, 5180.
> record hash | expected bucket in P1 | actual bucket in P1 |
> 5     | B0 | B0
> 14   | B0 | B0
> 21   | B1  | B0
> 30  | B1 | B0
> 90 | B2 | B0
> 500 | B0 | B0
> 1490 | B2 | B1
> 10019 | B0 | B3
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1067) Replace the integer version field with HoodieLogBlockVersion data structure

2020-07-14 Thread Trevorzhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trevorzhang reassigned HUDI-1067:
-

Assignee: Trevorzhang

> Replace the integer version field with HoodieLogBlockVersion data structure
> ---
>
> Key: HUDI-1067
> URL: https://issues.apache.org/jira/browse/HUDI-1067
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: vinoyang
>Assignee: Trevorzhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1090) Presto reads Hudi data misaligned

2020-07-14 Thread obar (Jira)
obar created HUDI-1090:
--

 Summary: Presto reads Hudi data misaligned
 Key: HUDI-1090
 URL: https://issues.apache.org/jira/browse/HUDI-1090
 Project: Apache Hudi
  Issue Type: Bug
Reporter: obar
 Attachments: 15947189106642.png

Presto version: 333

Hudi version: 0.5.3

Hive reads the Hudi data correctly.

Presto reads the Hudi data misaligned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-909) Introduce hudi-client-flink module to support flink engine

2020-07-14 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-909:
-
Description: Introduce hudi-client-flink module to support flink engine 
based on new abstraction  (was: Introduce hudi-client-flink module to support 
flink engine based on new abstraction

*https://issues.apache.org/jira/browse/HUDI-1089*)

> Introduce hudi-client-flink module to support flink engine
> --
>
> Key: HUDI-909
> URL: https://issues.apache.org/jira/browse/HUDI-909
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
> Fix For: 0.6.0
>
>
> Introduce hudi-client-flink module to support flink engine based on new 
> abstraction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-909) Introduce hudi-client-flink module to support flink engine

2020-07-14 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-909:
-
Status: In Progress  (was: Open)

> Introduce hudi-client-flink module to support flink engine
> --
>
> Key: HUDI-909
> URL: https://issues.apache.org/jira/browse/HUDI-909
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>
> Introduce hudi-client-flink module to support flink engine based on new 
> abstraction
> *https://issues.apache.org/jira/browse/HUDI-1089*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-909) Introduce hudi-client-flink module to support flink engine

2020-07-14 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-909:
-
Fix Version/s: 0.6.0
   Issue Type: Task  (was: Wish)

> Introduce hudi-client-flink module to support flink engine
> --
>
> Key: HUDI-909
> URL: https://issues.apache.org/jira/browse/HUDI-909
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
> Fix For: 0.6.0
>
>
> Introduce hudi-client-flink module to support flink engine based on new 
> abstraction
> *https://issues.apache.org/jira/browse/HUDI-1089*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-909) Introduce hudi-client-flink module to support flink engine

2020-07-14 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-909:
-
Status: Open  (was: New)

> Introduce hudi-client-flink module to support flink engine
> --
>
> Key: HUDI-909
> URL: https://issues.apache.org/jira/browse/HUDI-909
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>
> Introduce hudi-client-flink module to support flink engine based on new 
> abstraction
> *https://issues.apache.org/jira/browse/HUDI-1089*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-909) Introduce hudi-client-flink module to support flink engine

2020-07-14 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-909:
-
Description: 
Introduce hudi-client-flink module to support flink engine based on new 
abstraction

*https://issues.apache.org/jira/browse/HUDI-1089*

  was:
To make hudi support more engines, we should redesign the high-level 
abstractions of hudi.

Such as HoodieTable, HoodieClient (including HoodieWriteClient, HoodieReadClient 
etc.), HoodieIndex, ActionExecutor etc.


> Introduce hudi-client-flink module to support flink engine
> --
>
> Key: HUDI-909
> URL: https://issues.apache.org/jira/browse/HUDI-909
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>
> Introduce hudi-client-flink module to support flink engine based on new 
> abstraction
> *https://issues.apache.org/jira/browse/HUDI-1089*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-909) Introduce hudi-client-flink module to support flink engine

2020-07-14 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-909:
-
Summary: Introduce hudi-client-flink module to support flink engine  (was: 
Introduce high level abstraction of hudi write client)

> Introduce hudi-client-flink module to support flink engine
> --
>
> Key: HUDI-909
> URL: https://issues.apache.org/jira/browse/HUDI-909
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>
> To make hudi support more engines, we should redesign the high-level 
> abstractions of hudi.
> Such as HoodieTable, HoodieClient (including 
> HoodieWriteClient, HoodieReadClient etc.), HoodieIndex, ActionExecutor etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

