[GitHub] [incubator-hudi] GSHF commented on issue #1563: [SUPPORT] When I package according to the package command in GitHub, I always report an error, such as

2020-04-25 Thread GitBox


GSHF commented on issue #1563:
URL: https://github.com/apache/incubator-hudi/issues/1563#issuecomment-619496046


   > Thanks for reporting this issue. You're right, we can't find 
`xmlenc:xmlenc:jar:sources:0.52` in maven repo, but it doesn't affect the build 
process.
   > 
   > **1. Build Env**
   > 
   > * JDK8
   > * Unix
   > 
   > **2. Commands**
   > 
   > ```
   > git clone https://github.com/apache/incubator-hudi.git
   > mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
   > ```
   
   hi, just now I tried `mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true` in IDEA on Windows, and it still reported "Could not find artifact xmlenc:xmlenc:jar:sources:0.52 in Maven Central (https://repo1.maven.org/maven2)".
   I am using JDK 1.8.
   Can I only run the packaging in a Linux environment?
   Please help me.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-25 Thread GitBox


harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619491671


   > hi @harshi2506, build steps:
   > **1. Build Env**
   > 
   > * JDK8
   > * Unix
   > 
   > **2. Commands**
   > 
   > ```
   > git clone https://github.com/apache/incubator-hudi.git
   > mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
   > ```
   > 
   > **3. Run env**
   > 
   > * Spark-2.4.4+
   > * avro-1.8.0
   > 
   > ```
   > // run in local env
   > export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   > ${SPARK_HOME}/bin/spark-shell \
   >   --driver-memory 6G \
   >   --packages org.apache.spark:spark-avro_2.11:2.4.4 \
   >   --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
   >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   > 
   > // run in yarn env
   > export SPARK_HOME=/BigData/install/spark-2.4.4-bin-hadoop2.7
   > ${SPARK_HOME}/bin/spark-shell \
   >   --master yarn \
   >   --driver-memory 6G \
   >   --executor-memory 6G \
   >   --num-executors 5 \
   >   --executor-cores 5 \
   >   --queue root.default \
   >   --packages org.apache.spark:spark-avro_2.11:2.4.4 \
   >   --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
   >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   > 
   > // scripts
   > import org.apache.spark.sql.functions._
   > 
   > val tableName = "hudi_mor_table"
   > val basePath = "file:///tmp/hudi_mor_tablen"
   > // val basePath = "hdfs:///hudi/test"
   > 
   > val hudiOptions = Map[String,String](
   >   "hoodie.upsert.shuffle.parallelism" -> "10",
   >   "hoodie.datasource.write.recordkey.field" -> "key",
   >   "hoodie.datasource.write.partitionpath.field" -> "dt", 
   >   "hoodie.table.name" -> tableName,
   >   "hoodie.datasource.write.precombine.field" -> "timestamp"
   > )
   > 
   > val inputDF = spark.range(1, 7).
   >   withColumn("key", $"id").
   >   withColumn("data", lit("data")).
   >   withColumn("timestamp", unix_timestamp()).
   >   withColumn("dtstamp", unix_timestamp() + ($"id" * 24 * 3600)).
   >   withColumn("dt", from_unixtime($"dtstamp", "yyyy/MM/dd"))
   > 
   > inputDF.write.format("org.apache.hudi").
   >   options(hudiOptions).
   >   mode("Overwrite").
   >   save(basePath)
   > 
   > spark.read.format("org.apache.hudi").load(basePath + "/*/*/*").show();
   > ```
   
   @lamber-ken will try and let you know. Thanks



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] yanghua commented on pull request #1553: [HUDI-810] Migrate ClientTestHarness to JUnit 5

2020-04-25 Thread GitBox


yanghua commented on pull request #1553:
URL: https://github.com/apache/incubator-hudi/pull/1553#issuecomment-619486621


   @xushiyan There are some conflicting files.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] GSHF commented on issue #1563: [SUPPORT] When I package according to the package command in GitHub, I always report an error, such as

2020-04-25 Thread GitBox


GSHF commented on issue #1563:
URL: https://github.com/apache/incubator-hudi/issues/1563#issuecomment-619480158


   Just now, I tried `mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true` in IDEA on Windows, and it reported "Could not find artifact xmlenc:xmlenc:jar:sources:0.52 in Maven Central" (https://repo1.maven.org/maven2).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] GSHF commented on issue #1563: [SUPPORT] When I package according to the package command in GitHub, I always report an error, such as

2020-04-25 Thread GitBox


GSHF commented on issue #1563:
URL: https://github.com/apache/incubator-hudi/issues/1563#issuecomment-619480090


   Do you have to package on Linux? Is it possible to package in IDEA on Windows?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #259

2020-04-25 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.39 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
or

[GitHub] [incubator-hudi] lamber-ken commented on issue #1563: When I package according to the package command in GitHub, I always report an error, such as

2020-04-25 Thread GitBox


lamber-ken commented on issue #1563:
URL: https://github.com/apache/incubator-hudi/issues/1563#issuecomment-619471157


   Thanks for reporting this issue. You're right, we can't find 
`xmlenc:xmlenc:jar:sources:0.52` in maven repo, but it doesn't affect the build 
process.
   
   **1. Build Env**
   - JDK8
   - Unix
   
   **2. Commands**
   ```
   git clone https://github.com/apache/incubator-hudi.git
   mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] GSHF opened a new issue #1563: When I package according to the package command in GitHub, I always report an error, such as

2020-04-25 Thread GitBox


GSHF opened a new issue #1563:
URL: https://github.com/apache/incubator-hudi/issues/1563


   Could not transfer artifact xmlenc:xmlenc:jar:sources:0.52 from/to Maven 
Central (https://repo1.maven.org/maven2): repo1.maven.org
   
   in "https://repo1.maven.org/maven2"; cannot find 
"xmlenc:xmlenc:jar:sources:0.52",please help me.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-25 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-69:
---
Description: 
[https://github.com/uber/hudi/issues/136]

RFC: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]

WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1]

  was:
[https://github.com/uber/hudi/issues/136]

RFC: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]


> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2020-04-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-309:

Fix Version/s: (was: 0.6.0)

> General Redesign of Archived Timeline for efficient scan and management
> ---
>
> Key: HUDI-309
> URL: https://issues.apache.org/jira/browse/HUDI-309
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived 
> Timeline Notes by Vinoth 2.jpg
>
>
> As designed by Vinoth:
> Goals
>  # Archived Metadata should be scannable in the same way as data
>  # Provides more safety by always serving committed data independent of 
> timeframe when the corresponding commit action was tried. Currently, we 
> implicitly assume a data file to be valid if its commit time is older than 
> the earliest time in the active timeline. While this works ok, any inherent 
> bugs in rollback could inadvertently expose a possibly duplicate file when 
> its commit timestamp becomes older than that of any commits in the timeline.
>  # We had to deal with a lot of corner cases because of the way we treat a 
> "commit" as special after it gets archived. Examples also include the Savepoint 
> handling logic in the cleaner.
>  # Small Files: For cloud stores, archiving simply moves files from one 
> directory to another, causing the archive folder to grow. We need a way to 
> efficiently compact these files and at the same time be friendly to scans.
> Design:
>  The basic file-group abstraction for managing file versions for data files 
> can be extended to managing archived commit metadata. The idea is to use an 
> optimal format (like HFile) for storing a compacted version of <Instant, Metadata> 
> pairs. Every archiving run will read <Instant, Metadata> pairs from the active 
> timeline and append them to indexable log files. We will run periodic minor 
> compactions to merge multiple log files into a compacted HFile storing 
> metadata for a time-range. It should also be noted that we will partition by 
> the action types (commit/clean). This design would allow the archived 
> timeline to be queryable for determining whether a timeline is valid or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-86) Add indexing support to the log file format

2020-04-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-86?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-86:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Add indexing support to the log file format
> ---
>
> Key: HUDI-86
> URL: https://issues.apache.org/jira/browse/HUDI-86
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Index, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: realtime-data-lakes
> Fix For: 0.6.1
>
>
> https://github.com/apache/incubator-hudi/pull/519



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-839) Implement rollbacks using marker files instead of relying on commit metadata

2020-04-25 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-839:
---

 Summary: Implement rollbacks using marker files instead of relying 
on commit metadata
 Key: HUDI-839
 URL: https://issues.apache.org/jira/browse/HUDI-839
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Writer Core
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar
 Fix For: 0.6.0


This is more efficient and avoids the need to cache the input in memory. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2020-04-25 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092430#comment-17092430
 ] 

Vinoth Chandar commented on HUDI-309:
-

Closing the loop here: https://issues.apache.org/jira/browse/HUDI-839 tracks this.

Untagging this from the 0.6.0 release.

> General Redesign of Archived Timeline for efficient scan and management
> ---
>
> Key: HUDI-309
> URL: https://issues.apache.org/jira/browse/HUDI-309
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived 
> Timeline Notes by Vinoth 2.jpg
>
>
> As designed by Vinoth:
> Goals
>  # Archived Metadata should be scannable in the same way as data
>  # Provides more safety by always serving committed data independent of 
> timeframe when the corresponding commit action was tried. Currently, we 
> implicitly assume a data file to be valid if its commit time is older than 
> the earliest time in the active timeline. While this works ok, any inherent 
> bugs in rollback could inadvertently expose a possibly duplicate file when 
> its commit timestamp becomes older than that of any commits in the timeline.
>  # We had to deal with a lot of corner cases because of the way we treat a 
> "commit" as special after it gets archived. Examples also include the Savepoint 
> handling logic in the cleaner.
>  # Small Files: For cloud stores, archiving simply moves files from one 
> directory to another, causing the archive folder to grow. We need a way to 
> efficiently compact these files and at the same time be friendly to scans.
> Design:
>  The basic file-group abstraction for managing file versions for data files 
> can be extended to managing archived commit metadata. The idea is to use an 
> optimal format (like HFile) for storing a compacted version of <Instant, Metadata> 
> pairs. Every archiving run will read <Instant, Metadata> pairs from the active 
> timeline and append them to indexable log files. We will run periodic minor 
> compactions to merge multiple log files into a compacted HFile storing 
> metadata for a time-range. It should also be noted that we will partition by 
> the action types (commit/clean). This design would allow the archived 
> timeline to be queryable for determining whether a timeline is valid or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-25 Thread GitBox


lamber-ken edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619449248


   hi @harshi2506, build steps:
   **1. Build Env**
   - JDK8
   - Unix
   
   **2. Commands**
   ```
   git clone https://github.com/apache/incubator-hudi.git
   mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
   ```
   
   **3. Run env**
   - Spark-2.4.4+
   - avro-1.8.0
   ```
   // run in local env
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
 --driver-memory 6G \
 --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // run in yarn env
   export SPARK_HOME=/BigData/install/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
 --master yarn \
 --driver-memory 6G \
 --executor-memory 6G \
 --num-executors 5 \
 --executor-cores 5 \
 --queue root.default \
 --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // scripts
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   // val basePath = "hdfs:///hudi/test"
   
   val hudiOptions = Map[String,String](
 "hoodie.upsert.shuffle.parallelism" -> "10",
 "hoodie.datasource.write.recordkey.field" -> "key",
 "hoodie.datasource.write.partitionpath.field" -> "dt", 
 "hoodie.table.name" -> tableName,
 "hoodie.datasource.write.precombine.field" -> "timestamp"
   )
   
   val inputDF = spark.range(1, 7).
     withColumn("key", $"id").
     withColumn("data", lit("data")).
     withColumn("timestamp", unix_timestamp()).
     withColumn("dtstamp", unix_timestamp() + ($"id" * 24 * 3600)).
     withColumn("dt", from_unixtime($"dtstamp", "yyyy/MM/dd"))
   
   inputDF.write.format("org.apache.hudi").
 options(hudiOptions).
 mode("Overwrite").
 save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*/*").show();
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-25 Thread GitBox


lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619449248


   hi @harshi2506, build steps:
   **1. Build Env**
   - JDK8
   - Unix
   
   **2. Commands**
   ```
   git clone https://github.com/apache/incubator-hudi.git
   mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
   
   ```
   
   **3. Run env**
   - Spark-2.4.4+
   - avro-1.8.0
   ```
   // run in local env
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
 --driver-memory 6G \
 --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // run in yarn env
   export SPARK_HOME=/BigData/install/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
 --master yarn \
 --driver-memory 6G \
 --executor-memory 6G \
 --num-executors 5 \
 --executor-cores 5 \
 --queue root.default \
 --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // scripts
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   // val basePath = "hdfs:///hudi/test"
   
   val hudiOptions = Map[String,String](
 "hoodie.upsert.shuffle.parallelism" -> "10",
 "hoodie.datasource.write.recordkey.field" -> "key",
 "hoodie.datasource.write.partitionpath.field" -> "dt", 
 "hoodie.table.name" -> tableName,
 "hoodie.datasource.write.precombine.field" -> "timestamp"
   )
   
   val inputDF = spark.range(1, 7).
     withColumn("key", $"id").
     withColumn("data", lit("data")).
     withColumn("timestamp", unix_timestamp()).
     withColumn("dtstamp", unix_timestamp() + ($"id" * 24 * 3600)).
     withColumn("dt", from_unixtime($"dtstamp", "yyyy/MM/dd"))
   
   inputDF.write.format("org.apache.hudi").
 options(hudiOptions).
 mode("Overwrite").
 save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*/*").show();
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] n3nash commented on issue #1549: Potential issue when using Deltastreamer with DMS

2020-04-25 Thread GitBox


n3nash commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-619420599


   @PhatakN1 So this is what is possibly happening : 
   
   1) Hard deletes in Hudi are only supported by following a certain contract 
with your payload. Your payload implementation should carry an "empty" record 
value, something like this -> 
https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/EmptyHoodieRecordPayload.java
 which is supported out of the box by HoodieWriteClient here -> 
https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L307.
   
   To simulate this with your own implementation (AwsDMSPayload), you can 
override the methods `getInsertValue` and `combineAndGetUpdateValue` (take a 
look at the EmptyHoodieRecordPayload above; see also the sketch after point 2 below).
   
   2) In your specific use-case, since your payload isn't an "empty" payload, 
even though the MERGE on the realtime query is happening through the 
implementation of your payload, Hudi doesn't know whether this is a "hard 
delete" or a "soft delete" - the only way for Hudi to know it's a hard delete 
is the way I described above. 
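   
   For illustration only, here is a minimal sketch of such a payload. The class name, the DMS `Op` column, and the exact package paths are assumptions, not code from this thread — adapt them to the Hudi version you build against:
   
   ```java
   import java.io.IOException;
   
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
   import org.apache.hudi.common.util.Option;
   
   // Hypothetical payload: rows flagged as DMS deletes (Op == "D") are turned into
   // hard deletes by returning an empty value, mirroring EmptyHoodieRecordPayload.
   public class AwsDmsDeletePayload extends OverwriteWithLatestAvroPayload {
   
     public AwsDmsDeletePayload(GenericRecord record, Comparable orderingVal) {
       super(record, orderingVal);
     }
   
     // Option.empty() tells Hudi to drop the record entirely (hard delete).
     private Option<IndexedRecord> dropIfDelete(IndexedRecord record) {
       Object op = ((GenericRecord) record).get("Op"); // assumed DMS operation column
       return "D".equals(String.valueOf(op)) ? Option.empty() : Option.of(record);
     }
   
     @Override
     public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
       Option<IndexedRecord> value = super.getInsertValue(schema);
       return value.isPresent() ? dropIfDelete(value.get()) : Option.empty();
     }
   
     @Override
     public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
         throws IOException {
       // The incoming record wins; if it is a delete row, signal a hard delete.
       return getInsertValue(schema);
     }
   }
   ```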
   
   Let me know if you have further questions 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on pull request #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-25 Thread GitBox


vinothchandar commented on pull request #1457:
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-619414521


   @n3nash @prashantwason @bvaradar Noticed that the checks are not performed for bulk_insert. Isn't this a problem? Why exclude bulk_insert?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on pull request #1562: [HUDI-837]: implemented custom deserializer for AvroKafkaSource

2020-04-25 Thread GitBox


vinothchandar commented on pull request #1562:
URL: https://github.com/apache/incubator-hudi/pull/1562#issuecomment-619409517


   @afilipchik @umehrot2   help review this? :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #765: [WIP] Fix KafkaAvroSource to use the latest schema

2020-04-25 Thread GitBox


pratyakshsharma commented on pull request #765:
URL: https://github.com/apache/incubator-hudi/pull/765#issuecomment-619396688


   @vinothchandar I raised https://github.com/apache/incubator-hudi/pull/1562 
for this feature. I guess we can close this PR then. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1562: [HUDI-837]: implemented custom deserializer for AvroKafkaSource

2020-04-25 Thread GitBox


pratyakshsharma commented on pull request #1562:
URL: https://github.com/apache/incubator-hudi/pull/1562#issuecomment-619396502


   Thinking of writing test cases for this, but unable to simulate the flow because AbstractKafkaAvroDeserializer expects a working schema-registry URL. Not sure how to mock it here since it is a library class.
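   
   One possible approach (not from this thread): Confluent ships an in-memory MockSchemaRegistryClient, and both the serializer and the deserializer can be constructed with it directly, so no real schema-registry URL is needed. A minimal sketch — the topic, subject, and record schema below are made up:
   
   ```java
   import io.confluent.kafka.schemaregistry.client.MockSchemaRegistryClient;
   import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
   import io.confluent.kafka.serializers.KafkaAvroDeserializer;
   import io.confluent.kafka.serializers.KafkaAvroSerializer;
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericRecord;
   
   public class MockRegistryRoundTrip {
     public static void main(String[] args) throws Exception {
       // In-memory registry; no HTTP endpoint involved.
       SchemaRegistryClient registry = new MockSchemaRegistryClient();
   
       Schema schema = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"Row\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
       registry.register("test-topic-value", schema);
   
       GenericRecord record = new GenericData.Record(schema);
       record.put("id", 1L);
   
       // Serializer and deserializer share the same mock client instead of a URL.
       byte[] bytes = new KafkaAvroSerializer(registry).serialize("test-topic", record);
       Object decoded = new KafkaAvroDeserializer(registry).deserialize("test-topic", bytes);
       System.out.println(decoded);
     }
   }
   ```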



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-837) Fix AvroKafkaSource to use the latest schema for reading

2020-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-837:

Labels: pull-request-available  (was: )

> Fix AvroKafkaSource to use the latest schema for reading
> 
>
> Key: HUDI-837
> URL: https://issues.apache.org/jira/browse/HUDI-837
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently we specify KafkaAvroDeserializer as the value for 
> value.deserializer in AvroKafkaSource. This implies the published record is 
> read using the same schema with which it was written, even though the schema 
> may have evolved in between. As a result, messages in an incoming batch can have 
> different schemas, which has to be handled at the time of actually writing 
> records to Parquet. 
> This Jira aims at providing an option to read all the messages with the same 
> schema by implementing a new custom deserializer class. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma opened a new pull request #1562: [HUDI-837]: implemented custom deserializer for AvroKafkaSource

2020-04-25 Thread GitBox


pratyakshsharma opened a new pull request #1562:
URL: https://github.com/apache/incubator-hudi/pull/1562


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
    - When we read data from Kafka, we want to always read with the latest schema.
    - This allows us to assume throughout the rest of the pipeline that every record has the same schema.
    - We create a custom KafkaAvroDecoder that uses the latest schema as the reader schema.
    - This does not work with all SchemaProviders yet.
   
   ## Brief change log
   
   - Implemented HoodieAvroKafkaDeserializer for supplying the readerSchema as per the user's need (a simplified sketch of the idea follows below).
   - Introduced a property to configure the "value.deserializer" setting for AvroKafkaSource.
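   
   This PR's actual code is not reproduced here; the following is only a minimal sketch of the underlying idea (resolve every message's writer schema against one fixed, "latest" reader schema), with an illustrative class name and constructor. The Confluent wire-format and schema-registry handling done by the real HoodieAvroKafkaDeserializer is omitted:
   
   ```java
   import java.io.IOException;
   import java.util.Map;
   
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.BinaryDecoder;
   import org.apache.avro.io.DecoderFactory;
   import org.apache.kafka.common.serialization.Deserializer;
   
   // Illustrative only: decodes plain Avro-encoded bytes, always resolving the
   // writer schema against a single "latest" reader schema so every record that
   // leaves the source carries the same schema.
   public class LatestSchemaAvroDeserializer implements Deserializer<GenericRecord> {
   
     private final Schema writerSchema;
     private final Schema readerSchema; // the latest schema, e.g. from a SchemaProvider
   
     public LatestSchemaAvroDeserializer(Schema writerSchema, Schema readerSchema) {
       this.writerSchema = writerSchema;
       this.readerSchema = readerSchema;
     }
   
     @Override
     public void configure(Map<String, ?> configs, boolean isKey) {
       // no-op in this sketch; the real deserializer reads its config here
     }
   
     @Override
     public GenericRecord deserialize(String topic, byte[] data) {
       if (data == null) {
         return null;
       }
       try {
         GenericDatumReader<GenericRecord> reader =
             new GenericDatumReader<>(writerSchema, readerSchema);
         BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(data, null);
         return reader.read(null, decoder);
       } catch (IOException e) {
         throw new RuntimeException("Failed to deserialize record from topic " + topic, e);
       }
     }
   
     @Override
     public void close() {
       // nothing to release
     }
   }
   ```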
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch asf-site updated: Travis CI build asf-site

2020-04-25 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 609d5bf  Travis CI build asf-site
609d5bf is described below

commit 609d5bf8c3d0a1f4461ff2e4aa548daceedd11d2
Author: CI 
AuthorDate: Sat Apr 25 13:14:10 2020 +

Travis CI build asf-site
---
 content/assets/js/lunr/lunr-store.js   | 2 +-
 content/cn/docs/quick-start-guide.html | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/assets/js/lunr/lunr-store.js 
b/content/assets/js/lunr/lunr-store.js
index 1d0335b..f690419 100644
--- a/content/assets/js/lunr/lunr-store.js
+++ b/content/assets/js/lunr/lunr-store.js
@@ -545,7 +545,7 @@ var store = [{
 "url": "https://hudi.apache.org/docs/oss_hoodie.html";,
 "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
 "title": "Quick-Start Guide",
-
"excerpt":"本指南通过使用spark-shell简要介绍了Hudi功能。使用Spark数据源,我们将通过代码段展示如何插入和更新的Hudi默认存储类型数据集:
 写时复制。每次写操作之后,我们还将展示如何读取快照和增量读取数据。 设置spark-shell 
Hudi适用于Spark-2.x版本。您可以按照此处的说明设置spark。 在提取的目录中,使用spark-shell运行Hudi: 
bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating 
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' 
设置表名、基本路径和数据生成器来为本指南生成记录。 import org.apache.hudi.QuickstartUtils._ import 
scala.collection.JavaConversions._ import org.apache.spark.sql. [...]
+
"excerpt":"本指南通过使用spark-shell简要介绍了Hudi功能。使用Spark数据源,我们将通过代码段展示如何插入和更新Hudi的默认存储类型数据集:
 写时复制。每次写操作之后,我们还将展示如何读取快照和增量数据。 设置spark-shell 
Hudi适用于Spark-2.x版本。您可以按照此处的说明设置spark。 在提取的目录中,使用spark-shell运行Hudi: 
bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating 
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' 
设置表名、基本路径和数据生成器来为本指南生成记录。 import org.apache.hudi.QuickstartUtils._ import 
scala.collection.JavaConversions._ import org.apache.spark.sql.Sa [...]
 "tags": [],
 "url": "https://hudi.apache.org/cn/docs/quick-start-guide.html";,
 "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
diff --git a/content/cn/docs/quick-start-guide.html 
b/content/cn/docs/quick-start-guide.html
index 3dcd47b..f1bd106 100644
--- a/content/cn/docs/quick-start-guide.html
+++ b/content/cn/docs/quick-start-guide.html
@@ -4,7 +4,7 @@
 
 
 Quick-Start Guide - Apache Hudi
-
+
 
 
 
@@ -13,7 +13,7 @@
 https://hudi.apache.org/cn/docs/quick-start-guide.html";>
 
 
-  
+  
 
 
 
@@ -346,8 +346,8 @@
   
 
 
-
本指南通过使用spark-shell简要介绍了Hudi功能。使用Spark数据源,我们将通过代码段展示如何插入和更新的Hudi默认存储类型数据集:
-写时复制。每次写操作之后,我们还将展示如何读取快照和增量读取数据。
+
本指南通过使用spark-shell简要介绍了Hudi功能。使用Spark数据源,我们将通过代码段展示如何插入和更新Hudi的默认存储类型数据集:
+写时复制。每次写操作之后,我们还将展示如何读取快照和增量数据。
 
 设置spark-shell
 Hudi适用于Spark-2.x版本。您可以按照https://spark.apache.org/downloads.html";>此处的说明设置spark。



[incubator-hudi] branch asf-site updated: [MINOR] revise translation (#1561)

2020-04-25 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new ba4a236  [MINOR] revise translation (#1561)
ba4a236 is described below

commit ba4a23652fe1f52f9a9eada4f6d5feac3374f961
Author: wanglisheng81 <37138788+wanglishen...@users.noreply.github.com>
AuthorDate: Sat Apr 25 21:12:16 2020 +0800

[MINOR] revise translation (#1561)
---
 docs/_docs/1_1_quick_start_guide.cn.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/_docs/1_1_quick_start_guide.cn.md 
b/docs/_docs/1_1_quick_start_guide.cn.md
index 7404bb8..f20e212 100644
--- a/docs/_docs/1_1_quick_start_guide.cn.md
+++ b/docs/_docs/1_1_quick_start_guide.cn.md
@@ -6,8 +6,8 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 language: cn
 ---
 
-本指南通过使用spark-shell简要介绍了Hudi功能。使用Spark数据源,我们将通过代码段展示如何插入和更新的Hudi默认存储类型数据集:
-[写时复制](/cn/docs/concepts.html#copy-on-write-storage)。每次写操作之后,我们还将展示如何读取快照和增量读取数据。
 
+本指南通过使用spark-shell简要介绍了Hudi功能。使用Spark数据源,我们将通过代码段展示如何插入和更新Hudi的默认存储类型数据集:
+[写时复制](/cn/docs/concepts.html#copy-on-write-storage)。每次写操作之后,我们还将展示如何读取快照和增量数据。
 
 
 ## 设置spark-shell
 
Hudi适用于Spark-2.x版本。您可以按照[此处](https://spark.apache.org/downloads.html)的说明设置spark。



[GitHub] [incubator-hudi] wanglisheng81 opened a new pull request #1561: [MINOR] revise translation

2020-04-25 Thread GitBox


wanglisheng81 opened a new pull request #1561:
URL: https://github.com/apache/incubator-hudi/pull/1561


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Revise translation in quick start*
   
   ## Brief change log
   
   *Revise translation in quick start*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] wanglisheng81 opened a new pull request #1560: Revise translation

2020-04-25 Thread GitBox


wanglisheng81 opened a new pull request #1560:
URL: https://github.com/apache/incubator-hudi/pull/1560


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Revise translation in quick start*
   
   ## Brief change log
   
   *Revise translation in quick start*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] leesf commented on pull request #1044: [HUDI-361] Implement CSV metrics reporter

2020-04-25 Thread GitBox


leesf commented on pull request #1044:
URL: https://github.com/apache/incubator-hudi/pull/1044#issuecomment-619355710


   > hi @XuQianJin-Stars, thanks for your contribution. The CSV metrics reporter does not seem popular in production, so I suggest closing this, WDYT? @leesf
   > 
   > Also, Flink doesn't use a CSV reporter:
   > https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#reporter
   
   Sorry for the late response. +1 to close the PR.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org