[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972929
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
 
 Review comment:
  Also, I need help here to add context on how Spark SQL integrates with Spark and Hive. Thanks!
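
As a possible starting point for that context, here is a minimal spark-shell style sketch of the flow described in the hunk above: Spark SQL handles planning and execution, while `spark.sql.hive.convertMetastoreParquet=false` makes the read path fall back to the Hive SerDe and Hudi's custom input format registered in the metastore. The table name `hudi_trips_mor_rt` is taken from the example in the diff; the rest of the setup is an assumption.

```scala
// Sketch only: assumes the table was synced to the Hive metastore by Hudi
// and that hudi-spark-bundle is already on the driver/executor classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-spark-sql-sketch")
  .enableHiveSupport()                                        // read table definitions from the Hive metastore
  .config("spark.sql.hive.convertMetastoreParquet", "false")  // fall back to Hive SerDe + Hudi input format
  .getOrCreate()

// Spark SQL plans and executes the query; record reading goes through Hudi's input format.
spark.sql("select count(*) from hudi_trips_mor_rt where datestr = '2016-10-02'").show()
```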


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972858
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
 Review comment:
   Agreed. I moved this section below Spark SQL. I need some help here with 
adding additional context though. 
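
For the additional context, one option is a short, hedged sketch of the datasource path for a COPY_ON_WRITE table; the `spark.read.format("org.apache.hudi")` glob example is lifted from the earlier version of this page shown in the diff, and the base path below is only a placeholder.

```scala
// Sketch only: snapshot query of a COPY_ON_WRITE table via the Hudi datasource.
// "/path/to/hudi_trips_cow" is a placeholder base path; the glob can span
// hudi and non-hudi tables, as the existing docs note.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-datasource-sketch").getOrCreate()

val snapshotDF = spark.read
  .format("org.apache.hudi")                 // provided by hudi-spark-bundle on the classpath
  .load("/path/to/hudi_trips_cow/*/*/*/*")   // pass any path glob down to the partitions
snapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select count(*) from hudi_trips_snapshot").show()
```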




[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972254
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -145,8 +161,13 @@ Additionally, `HoodieReadClient` offers the following 
functionality using Hudi's
 | filterExists() | Filter out already existing records from the provided 
RDD[HoodieRecord]. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
 
+### Read optimized query
+
+For read optimized queries, either Hive SerDe can be used by turning off 
convertMetastoreParquet as described above or Spark's built in support can be 
leveraged. 
 
 Review comment:
   done
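
To make the second option concrete, here is a minimal sketch of the Spark built-in route, reusing the `HoodieROTablePathFilter` snippet that the earlier version of this page showed; the glob path is a placeholder and a spark-shell session `spark` is assumed.

```scala
// Sketch only: read optimized query via Spark's built-in parquet reader.
// The path filter hides file versions that are not part of the latest commit,
// while Spark's vectorized parquet reading is retained.
spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
  classOf[org.apache.hadoop.fs.PathFilter])

val roDF = spark.read.parquet("/path/to/hudi_table/*/*/*/*")  // placeholder glob down to the partitions
roDF.createOrReplaceTempView("hudi_trips_ro")
spark.sql("select count(*) from hudi_trips_ro").show()
```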




[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972257
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -145,8 +161,13 @@ Additionally, `HoodieReadClient` offers the following 
functionality using Hudi's
 | filterExists() | Filter out already existing records from the provided 
RDD[HoodieRecord]. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
 
+### Read optimized query
+
+For read optimized queries, either Hive SerDe can be used by turning off 
convertMetastoreParquet as described above or Spark's built in support can be 
leveraged. 
+If using spark's built in support, additionally a path filter needs to be 
pushed into sparkContext as described earlier.
 
 ## Presto
 
-Presto is a popular query engine, providing interactive query performance. 
Presto currently supports only read optimized queries on Hudi tables. 
-This requires the `hudi-presto-bundle` jar to be placed into 
`/plugin/hive-hadoop2/`, across the installation.
+Presto is a popular query engine, providing interactive query performance. 
Presto currently supports snapshot queries on
+COPY_On_WRITE and read optimized queries on MERGE_ON_READ Hudi tables. This 
requires the `hudi-presto-bundle` jar 
 
 Review comment:
   will fix




[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972249
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
+as described above via --jars or --packages.
  
-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
-This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do 
something like below to get a Spark dataframe to work with. 
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
+However, for MERGE_ON_READ tables which has both parquet and avro data, this 
default setting needs to be turned off using set 
`spark.sql.hive.convertMetastoreParquet=false`. 
+This will force Spark to fallback to using the Hive Serde to read the data 
(planning/executions is still Spark). 
 
 ```java
-Dataset hoodieROViewDF = spark.read().format("org.apache.hudi")
-// pass any path glob, can include hudi & non-hudi tables
-.load("/glob/path/pattern");
+$ spark-shell --driver-class-path /etc/hive/conf  --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 
--driver-memory 7g --executor-memory 2g  --master yarn-client
+
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = 
'2016-10-02'").show()
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = 
'2016-10-02'").show()
 ```
- 
-### Snapshot query {#spark-snapshot-query}
-Currently, near-real time data can only be queried as a Hive table in Spark 
using snapshot query mode. In order to do this, set 
`spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback 
-to using the Hive Serde to read the data (planning/executions is still Spark). 
 
-```java
-$ spark-shell --jars hudi-spark-bundle_2.11-x.y.z-SNAPSHOT.jar 
--driver-class-path /etc/hive/conf  --packages 
org.apache.spark:spark-avro_2.11:2.4.4 --conf 
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 
7g --executor-memory 2g  --master yarn-client
+For COPY_ON_WRITE tables, either Hive SerDe can be used by turning off 
convertMetastoreParquet as described above or Spark's built in support can be 
leveraged. 
 
 Review comment:
   okay sure.
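
To complement the snapshot example, a hedged sketch of the incremental path on a COPY_ON_WRITE table could look like this; the option constants are the DataSourceReadOptions ones used elsewhere in this PR, while the commit time and base path are placeholders.

```scala
// Sketch only: incremental query via the Hudi datasource (assumes a spark-shell session `spark`
// with hudi-spark-bundle on the classpath).
import org.apache.hudi.DataSourceReadOptions

val beginInstantTime = "20200220094506"  // placeholder: some earlier commit time on the table

val incrementalDF = spark.read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, beginInstantTime)
  .load("/path/to/hudi_trips_cow")         // base path of the table
incrementalDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select _hoodie_commit_time, count(*) from hudi_trips_incremental group by _hoodie_commit_time").show()
```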



[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972232
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
 
 Review comment:
   sure.




[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972240
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
+as described above via --jars or --packages.
  
-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
-This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do 
something like below to get a Spark dataframe to work with. 
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
 
 Review comment:
   sure.




[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972235
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
 
 Review comment:
   done.




[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972005
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
 
 Review comment:
  This is pointed out at the end of the current paragraph already (line 111).




[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382971414
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
+as described above via --jars or --packages.
  
-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
-This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do 
something like below to get a Spark dataframe to work with. 
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
 
 Review comment:
  No. By 'by default' I meant Spark's own default behavior; I did not refer to the COPY_ON_WRITE table when I used 'by default'. If it causes ambiguity, we can rephrase it. Let me know.




[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-22 Thread GitBox
bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382970890
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -9,7 +9,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark and Presto.
+bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark datasource, Spark SQL and Presto.
 
 Review comment:
   +1 Created https://issues.apache.org/jira/browse/HUDI-630




[jira] [Updated] (HUDI-630) Add Impala support in querying page

2020-02-22 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-630:
---
Status: Open  (was: New)

> Add Impala support in querying page 
> 
>
> Key: HUDI-630
> URL: https://issues.apache.org/jira/browse/HUDI-630
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, docs-chinese
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>
> After the next release of Impala (which supports Hudi), the Hudi docs querying data page needs to be updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-630) Add Impala support in querying page

2020-02-22 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-630:
---
Priority: Minor  (was: Major)

> Add Impala support in querying page 
> 
>
> Key: HUDI-630
> URL: https://issues.apache.org/jira/browse/HUDI-630
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, docs-chinese
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>
> After the next release of Impala (which supports Hudi), the Hudi docs querying data page needs to be updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-238) Make separate release for hudi spark/scala based packages for scala 2.12

2020-02-22 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-238:
--

Assignee: Tadas Sugintas  (was: Bhavani Sudha)

> Make separate release for hudi spark/scala based packages for scala 2.12 
> -
>
> Key: HUDI-238
> URL: https://issues.apache.org/jira/browse/HUDI-238
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Release & Administrative, Usability
>Reporter: Balaji Varadarajan
>Assignee: Tadas Sugintas
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/881#issuecomment-528700749]
> Suspects: 
> h3. Hudi utilities package 
> bringing in spark-streaming-kafka-0.8* 
> {code:java}
> [INFO] Scanning for projects...
> [INFO] 
> [INFO] ---< org.apache.hudi:hudi-utilities 
> >---
> [INFO] Building hudi-utilities 0.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [INFO] 
> [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-utilities 
> ---
> [INFO] org.apache.hudi:hudi-utilities:jar:0.5.0-SNAPSHOT
> [INFO] ...
> [INFO] +- org.apache.hudi:hudi-client:jar:0.5.0-SNAPSHOT:compile
>...
> [INFO] 
> [INFO] +- org.apache.hudi:hudi-spark:jar:0.5.0-SNAPSHOT:compile
> [INFO] |  \- org.scala-lang:scala-library:jar:2.11.8:compile
> [INFO] +- log4j:log4j:jar:1.2.17:compile
>...
> [INFO] +- org.apache.spark:spark-core_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  +- com.twitter:chill_2.11:jar:0.8.0:provided
> [INFO] |  +- com.twitter:chill-java:jar:0.8.0:provided
> [INFO] |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  +- org.apache.spark:spark-launcher_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-common_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-shuffle_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-unsafe_2.11:jar:2.1.0:provided
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:provided
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:provided
> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.5:provided
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:provided
> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:provided
> [INFO] |  +- org.slf4j:slf4j-api:jar:1.7.16:compile
> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.16:compile
> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:provided
> [INFO] |  +- org.xerial.snappy:snappy-java:jar:1.1.2.6:compile
> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.3.0:compile
> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:provided
> [INFO] |  +- commons-net:commons-net:jar:2.2:provided
>
> [INFO] +- org.apache.spark:spark-sql_2.11:jar:2.1.0:provided
> [INFO] |  +- com.univocity:univocity-parsers:jar:2.2.1:provided
> [INFO] |  +- org.apache.spark:spark-sketch_2.11:jar:2.1.0:provided
> [INFO] |  \- org.apache.spark:spark-catalyst_2.11:jar:2.1.0:provided
> [INFO] | +- org.codehaus.janino:janino:jar:3.0.0:provided
> [INFO] | +- org.codehaus.janino:commons-compiler:jar:3.0.0:provided
> [INFO] | \- org.antlr:antlr4-runtime:jar:4.5.3:provided
> [INFO] +- com.databricks:spark-avro_2.11:jar:4.0.0:provided
> [INFO] +- org.apache.spark:spark-streaming_2.11:jar:2.1.0:compile
> [INFO] +- org.apache.spark:spark-streaming-kafka-0-8_2.11:jar:2.1.0:compile
> [INFO] |  \- org.apache.kafka:kafka_2.11:jar:0.8.2.1:compile
> [INFO] | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.2:compile
> [INFO] | +- 
> org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.2:compile
> [INFO] | \- org.apache.kafka:kafka-clients:jar:0.8.2.1:compile
> [INFO] +- io.dropwizard.metrics:metrics-core:jar:4.0.2:compile
> [INFO] +- org.antlr:stringtemplate:jar:4.0.2:compile
> [INFO] |  \- org.antlr:antlr-runtime:jar:3.3:compile
> [INFO] +- com.beust:jcommander:jar:1.72:compile
> [INFO] +- com.twitter:bijection-avro_2.11:jar:0.9.2:compile
> [INFO] |  \- com.twitter:bijection-core_2.11:jar:0.9.2:compile
> [INFO] +- io.confluent:kafka-avro-serializer:jar:3.0.0:compile
> [INFO] +- io.confluent:common-config:jar:3.0.0:compile
> [INFO] +- io.confluent:common-utils:jar:3.0.0:compile
> [INFO] |  \- com.101tec:zkclient:jar:0.5:compile
> [INFO] +- 

[jira] [Assigned] (HUDI-630) Add Impala support in querying page

2020-02-22 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-630:
--

Assignee: Bhavani Sudha  (was: vinoyang)

> Add Impala support in querying page 
> 
>
> Key: HUDI-630
> URL: https://issues.apache.org/jira/browse/HUDI-630
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, docs-chinese
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>
> After the next release of Impala (which supports Hudi), the Hudi docs querying data page needs to be updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-238) Make separate release for hudi spark/scala based packages for scala 2.12

2020-02-22 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-238:
--

Assignee: Bhavani Sudha  (was: Tadas Sugintas)

> Make separate release for hudi spark/scala based packages for scala 2.12 
> -
>
> Key: HUDI-238
> URL: https://issues.apache.org/jira/browse/HUDI-238
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Release & Administrative, Usability
>Reporter: Balaji Varadarajan
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/881#issuecomment-528700749]
> Suspects: 
> h3. Hudi utilities package 
> bringing in spark-streaming-kafka-0.8* 
> {code:java}
> [INFO] Scanning for projects...
> [INFO] 
> [INFO] ---< org.apache.hudi:hudi-utilities 
> >---
> [INFO] Building hudi-utilities 0.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [INFO] 
> [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-utilities 
> ---
> [INFO] org.apache.hudi:hudi-utilities:jar:0.5.0-SNAPSHOT
> [INFO] ...
> [INFO] +- org.apache.hudi:hudi-client:jar:0.5.0-SNAPSHOT:compile
>...
> [INFO] 
> [INFO] +- org.apache.hudi:hudi-spark:jar:0.5.0-SNAPSHOT:compile
> [INFO] |  \- org.scala-lang:scala-library:jar:2.11.8:compile
> [INFO] +- log4j:log4j:jar:1.2.17:compile
>...
> [INFO] +- org.apache.spark:spark-core_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  +- com.twitter:chill_2.11:jar:0.8.0:provided
> [INFO] |  +- com.twitter:chill-java:jar:0.8.0:provided
> [INFO] |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  +- org.apache.spark:spark-launcher_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-common_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-shuffle_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-unsafe_2.11:jar:2.1.0:provided
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:provided
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:provided
> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.5:provided
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:provided
> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:provided
> [INFO] |  +- org.slf4j:slf4j-api:jar:1.7.16:compile
> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.16:compile
> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:provided
> [INFO] |  +- org.xerial.snappy:snappy-java:jar:1.1.2.6:compile
> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.3.0:compile
> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:provided
> [INFO] |  +- commons-net:commons-net:jar:2.2:provided
>
> [INFO] +- org.apache.spark:spark-sql_2.11:jar:2.1.0:provided
> [INFO] |  +- com.univocity:univocity-parsers:jar:2.2.1:provided
> [INFO] |  +- org.apache.spark:spark-sketch_2.11:jar:2.1.0:provided
> [INFO] |  \- org.apache.spark:spark-catalyst_2.11:jar:2.1.0:provided
> [INFO] | +- org.codehaus.janino:janino:jar:3.0.0:provided
> [INFO] | +- org.codehaus.janino:commons-compiler:jar:3.0.0:provided
> [INFO] | \- org.antlr:antlr4-runtime:jar:4.5.3:provided
> [INFO] +- com.databricks:spark-avro_2.11:jar:4.0.0:provided
> [INFO] +- org.apache.spark:spark-streaming_2.11:jar:2.1.0:compile
> [INFO] +- org.apache.spark:spark-streaming-kafka-0-8_2.11:jar:2.1.0:compile
> [INFO] |  \- org.apache.kafka:kafka_2.11:jar:0.8.2.1:compile
> [INFO] | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.2:compile
> [INFO] | +- 
> org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.2:compile
> [INFO] | \- org.apache.kafka:kafka-clients:jar:0.8.2.1:compile
> [INFO] +- io.dropwizard.metrics:metrics-core:jar:4.0.2:compile
> [INFO] +- org.antlr:stringtemplate:jar:4.0.2:compile
> [INFO] |  \- org.antlr:antlr-runtime:jar:3.3:compile
> [INFO] +- com.beust:jcommander:jar:1.72:compile
> [INFO] +- com.twitter:bijection-avro_2.11:jar:0.9.2:compile
> [INFO] |  \- com.twitter:bijection-core_2.11:jar:0.9.2:compile
> [INFO] +- io.confluent:kafka-avro-serializer:jar:3.0.0:compile
> [INFO] +- io.confluent:common-config:jar:3.0.0:compile
> [INFO] +- io.confluent:common-utils:jar:3.0.0:compile
> [INFO] |  \- com.101tec:zkclient:jar:0.5:compile
> [INFO] +- 

[jira] [Created] (HUDI-630) Add Impala support in querying page

2020-02-22 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-630:
--

 Summary: Add Impala support in querying page 
 Key: HUDI-630
 URL: https://issues.apache.org/jira/browse/HUDI-630
 Project: Apache Hudi (incubating)
  Issue Type: Task
  Components: Docs, docs-chinese
Reporter: Bhavani Sudha
Assignee: vinoyang


After the next release of Impala (which supports Hudi), the Hudi docs querying data page needs to be updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-22 Thread GitBox
vinothchandar commented on a change in pull request #1348: HUDI-597 Enable 
incremental pulling from defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348#discussion_r38294
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala
 ##
 @@ -100,17 +101,22 @@ class IncrementalRelation(val sqlContext: SQLContext,
 .get, classOf[HoodieCommitMetadata])
   fileIdToFullPath ++= metadata.getFileIdAndFullPaths(basePath).toMap
 }
+val pathGlobPattern = 
optParams.getOrElse(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "")
+val filteredFullPath = if(!pathGlobPattern.equals("")) {
 
 Review comment:
  Here we would compare against the default constant.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-22 Thread GitBox
vinothchandar commented on a change in pull request #1348: HUDI-597 Enable 
incremental pulling from defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348#discussion_r382966852
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala
 ##
 @@ -100,17 +101,22 @@ class IncrementalRelation(val sqlContext: SQLContext,
 .get, classOf[HoodieCommitMetadata])
   fileIdToFullPath ++= metadata.getFileIdAndFullPaths(basePath).toMap
 }
+val pathGlobPattern = 
optParams.getOrElse(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "")
+val filteredFullPath = if(!pathGlobPattern.equals("")) {
+  val globMatcher = new GlobPattern("*" + pathGlobPattern)
 
 Review comment:
  Should we leave the `*` to the user, i.e. let the user pass in `*` if needed? Or is it needed for the matching?
  
  I am not familiar with this class per se.
  
  Also, http://hadoop.apache.org/docs/r2.8.0/api/allclasses-noframe.html does not seem to list `GlobPattern`; is this class still around? I was a bit confused by that.
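
For what it's worth, `org.apache.hadoop.fs.GlobPattern` does still exist in hadoop-common; it is likely missing from that Javadoc index because it is not part of the public API surface. A quick sketch (path and pattern made up) to check whether the `*` prefix is needed for matching full file paths:

```scala
// Quick check (sketch): does the user glob need the "*" prefix to match full file paths?
import org.apache.hadoop.fs.GlobPattern

val fullPath = "hdfs://ns/tables/trips/2016/10/02/abc123_1-0-1_20200220.parquet"  // made-up path
val userGlob = "/2016/*/*/*"                                                       // made-up pattern

val withStar    = new GlobPattern("*" + userGlob)  // what this change currently builds
val withoutStar = new GlobPattern(userGlob)        // what the user-supplied glob alone would match

println(s"with '*' prefix   : ${withStar.matches(fullPath)}")
println(s"without '*' prefix: ${withoutStar.matches(fullPath)}")
```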




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-22 Thread GitBox
vinothchandar commented on a change in pull request #1348: HUDI-597 Enable 
incremental pulling from defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348#discussion_r382966705
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala
 ##
 @@ -100,17 +101,22 @@ class IncrementalRelation(val sqlContext: SQLContext,
 .get, classOf[HoodieCommitMetadata])
   fileIdToFullPath ++= metadata.getFileIdAndFullPaths(basePath).toMap
 }
+val pathGlobPattern = 
optParams.getOrElse(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "")
+val filteredFullPath = if(!pathGlobPattern.equals("")) {
+  val globMatcher = new GlobPattern("*" + pathGlobPattern)
+  fileIdToFullPath.filter(p => globMatcher.matches(p._2))
+} else fileIdToFullPath
 
 Review comment:
   please enclose within braces for readability. 
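
For illustration, the braced form could look like this (names taken from the hunk above, logic unchanged):

```scala
// Same logic as the hunk under review, only reformatted with braces for readability.
val filteredFullPath = if (!pathGlobPattern.equals("")) {
  val globMatcher = new GlobPattern("*" + pathGlobPattern)
  fileIdToFullPath.filter(p => globMatcher.matches(p._2))
} else {
  fileIdToFullPath
}
```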




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-22 Thread GitBox
vinothchandar commented on a change in pull request #1348: HUDI-597 Enable 
incremental pulling from defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348#discussion_r382966594
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala
 ##
 @@ -84,7 +85,7 @@ class IncrementalRelation(val sqlContext: SQLContext,
 
   val filters = {
 if 
(optParams.contains(DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY)) {
-  val filterStr = 
optParams.get(DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY).getOrElse("")
+  val filterStr = 
optParams.getOrElse(DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY, "")
 
 Review comment:
   can we move the `""` default to DataSourceOptions, to keep it consistent 
with how the other options are defined
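
A hedged sketch of the pattern being suggested; the object name, key string, and default constant below are stand-ins for the real DataSourceReadOptions members, not the actual ones.

```scala
// Illustration only: co-locate the default with the option key instead of inlining "" at the call site.
// Names are hypothetical stand-ins for DataSourceReadOptions members.
object ReadOptionsSketch {
  val PUSH_DOWN_INCR_FILTERS_OPT_KEY = "push.down.incr.filters"  // stand-in key string
  val DEFAULT_PUSH_DOWN_INCR_FILTERS_OPT_VAL = ""                // suggested default, defined next to the key
}

// Stand-in for the relation's options map; the call site then reads:
val optParams: Map[String, String] = Map.empty
val filterStr = optParams.getOrElse(
  ReadOptionsSketch.PUSH_DOWN_INCR_FILTERS_OPT_KEY,
  ReadOptionsSketch.DEFAULT_PUSH_DOWN_INCR_FILTERS_OPT_VAL)
```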




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-22 Thread GitBox
vinothchandar commented on a change in pull request #1348: HUDI-597 Enable 
incremental pulling from defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348#discussion_r382966967
 
 

 ##
 File path: hudi-spark/src/test/scala/TestDataSource.scala
 ##
 @@ -135,6 +136,14 @@ class TestDataSource extends AssertionsForJUnit {
 countsPerCommit = 
hoodieIncViewDF2.groupBy("_hoodie_commit_time").count().collect();
 assertEquals(1, countsPerCommit.length)
 assertEquals(commitInstantTime2, countsPerCommit(0).get(0))
+
+// pull the latest commit within certain partitions
+val hoodieIncViewDF3 = spark.read.format("org.apache.hudi")
+  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, 
commitInstantTime1)
+  .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "/2016/*/*/*")
 
 Review comment:
  Is the leading `/` necessary? Could we make it (if not already) such that the matching works with or without it?
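
One way to answer that empirically is a quick sketch like the one below; the full path is made up, and the `"*" +` prepending mirrors what IncrementalRelation does.

```scala
// Quick check (sketch): does the user-supplied glob need the leading "/" ?
import org.apache.hadoop.fs.GlobPattern

val fullPath = "hdfs://ns/tables/trips/2016/10/02/abc123_1-0-1_20200220.parquet"  // made-up path

for (userGlob <- Seq("/2016/*/*/*", "2016/*/*/*")) {
  val matcher = new GlobPattern("*" + userGlob)  // same prepending as in IncrementalRelation
  println(s"pattern '$userGlob' matches: ${matcher.matches(fullPath)}")
}
```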




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #197

2020-02-22 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.25 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an 

[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread GitBox
smarthi commented on a change in pull request #1350: [HUDI-629]: Replace 
Guava's Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r382958083
 
 

 ##
 File path: 
hudi-hive/src/test/java/org/apache/hudi/hive/util/HiveTestService.java
 ##
 @@ -87,7 +87,7 @@ public Configuration getHadoopConf() {
   }
 
   public HiveServer2 start() throws IOException {
-Preconditions.checkState(workDir != null, "The work dir must be set before 
starting cluster.");
+Objects.requireNonNull(workDir, "The work dir must be set before starting 
cluster.");
 
 Review comment:
   ditto




[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread GitBox
smarthi commented on a change in pull request #1350: [HUDI-629]: Replace 
Guava's Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r382958076
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/minicluster/HdfsTestService.java
 ##
 @@ -66,7 +66,7 @@ public Configuration getHadoopConf() {
   }
 
   public MiniDFSCluster start(boolean format) throws IOException {
-Preconditions.checkState(workDir != null, "The work dir must be set before 
starting cluster.");
+Objects.requireNonNull(workDir, "The work dir must be set before starting 
cluster.");
 
 Review comment:
  For this null check it's unnecessary: there is a checkState in ValidationUtils to check other boolean conditions. But for a null check, it's fine to use Objects.requireNonNull().




[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread GitBox
smarthi commented on a change in pull request #1350: [HUDI-629]: Replace 
Guava's Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r382957997
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/NumericUtils.java
 ##
 @@ -31,4 +41,28 @@ public static String humanReadableByteCount(double bytes) {
 String pre = "KMGTPE".charAt(exp - 1) + "";
 return String.format("%.1f %sB", bytes / Math.pow(1024, exp), pre);
   }
+
+  public static long getMessageDigestHash(final String algorithmName, final 
String string) {
+MessageDigest md = null;
+try {
+  md = MessageDigest.getInstance(algorithmName);
+} catch (NoSuchAlgorithmException e) {
+  LOGGER.error("Invalid Algorithm Specified: {}", algorithmName);
 
 Review comment:
   ok will do




[incubator-hudi] branch asf-site updated: [HUDI-611] Add Impala Guide to Doc (#1349)

2020-02-22 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 49047de  [HUDI-611] Add Impala Guide to Doc (#1349)
49047de is described below

commit 49047deb43bc400ccc0fa6b2d02f2fe379d1c8a4
Author: YanJia-Gary-Li 
AuthorDate: Sat Feb 22 18:19:25 2020 -0800

[HUDI-611] Add Impala Guide to Doc (#1349)
---
 docs/_docs/2_3_querying_data.cn.md | 30 ++
 docs/_docs/2_3_querying_data.md| 29 +
 2 files changed, 59 insertions(+)

diff --git a/docs/_docs/2_3_querying_data.cn.md 
b/docs/_docs/2_3_querying_data.cn.md
index 81f2273..b2c4870 100644
--- a/docs/_docs/2_3_querying_data.cn.md
+++ b/docs/_docs/2_3_querying_data.cn.md
@@ -145,3 +145,33 @@ scala> sqlContext.sql("select count(*) from hudi_rt where 
datestr = '2016-10-02'
 
Presto is a popular query engine, providing interactive query performance. Hudi RO tables can be queried seamlessly in Presto.
This requires the `hudi-presto-bundle` jar to be placed into `/plugin/hive-hadoop2/`, across the installation.
+
+## Impala (not officially released yet)
+
+### Read optimized table
+
+Impala can query a Hudi read optimized table on HDFS as an [EXTERNAL TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables).
+A Hudi read optimized table can be created in Impala as follows:
+```
+CREATE EXTERNAL TABLE database.table_name
+LIKE PARQUET '/path/to/load/xxx.parquet'
+STORED AS HUDIPARQUET
+LOCATION '/path/to/load';
+```
+Impala can take advantage of sensible file partitioning to improve query efficiency.
+To create a partitioned table, the folder naming should follow the convention `year=2020/month=1`.
+Impala uses `=` to separate the partition name and the partition value.
+A partitioned Hudi read optimized table can be created in Impala as follows:
+```
+CREATE EXTERNAL TABLE database.table_name
+LIKE PARQUET '/path/to/load/xxx.parquet'
+PARTITION BY (year int, month int, day int)
+STORED AS HUDIPARQUET
+LOCATION '/path/to/load';
+ALTER TABLE database.table_name RECOVER PARTITIONS;
+```
+After Hudi makes a new commit, refresh the Impala table to get the latest results.
+```
+REFRESH database.table_name
+```
+
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 2d97e2b..0ee5e17 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -150,3 +150,32 @@ Additionally, `HoodieReadClient` offers the following 
functionality using Hudi's
 
 Presto is a popular query engine, providing interactive query performance. 
Presto currently supports only read optimized queries on Hudi tables. 
 This requires the `hudi-presto-bundle` jar to be placed into 
`/plugin/hive-hadoop2/`, across the installation.
+
+## Impala(Not Officially Released)
+
+### Read optimized table
+
+Impala is able to query Hudi read optimized table as an [EXTERNAL 
TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables)
 on HDFS.  
+To create a Hudi read optimized table on Impala:
+```
+CREATE EXTERNAL TABLE database.table_name
+LIKE PARQUET '/path/to/load/xxx.parquet'
+STORED AS HUDIPARQUET
+LOCATION '/path/to/load';
+```
+Impala is able to take advantage of the physical partition structure to 
improve the query performance.
+To create a partitioned table, the folder should follow the naming convention 
like `year=2020/month=1`.
+Impala use `=` to separate partition name and partition value.  
+To create a partitioned Hudi read optimized table on Impala:
+```
+CREATE EXTERNAL TABLE database.table_name
+LIKE PARQUET '/path/to/load/xxx.parquet'
+PARTITION BY (year int, month int, day int)
+STORED AS HUDIPARQUET
+LOCATION '/path/to/load';
+ALTER TABLE database.table_name RECOVER PARTITIONS;
+```
+After Hudi made a new commit, refresh the Impala table to get the latest 
results.
+```
+REFRESH database.table_name
+```



[GitHub] [incubator-hudi] leesf merged pull request #1349: HUDI-611 Add Impala Guide to Doc

2020-02-22 Thread GitBox
leesf merged pull request #1349: HUDI-611 Add Impala Guide to Doc
URL: https://github.com/apache/incubator-hudi/pull/1349
 
 
   




[GitHub] [incubator-hudi] leesf commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread GitBox
leesf commented on a change in pull request #1350: [HUDI-629]: Replace Guava's 
Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r382957503
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/NumericUtils.java
 ##
 @@ -31,4 +41,28 @@ public static String humanReadableByteCount(double bytes) {
 String pre = "KMGTPE".charAt(exp - 1) + "";
 return String.format("%.1f %sB", bytes / Math.pow(1024, exp), pre);
   }
+
+  public static long getMessageDigestHash(final String algorithmName, final 
String string) {
+MessageDigest md = null;
+try {
+  md = MessageDigest.getInstance(algorithmName);
+} catch (NoSuchAlgorithmException e) {
+  LOGGER.error("Invalid Algorithm Specified: {}", algorithmName);
 
 Review comment:
   How about removing the LOGGER and throwing an exception here? Though 
Objects.requireNonNull(md) would also throw an exception if md is null.
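
   For illustration, a minimal sketch of the fail-fast variant is below. The exception type and the final digest-to-long step are assumptions for the sketch, not taken from this PR:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class NumericUtilsSketch {

  // Fail fast on an unknown algorithm instead of logging and continuing with a null digest.
  public static long getMessageDigestHash(final String algorithmName, final String string) {
    final MessageDigest md;
    try {
      md = MessageDigest.getInstance(algorithmName);
    } catch (NoSuchAlgorithmException e) {
      // Assumed exception type; the PR may wrap this in a Hudi-specific runtime exception instead.
      throw new IllegalArgumentException("Invalid Algorithm Specified: " + algorithmName, e);
    }
    // Assumed final step: fold the first 8 bytes of the digest into a long.
    return ByteBuffer.wrap(md.digest(string.getBytes(StandardCharsets.UTF_8))).getLong();
  }
}
```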


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread GitBox
leesf commented on a change in pull request #1350: [HUDI-629]: Replace Guava's 
Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r382957313
 
 

 ##
 File path: 
hudi-hive/src/test/java/org/apache/hudi/hive/util/HiveTestService.java
 ##
 @@ -87,7 +87,7 @@ public Configuration getHadoopConf() {
   }
 
   public HiveServer2 start() throws IOException {
-Preconditions.checkState(workDir != null, "The work dir must be set before 
starting cluster.");
+Objects.requireNonNull(workDir, "The work dir must be set before starting 
cluster.");
 
 Review comment:
   ditto


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread GitBox
leesf commented on a change in pull request #1350: [HUDI-629]: Replace Guava's 
Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r382957273
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/minicluster/HdfsTestService.java
 ##
 @@ -66,7 +66,7 @@ public Configuration getHadoopConf() {
   }
 
   public MiniDFSCluster start(boolean format) throws IOException {
-Preconditions.checkState(workDir != null, "The work dir must be set before 
starting cluster.");
+Objects.requireNonNull(workDir, "The work dir must be set before starting 
cluster.");
 
 Review comment:
   How about adding `checkState` to ValidationUtils?
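
   For reference, a Guava-free `checkState` helper could be as small as the sketch below (the class name and its final location are whatever the project chooses; this only illustrates the idea):

```java
public final class ValidationUtils {

  private ValidationUtils() {
    // static utility holder, no instances
  }

  // Guava-free stand-in for Preconditions.checkState: throws when the expression does not hold.
  public static void checkState(final boolean expression, final String message) {
    if (!expression) {
      throw new IllegalStateException(message);
    }
  }
}
```

   Call sites such as the one above would then read `ValidationUtils.checkState(workDir != null, "The work dir must be set before starting cluster.");`.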


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042748#comment-17042748
 ] 

lamber-ken commented on HUDI-625:
-

Let me check it. I took GenericRecord as a demo test here. Now we know that if 
we register the serializer before deserializing, kryo will be blazing fast.

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stackstrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at java.util.zip.ZipFile.getEntry(Native Method)
> 

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042721#comment-17042721
 ] 

Vinoth Chandar commented on HUDI-625:
-

[~lamber-ken] the payload class does not actually have a GenericRecord. It just 
has a byte[]. See `OverwriteWithLatestAvroPayload`. I am wondering if the 
issue is that Kryo is falling back to java serialization?

So registering serializers for the common classes we have (HoodieRecord, 
HoodieKey, etc.) would be good. And please test the performance using the original 
spark-shell snippet as well to make sure it works end-to-end, i.e. the 4M 
insert/update workload from the issue.
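
For illustration, a sketch of such up-front registration is below (the Hudi model class packages are assumed from hudi-common; which serializer Kryo ends up using per class, and whether custom ones are still needed, is exactly what this thread is investigating):
{code:java}
import com.esotericsoftware.kryo.Kryo;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.model.HoodieRecordLocation;

public class HoodieKryoRegistration {

  // Pre-register the classes that dominate the spill path so Kryo writes compact
  // registration ids instead of full class names for every entry.
  public static Kryo newKryo() {
    Kryo kryo = new Kryo();
    kryo.setRegistrationRequired(false);
    kryo.register(HoodieKey.class);
    kryo.register(HoodieRecord.class);
    kryo.register(HoodieRecordLocation.class);
    kryo.register(byte[].class);
    return kryo;
  }
}
{code}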


> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap 

[jira] [Updated] (HUDI-629) Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-629:
---
Status: Patch Available  (was: In Progress)

> Replace Guava's Hashing with an equivalent in NumericUtils.java
> ---
>
> Key: HUDI-629
> URL: https://issues.apache.org/jira/browse/HUDI-629
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] smarthi opened a new pull request #1350: Hudi 629

2020-02-22 Thread GitBox
smarthi opened a new pull request #1350: Hudi 629
URL: https://github.com/apache/incubator-hudi/pull/1350
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Replace Guava's Hashing call with equivalent.
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [X] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-629) Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-629:
---
Status: Open  (was: New)

> Replace Guava's Hashing with an equivalent in NumericUtils.java
> ---
>
> Key: HUDI-629
> URL: https://issues.apache.org/jira/browse/HUDI-629
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-629) Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-629:
---
Status: In Progress  (was: Open)

> Replace Guava's Hashing with an equivalent in NumericUtils.java
> ---
>
> Key: HUDI-629
> URL: https://issues.apache.org/jira/browse/HUDI-629
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-629) Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-22 Thread Suneel Marthi (Jira)
Suneel Marthi created HUDI-629:
--

 Summary: Replace Guava's Hashing with an equivalent in 
NumericUtils.java
 Key: HUDI-629
 URL: https://issues.apache.org/jira/browse/HUDI-629
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Utilities
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 0.5.2






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nsivabalan commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-02-22 Thread GitBox
nsivabalan commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi 
Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-589975941
 
 
   Sure. Will take care.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Issue Comment Deleted] (HUDI-580) Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh

2020-02-22 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-580:
---
Comment: was deleted

(was: The license header looks OK to me - I would check with Justin once again 
and mark this as Resolved.)

> Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh
> ---
>
> Key: HUDI-580
> URL: https://issues.apache.org/jira/browse/HUDI-580
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: leesf
>Assignee: lamber-ken
>Priority: Major
>  Labels: compliance
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> Would get it fixed before next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-580) Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh

2020-02-22 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-580:
---
Labels: compliance  (was: )

> Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh
> ---
>
> Key: HUDI-580
> URL: https://issues.apache.org/jira/browse/HUDI-580
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: leesf
>Assignee: lamber-ken
>Priority: Major
>  Labels: compliance
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> Would get it fixed before next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042603#comment-17042603
 ] 

Vinoth Chandar commented on HUDI-625:
-

cc [~ovjforu] Would you like to chime in on this? 

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stackstrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at java.util.zip.ZipFile.getEntry(Native Method)
> at java.util.zip.ZipFile.getEntry(ZipFile.java:310)
> -  locked 

[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042597#comment-17042597
 ] 

lamber-ken edited comment on HUDI-625 at 2/22/20 3:06 PM:
--

hi [~vinoth], caching the class seems like the right way, but the deserialized 
result is wrong.

So I used a new approach that defines a GenericDataRecordSerializer and then registers 
it to kryo, like the code snippet above does.


was (Author: lamber-ken):
hi [~vinoth], caching the class seems like a good way, but the deserialized 
result is wrong.

So I used a new way that defined a GenericDataRecordSerializer, then rigister 
it to kryo. like above code snippet  does.

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of 

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042597#comment-17042597
 ] 

lamber-ken commented on HUDI-625:
-

hi [~vinoth], caching the class seems like a good way, but the deserialized 
result is wrong.

So I used a new approach that defines a GenericDataRecordSerializer and then registers 
it to kryo, like the code snippet above does.
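
For context, a sketch of how such a custom serializer gets wired into Kryo (assuming the target class is org.apache.avro.generic.GenericData$Record, as the stack traces in this thread suggest; GenericDataRecordSerializer itself is shown in full in another message of this thread):
{code:java}
import com.esotericsoftware.kryo.Kryo;
import org.apache.avro.generic.GenericData;

public class AvroRecordKryoSetup {

  // Route GenericData.Record through the custom Avro-based serializer so Kryo never
  // needs a no-arg constructor for it. GenericDataRecordSerializer is assumed to be
  // on the classpath (its definition appears in another message below).
  public static Kryo newKryo() {
    Kryo kryo = new Kryo();
    kryo.register(GenericData.Record.class, new GenericDataRecordSerializer());
    return kryo;
  }
}
{code}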

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stackstrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% 

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042592#comment-17042592
 ] 

Vinoth Chandar commented on HUDI-625:
-

[~lamber-ken] caching the class looks like a good thing to do anyway. Let me 
understand this more.

Did you find any issues with this approach?

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stackstrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at java.util.zip.ZipFile.getEntry(Native Method)
> 

[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042356#comment-17042356
 ] 

lamber-ken edited comment on HUDI-625 at 2/22/20 11:15 AM:
---

Hi [~vinoth], I cached the Class info and it is about 100x faster than before; I found that when 
deserializing each entry, kryo always calls the "KryoBase#newInstantiator" method.

Sorry, I checked the deserialized result and it's wrong (x), keep going!
||times||1||2||10||
|before|48859ms|107353ms|too long|
|now|494ms|590ms|2067ms|
{code:java}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import com.esotericsoftware.kryo.serializers.FieldSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.util.SerializationUtils;
import org.objenesis.instantiator.ObjectInstantiator;

import java.io.Serializable;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class KryoTest3 {

public static final String TRIP_EXAMPLE_SCHEMA = "{\"type\": \"record\"," + 
"\"name\": \"triprec\"," + "\"fields\": [ "
+ "{\"name\": \"timestamp\",\"type\": \"double\"}," + "{\"name\": 
\"_row_key\", \"type\": \"string\"},"
+ "{\"name\": \"rider\", \"type\": \"string\"}," + "{\"name\": 
\"driver\", \"type\": \"string\"},"
+ "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
+ "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
+ "{\"name\": \"fare\",\"type\": {\"type\":\"record\", 
\"name\":\"fare\",\"fields\": ["
+ "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": 
\"currency\", \"type\": \"string\"}]}},"
+ "{\"name\": \"_hoodie_is_deleted\", \"type\": \"boolean\", 
\"default\": false} ]}";

public static final Schema AVRO_SCHEMA = new 
Schema.Parser().parse(TRIP_EXAMPLE_SCHEMA);

public static GenericRecord generateGenericRecord() {
Random RAND = new Random(46474747);

GenericRecord rec = new GenericData.Record(AVRO_SCHEMA);
rec.put("_row_key", "rowKey");
rec.put("timestamp", "timestamp");
rec.put("rider", "riderName");
rec.put("driver", "driverName");
rec.put("begin_lat", RAND.nextDouble());
rec.put("begin_lon", RAND.nextDouble());
rec.put("end_lat", RAND.nextDouble());
rec.put("end_lon", RAND.nextDouble());
rec.put("_hoodie_is_deleted", false);
return rec;
}

public static void main(String[] args) throws Exception {

GenericRecord genericRecord = generateGenericRecord();

Kryo kryo = new KryoInstantiator().newKryo();

// store data bytes
List<byte[]> objectDatas = new LinkedList<>();
for (int i = 0; i < 1; i++) {
Output output = new Output(1, 4024);
kryo.writeClassAndObject(output, genericRecord);
output.close();
objectDatas.add(SerializationUtils.serialize(genericRecord));
}

long t1 = System.currentTimeMillis();

System.out.println("starting deserialize");

// deserialize
List<Object> datas = new LinkedList<>();
for (byte[] data : objectDatas) {
datas.add(kryo.readClassAndObject(new Input(data)));
}

long t2 = System.currentTimeMillis();

System.err.println("dese times: " + datas.size());
System.err.println("dese cost: " + (t2 - t1) + "ms");

}

private static class KryoInstantiator implements Serializable {

public Kryo newKryo() {

Kryo kryo = new KryoBase();
// ensure that kryo doesn't fail if classes are not registered with 
kryo.
kryo.setRegistrationRequired(false);
// This would be used for object initialization if nothing else 
works out.
kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
// Handle cases where we may have an odd classloader setup like 
with libjars
// for hadoop
kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
return kryo;
}

private static class KryoBase extends Kryo {
@Override
protected Serializer newDefaultSerializer(Class type) {
final Serializer serializer = super.newDefaultSerializer(type);
if (serializer instanceof FieldSerializer) {
final FieldSerializer fieldSerializer = (FieldSerializer) 
serializer;

[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042484#comment-17042484
 ] 

lamber-ken edited comment on HUDI-625 at 2/22/20 11:09 AM:
---

BTW, if we didn't use , it will throw KryoException
{code:java}
Exception in thread "main" com.esotericsoftware.kryo.KryoException: Class 
cannot be created (missing no-arg constructor): 
org.apache.avro.generic.GenericData$Record
at 
com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1310)
at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1127)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1136)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:559)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:535)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
at org.apache.hudi.common.util.collection.KryoTest5.main(KryoTest5.java:103)
{code}
 


was (Author: lamber-ken):
BTW, if we didn't use , it will throw KryoException

 
{code:java}
Exception in thread "main" com.esotericsoftware.kryo.KryoException: Class 
cannot be created (missing no-arg constructor): 
org.apache.avro.generic.GenericData$Record
at 
com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1310)
at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1127)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1136)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:559)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:535)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
at org.apache.hudi.common.util.collection.KryoTest5.main(KryoTest5.java:103)
{code}
 

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, 

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042484#comment-17042484
 ] 

lamber-ken commented on HUDI-625:
-

BTW, if we didn't use , it will throw KryoException

 
{code:java}
Exception in thread "main" com.esotericsoftware.kryo.KryoException: Class 
cannot be created (missing no-arg constructor): 
org.apache.avro.generic.GenericData$Record
at 
com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1310)
at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1127)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1136)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:559)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:535)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
at org.apache.hudi.common.util.collection.KryoTest5.main(KryoTest5.java:103)
{code}
 

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-22 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042483#comment-17042483
 ] 

lamber-ken commented on HUDI-625:
-

It works. I defined a GenericDataRecordSerializer, then registered it to kryo.

dese times: 10
dese cost: 4086ms

 

GenericDataRecordSerializer
{code:java}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class GenericDataRecordSerializer extends Serializer<GenericRecord> 
implements Serializable {
private static final long serialVersionUID = 1L;

private void serializeDatum(Output output, GenericRecord object) throws 
IOException {

BinaryEncoder binaryEncoder = 
EncoderFactory.get().binaryEncoder(output, null);
Schema schema = object.getSchema();

byte[] bytes = schema.toString().getBytes(StandardCharsets.UTF_8);
output.writeInt(bytes.length);
output.write(bytes);

DatumWriter<GenericRecord> datumWriter = 
GenericData.get().createDatumWriter(schema);
datumWriter.write(object, binaryEncoder);

binaryEncoder.flush();

}


private GenericRecord deserializeDatum(Input input) throws IOException {

int length = input.readInt();
Schema schema = new Schema.Parser().parse(new 
String(input.readBytes(length)));

BinaryDecoder binaryDecoder = 
DecoderFactory.get().directBinaryDecoder(input, null);
DatumReader<GenericRecord> datumReader = 
GenericData.get().createDatumReader(schema);
  return datumReader.read(null, binaryDecoder);

   }

@Override
public void write(Kryo kryo, Output output, GenericRecord object) {
  try {
 serializeDatum(output, object);
  } catch (IOException e) {
 throw new RuntimeException();
  }
   }

@Override
public GenericRecord read(Kryo kryo, Input input, Class<GenericRecord> 
type) {
  try {
 return deserializeDatum(input);
  } catch (IOException e) {
 throw new RuntimeException();
  }
   }
}
{code}
KryoTest5
{code:java}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import java.util.LinkedList;
import java.util.List;
import java.util.Random;

public class KryoTest5 {

public static final String TRIP_EXAMPLE_SCHEMA = "{\n" +
"\"type\":\"record\",\n" +
"\"name\":\"triprec\",\n" +
"\"fields\":[\n" +
"{\n" +
"\"name\":\"timestamp\",\n" +
"\"type\":\"double\"\n" +
"},\n" +
"{\n" +
"\"name\":\"_row_key\",\n" +
"\"type\":\"string\"\n" +
"},\n" +
"{\n" +
"\"name\":\"rider\",\n" +
"\"type\":\"string\"\n" +
"},\n" +
"{\n" +
"\"name\":\"driver\",\n" +
"\"type\":\"string\"\n" +
"},\n" +
"{\n" +
"\"name\":\"begin_lat\",\n" +
"\"type\":\"double\"\n" +
"},\n" +
"{\n" +
"\"name\":\"begin_lon\",\n" +
"\"type\":\"double\"\n" +
"},\n" +
"{\n" +
"\"name\":\"end_lat\",\n" +
"\"type\":\"double\"\n" +
"},\n" +
"{\n" +
"\"name\":\"end_lon\",\n" +
"\"type\":\"double\"\n" +
"},\n" +
"{\n" +
"\"name\":\"_hoodie_is_deleted\",\n" +
"\"type\":\"boolean\",\n" +
"\"default\":false\n" +
"}\n" +
"]\n" +
"}";

public static final Schema AVRO_SCHEMA = new 
Schema.Parser().parse(TRIP_EXAMPLE_SCHEMA);

public static GenericRecord generateGenericRecord() {
Random RAND = new Random(46474747);

GenericRecord rec = new GenericData.Record(AVRO_SCHEMA);
rec.put("_row_key", "rowKey" +