[GitHub] [hudi] n3nash commented on issue #1845: [SUPPORT] Support for Schema evolution. Facing an error

2020-07-21 Thread GitBox


n3nash commented on issue #1845:
URL: https://github.com/apache/hudi/issues/1845#issuecomment-662269249


   @sbernauer I briefly looked at the test case you added and I see what you are trying to reproduce. The issue seems to be as follows:
   1) Generate data with schema A, pass schema A
   2) Generate data with schema B, pass schema B
   3) Generate data with schema A, pass schema B
   where schema A is the smaller schema and schema B is a larger schema with one added field.
   @bvaradar Have we tested such a scenario before with the `AvroConversionHelper` class, where we convert the RDD to a DF with schema A but pass schema B as the writer schema?
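
   For readers following along, here is a minimal pyspark sketch of that three-step sequence, assuming a hypothetical local table path and treating schema B as schema A plus one nullable `extra` field; the table name, path, and field names are illustrative and not taken from the linked test case:
   ```python
   from pyspark.sql import Row, SparkSession

   spark = SparkSession.builder.appName("schema-evolution-sketch").getOrCreate()

   # Hypothetical table name and base path, for illustration only.
   opts = {
       "hoodie.table.name": "schema_evolution_demo",
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.recordkey.field": "uuid",
       "hoodie.datasource.write.partitionpath.field": "part",
       "hoodie.datasource.write.precombine.field": "ts",
   }
   path = "/tmp/hudi/schema_evolution_demo"

   def write(df):
       df.write.format("hudi").options(**opts).mode("append").save(path)

   # 1) data with schema A (uuid, part, ts)
   write(spark.createDataFrame([Row(uuid="1", part="p1", ts=1)]))
   # 2) data with schema B (schema A plus one added nullable field)
   write(spark.createDataFrame([Row(uuid="2", part="p1", ts=2, extra="x")]))
   # 3) data shaped like schema A again, while the table's latest schema is B;
   #    this is the step where the reported error shows up
   write(spark.createDataFrame([Row(uuid="3", part="p1", ts=3)]))
   ```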



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-21 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r458556752



##
File path: 
hudi-client/src/test/java/org/apache/hudi/testutils/HoodieWriteCommitTestHarness.java
##
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.testutils;
+
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.apache.hudi.config.HoodieWriteCommitCallbackConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.io.IOException;
+
+/**
+ * The test harness for resource initialization and cleanup.
+ */
+public class HoodieWriteCommitTestHarness extends HoodieCommonTestHarness {

Review comment:
   > ditto
   
   This class provides test functions for the callback service. Right now, it provides functions to init and tear down `HoodieWriteCommitCallbackConfig` and `HoodieWriteConfig`.









[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-21 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r458555824



##
File path: 
hudi-client/src/test/java/org/apache/hudi/callback/http/TestHoodieWriteCommitHttpCallback.java
##
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.callback.http;
+
+import org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback;
+import org.apache.hudi.testutils.HoodieWriteCommitTestHarness;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+
+import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+
+/**
+ * Unit test for {@link HoodieWriteCommitHttpCallback}.
+ */
+public class TestHoodieWriteCommitHttpCallback extends 
HoodieWriteCommitTestHarness {

Review comment:
   > What's purpose of this class?
   
   Its purpose is to test the initialization process of `HoodieWriteCommitHttpCallback` when provided with proper params, referring to the existing test 'org.apache.hudi.metrics.datadog.TestDatadogMetricsReporter#instantiationShouldSucceed'.
   
   This test might be unnecessary; I'll remove it if you think it is unnecessary too.









[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458544915



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -110,6 +112,10 @@ object DataSourceReadOptions {
*/
   val INCR_PATH_GLOB_OPT_KEY = "hoodie.datasource.read.incr.path.glob"
   val DEFAULT_INCR_PATH_GLOB_OPT_VAL = ""
+
+
+  val REALTIME_SKIP_MERGE_KEY = REALTIME_SKIP_MERGE_PROP

Review comment:
   added `MERGE_ON_READ_PAYLOAD_KEY` and `MERGE_ON_READ_ORDERING_KEY`. Then 
we use the payload to do all the merging.
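
   For readers unfamiliar with the payload-based approach mentioned here, the idea can be sketched as follows (a conceptual illustration only; the dict-shaped records and field names are stand-ins, not Hudi's actual `HoodieRecordPayload` API):
   ```python
   # Conceptual sketch of ordering-field ("precombine") based merging.
   # Field names and the dict-based record shape are illustrative only.

   def merge(base_record, log_record, ordering_field="ts"):
       """Return the record that should survive when both share a key."""
       if log_record is None:
           return base_record
       if base_record is None:
           return log_record
       # Keep whichever record has the larger ordering value.
       if log_record[ordering_field] >= base_record[ordering_field]:
           return log_record
       return base_record

   base = {"uuid": "1", "ts": 10, "fare": 5.0}
   update = {"uuid": "1", "ts": 12, "fare": 7.5}
   print(merge(base, update))  # -> the update, because its ts is newer
   ```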

##
File path: hudi-spark/src/main/scala/org/apache/hudi/SnapshotRelation.scala
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils
+import 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes
+
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.JobConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.sources.{BaseRelation, TableScan}
+import org.apache.spark.sql.types.StructType
+
+import scala.collection.JavaConverters._
+
+case class HudiMergeOnReadFileSplit(dataFile: PartitionedFile,
+                                    logPaths: Option[List[String]],
+                                    latestCommit: String,
+                                    tablePath: String,
+                                    maxCompactionMemoryInBytes: Long,
+                                    skipMerge: Boolean)
+
+class SnapshotRelation (val sqlContext: SQLContext,
+                        val optParams: Map[String, String],
+                        val userSchema: StructType,
+                        val globPaths: Seq[Path],
+                        val metaClient: HoodieTableMetaClient)
+  extends BaseRelation with TableScan with Logging{

Review comment:
   https://issues.apache.org/jira/browse/HUDI-1050

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
##
@@ -40,6 +40,8 @@
 import java.io.IOException;
 import java.util.Map;
 
+import static 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes;

Review comment:
   done

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/config/HadoopSerializableConfiguration.java
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.config;
+
+import org.apache.hadoop.conf.Configuration;
+
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.io.Serializable;
+
+public class HadoopSerializableConfiguration implements Serializable {

Review comment:
   d

[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-21 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r458552435



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteCommitCallbackConfig.java
##
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.config;
+
+import org.apache.hudi.common.config.DefaultHoodieConfig;
+
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.Properties;
+
+/**
+ * Write callback related config.
+ */
+public class HoodieWriteCommitCallbackConfig extends DefaultHoodieConfig {
+
+  public static final String CALLBACK_ON = "hoodie.write.commit.callback.on";
+  public static final boolean DEFAULT_CALLBACK_ON = false;
+
+  public static final String CALLBACK_CLASS_PROP = 
"hoodie.write.commit.callback.class";
+  public static final String DEFAULT_CALLBACK_CLASS_PROP = 
"org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback";

Review comment:
   > `DEFAULT_CALLBACK_CLASS`?
   
   Hi @yanghua, thanks for your detailed review.
   All these variables are named after existing variables; keeping the same naming style would be better :)









[jira] [Updated] (HUDI-1050) Support filter pushdown and column pruning for MOR table on Spark Datasource

2020-07-21 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1050:
-
Fix Version/s: 0.6.0  (was: 0.6.1)

> Support filter pushdown and column pruning for MOR table on Spark Datasource
> 
>
> Key: HUDI-1050
> URL: https://issues.apache.org/jira/browse/HUDI-1050
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> We need to use the information provided by PrunedFilteredScan to push down 
> the filter and column projection. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-21 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r458550790



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteCommitCallbackConfig.java
##
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.config;
+
+import org.apache.hudi.common.config.DefaultHoodieConfig;
+
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.Properties;
+
+/**
+ * Write callback related config.
+ */
+public class HoodieWriteCommitCallbackConfig extends DefaultHoodieConfig {
+
+  public static final String CALLBACK_ON = "hoodie.write.commit.callback.on";
+  public static final boolean DEFAULT_CALLBACK_ON = false;
+
+  public static final String CALLBACK_CLASS_PROP = 
"hoodie.write.commit.callback.class";
+  public static final String DEFAULT_CALLBACK_CLASS_PROP = 
"org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback";
+
+  // * REST callback configs *
+  public static final String CALLBACK_HTTP_URL_PROP = 
"hoodie.write.commit.callback.http.url";
+  public static final String CALLBACK_HTTP_API_KEY = 
"hoodie.write.commit.callback.http.api.key";
+  public static final String DEFAULT_CALLBACK_HTTP_API_KEY = 
"hudi_write_commit_http_callback";
+  public static final String CALLBACK_HTTP_TIMEOUT_SECONDS = 
"hoodie.write.commit.callback.http.timeout.seconds";
+  public static final String DEFAULT_CALLBACK_HTTP_TIMEOUT_SECONDS = "3";

Review comment:
   > int?
   
   yes, better
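
   As a usage illustration, a job could enable the HTTP callback through these write configs once this PR lands. The option keys below come from the diff above (this PR, not a released Hudi version); the URL, API key, table name, and path are placeholders, and `df` is assumed to be an existing DataFrame:
   ```python
   # Placeholder values; the callback config keys are taken from this PR's
   # HoodieWriteCommitCallbackConfig and may change before release.
   callback_opts = {
       "hoodie.write.commit.callback.on": "true",
       "hoodie.write.commit.callback.class":
           "org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback",
       "hoodie.write.commit.callback.http.url": "https://example.com/hudi/commits",
       "hoodie.write.commit.callback.http.api.key": "my-api-key",
       "hoodie.write.commit.callback.http.timeout.seconds": "3",
   }

   (df.write.format("hudi")
       .options(**callback_opts)
       .option("hoodie.table.name", "callback_demo")           # placeholder table
       .option("hoodie.datasource.write.recordkey.field", "uuid")
       .option("hoodie.datasource.write.precombine.field", "ts")
       .mode("append")
       .save("/tmp/hudi/callback_demo"))                       # placeholder path
   ```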









[jira] [Updated] (HUDI-781) Re-design test utilities

2020-07-21 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-781:

Status: In Progress  (was: Open)

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>
> Test utility classes are to be re-designed with considerations like
>  * Use more mocking
>  * Reduce spark context setup
>  * Improve/clean up data generator
> An RFC would be preferred for illustrating the design work.





[jira] [Assigned] (HUDI-781) Re-design test utilities

2020-07-21 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-781:
---

Assignee: Raymond Xu

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>
> Test utility classes are to be re-designed with considerations like
>  * Use more mocking
>  * Reduce spark context setup
>  * Improve/clean up data generator
> An RFC would be preferred for illustrating the design work.





[jira] [Updated] (HUDI-781) Re-design test utilities

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-781:

Labels: pull-request-available  (was: )

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>
> Test utility classes are to be re-designed with considerations like
>  * Use more mocking
>  * Reduce spark context setup
>  * Improve/clean up data generator
> An RFC would be preferred for illustrating the design work.





[GitHub] [hudi] xushiyan opened a new pull request #1861: [HUDI-781] [WIP] Refactor test utils classes

2020-07-21 Thread GitBox


xushiyan opened a new pull request #1861:
URL: https://github.com/apache/hudi/pull/1861


   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Updated] (HUDI-896) Parallelize CI testing to reduce CI wait time

2020-07-21 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-896:

Status: In Progress  (was: Open)

> Parallelize CI testing to reduce CI wait time
> -
>
> Key: HUDI-896
> URL: https://issues.apache.org/jira/browse/HUDI-896
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> - 
>  
>  





[jira] [Resolved] (HUDI-896) Parallelize CI testing to reduce CI wait time

2020-07-21 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu resolved HUDI-896.
-
Resolution: Done

> Parallelize CI testing to reduce CI wait time
> -
>
> Key: HUDI-896
> URL: https://issues.apache.org/jira/browse/HUDI-896
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> - 
>  
>  





[GitHub] [hudi] stackfun opened a new issue #1860: [SUPPORT] Issue when querying from Spark Datasource if COW table is being written to at the same time

2020-07-21 Thread GitBox


stackfun opened a new issue #1860:
URL: https://github.com/apache/hudi/issues/1860


   **Describe the problem you faced**
   
   In one pyspark job, I'm appending 10 rows to a COW table in a loop. In another pyspark job, I'm doing a select count(*) on the same table in another loop.
   
   When querying using the Spark Datasource API, the count is unpredictable, only sometimes returning the right number of rows. When querying using Hive, the select count(*) query returns the expected results.
   
   **To Reproduce**
   
   I'm running two pyspark jobs simultaneously in GCP using dataproc.
   
   Writer Job
   ```python
   from pyspark.sql import SparkSession, functions
   import time

   table_name = "hudi_trips_cow"
   hudi_options = {
       "hoodie.table.name": table_name,
       "hoodie.datasource.write.recordkey.field": "uuid",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.partitionpath.field": "continent,country,city",
       "hoodie.datasource.write.table.name": table_name,
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.precombine.field": "ts",
       "hoodie.datasource.hive_sync.enable": True,
       "hoodie.datasource.hive_sync.database": "default",
       "hoodie.datasource.hive_sync.table": table_name,
       "hoodie.datasource.hive_sync.username": "hive",
       "hoodie.datasource.hive_sync.password": "hive",
       "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://localhost:1",
       "hoodie.datasource.hive_sync.partition_fields": "continent,country,city",
       "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
   }

   def execute(spark: SparkSession, output_path: str):
       start_time = time.time()
       while time.time() < start_time + 60 * 15:
           df = generate_trips(spark)
           df.write.format("hudi").options(**hudi_options).mode("append").save(output_path)

   def generate_trips(spark):
       sc = spark.sparkContext
       dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
       inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(
           dataGen.generateInserts(10)
       )
       # split partitionpath, necessary to sync with hive
       df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
       split_col = functions.split(df["partitionpath"], "/")
       df = df.withColumn("continent", split_col.getItem(0))
       df = df.withColumn("country", split_col.getItem(1))
       return df.withColumn("city", split_col.getItem(2))

   spark = (
       SparkSession.builder.appName("test")
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .enableHiveSupport()
       .getOrCreate()
   )
   execute(spark, "gs://random-gcs-folder-3adf/hudi-data")
   ```
   
   Reader Job
   ```python
   from pyspark.sql import SparkSession, functions, HiveContext
   from pyspark.sql.functions import col
   import time

   def spark_query(spark: SparkSession, input_path: str):
       df = spark.read.format("org.apache.hudi").load(input_path + "/*/*/*/*")
       df.createOrReplaceTempView("trips_spark_temp")
       spark.catalog.refreshTable("trips_spark_temp")
       print("Spark Query:")
       spark.sql("select count(*) from trips_spark_temp").show()

   def hive_query(hive_context: HiveContext):
       hudi_trips_table = hive_context.table("default.hudi_trips_cow")
       hudi_trips_table.createOrReplaceTempView("trips_temp")
       hive_context.sql("REFRESH TABLE trips_temp")
       print("Hive Query:")
       hive_context.sql("select count(*) from trips_temp").show()

   def execute(spark: SparkSession, input_path: str):
       hive_context = HiveContext(spark.sparkContext)

       start_time = time.time()
       while time.time() < start_time + (15 * 60):
           spark_query(spark, input_path)
           hive_query(hive_context)

   spark = (
       SparkSession.builder.appName("test")
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .enableHiveSupport()
       .getOrCreate()
   )
   execute(spark, "gs://random-gcs-folder-3adf/hudi-data")
   ```
   Output from Reader Job:
   ```
   Spark Query:
   +--------+
   |count(1)|
   +--------+
   |     545|
   +--------+
   
   Hive Query:
   +--------+
   |count(1)|
   +--------+
   |    1750|
   +--------+
   
   Spark Query:
   +--------+
   |count(1)|
   +--------+
   |    1760|
   +--------+
   
   Hive Query:
   +--------+
   |count(1)|
   +--------+
   |    1760|
   +--------+
   
   ```
   
   **Expected behavior**
   
   Queries using the Spark Datasource API should match the Hive queries.
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4.5
   
   * Hive version : 2.3.7
   
   * Hadoop version : 2.10
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running

Build failed in Jenkins: hudi-snapshot-deployment-0.5 #346

2020-07-21 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.34 KB...]

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${s

[GitHub] [hudi] satishkotha commented on pull request #1859: [HUDI-1072] Use replace metadata file to filter excluded files in views

2020-07-21 Thread GitBox


satishkotha commented on pull request #1859:
URL: https://github.com/apache/hudi/pull/1859#issuecomment-662223025


   > Reviewed 50%, high level, I feel the changes of excludeFileGroups is being 
forced into many of the `TableFileSystem` implementations. Need to think more 
if there is a way to introduce the correct abstractions to avoid having to add 
this excludeFileGroups everywhere.
   
   Yes, the intent is to get early feedback; I appreciate any suggestions. The reason I added excludeFileGroups in all views is that in some cases this list may be huge, so having a configurable spillable view (or RocksDB view) can be useful.
   
   It is also possible to encapsulate all of this in AbstractFileView and hide it from the subclasses. Let me know if you think that is a better solution.







[GitHub] [hudi] satishkotha commented on a change in pull request #1859: [HUDI-1072] Use replace metadata file to filter excluded files in views

2020-07-21 Thread GitBox


satishkotha commented on a change in pull request #1859:
URL: https://github.com/apache/hudi/pull/1859#discussion_r458513411



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
##
@@ -103,14 +105,19 @@ protected void init(HoodieTableMetaClient metaClient, 
HoodieTimeline visibleActi
* @param visibleActiveTimeline Visible Active Timeline
*/
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
-this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getCommitsAndCompactionTimeline();
+this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getCommitsReplaceAndCompactionTimeline();
   }
 
   /**
* Adds the provided statuses into the file system view, and also caches it 
inside this object.
*/
   protected List addFilesToView(FileStatus[] statuses) {
 HoodieTimer timer = new HoodieTimer().startTimer();
+final Map> partitionFileIdsToExclude = 
getFileIdsToExclude(visibleCommitsAndCompactionTimeline);

Review comment:
   Yes, for time travel, consider this scenario:
   t0 -> insert
   t1 -> insert overwrite1
   t2 -> insert overwrite2
   
   If we set the high watermark to t1 for time travel, visibleCommitTimeline would not have t2.commit or t2.replace, so the file groups written at t1 would still show as active file groups.
   
   When we move to t2, visibleCommitTimeline will have the t2 commit/replace, so the file groups written at t1 will no longer show as active.
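
   To make the watermark behaviour concrete, here is a small self-contained sketch of the filtering idea (purely illustrative; the dict-based instants and file-group names are stand-ins, not Hudi's actual timeline or file-system-view classes):
   ```python
   # Illustrative model of the scenario above: each completed replace instant
   # hides the file groups it replaced, but only if that instant is visible
   # (i.e. at or below the time-travel watermark).

   instants = [
       {"time": "t0", "action": "commit",  "writes": {"fg0"}, "replaces": set()},
       {"time": "t1", "action": "replace", "writes": {"fg1"}, "replaces": {"fg0"}},
       {"time": "t2", "action": "replace", "writes": {"fg2"}, "replaces": {"fg1"}},
   ]

   def active_file_groups(watermark):
       visible = [i for i in instants if i["time"] <= watermark]
       written = set().union(*(i["writes"] for i in visible))
       replaced = set().union(*(i["replaces"] for i in visible))
       return written - replaced

   print(active_file_groups("t1"))  # {'fg1'}: t2.replace is not visible yet
   print(active_file_groups("t2"))  # {'fg2'}: fg1 is now excluded
   ```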









[GitHub] [hudi] satishkotha commented on pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-21 Thread GitBox


satishkotha commented on pull request #1853:
URL: https://github.com/apache/hudi/pull/1853#issuecomment-662220607


   > High level, introducing `replace` action changes seem fine to me, 
interested in learning how old_file_group -> new_file_group mapping is stored 
and accessed. Yet to review the other PRs if they have this info..
   
   This is not a 1:1 mapping, so it is not stored in the replace metadata file. We can get the new file groups created from the corresponding commit metadata file. As far as I can see, we only need this information for auditing and debugging, so I'm planning to build it as a CLI task. Let me know if you see it being useful in other scenarios.







[GitHub] [hudi] satishkotha commented on a change in pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-21 Thread GitBox


satishkotha commented on a change in pull request #1853:
URL: https://github.com/apache/hudi/pull/1853#discussion_r458511328



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
##
@@ -126,6 +129,13 @@
*/
   HoodieTimeline getCommitsAndCompactionTimeline();
 
+  /**
+   * Timeline to just include replace instants that have valid 
(commit/deltacommit) actions.
+   *
+   * @return
+   */
+  HoodieTimeline getCompletedAndReplaceTimeline();

Review comment:
   Yes, this is used to build the filesystem view in #1859; I can move it to that PR.









[GitHub] [hudi] satishkotha commented on a change in pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-21 Thread GitBox


satishkotha commented on a change in pull request #1853:
URL: https://github.com/apache/hudi/pull/1853#discussion_r458510963



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
##
@@ -65,7 +65,8 @@
   COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, REQUESTED_COMMIT_EXTENSION, 
DELTA_COMMIT_EXTENSION,
   INFLIGHT_DELTA_COMMIT_EXTENSION, REQUESTED_DELTA_COMMIT_EXTENSION, 
SAVEPOINT_EXTENSION,
   INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION, 
REQUESTED_CLEAN_EXTENSION, INFLIGHT_CLEAN_EXTENSION,
-  INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, 
INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION));
+  INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, 
INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION,

Review comment:
   I added this for tests, but I can split it and the related test out into a different review.









[GitHub] [hudi] satishkotha commented on a change in pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-21 Thread GitBox


satishkotha commented on a change in pull request #1853:
URL: https://github.com/apache/hudi/pull/1853#discussion_r458510585



##
File path: hudi-common/src/main/avro/HoodieReplaceMetadata.avsc
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+ /*
+  * Note that all 'replace' instants are read for every query
+  * So it is important to keep this small. Please be careful
+  * before tracking additional information in this file.
+  * This will be used for 'insert_overwrite' (RFC-18) and also 'clustering' 
(RFC-19)
+  */
+{"namespace": "org.apache.hudi.avro.model",
+ "type": "record",
+ "name": "HoodieReplaceMetadata",
+ "fields": [
+ {"name": "totalFilesReplaced", "type": "int"},
+ {"name": "command", "type": "string"},
+ {"name": "partitionMetadata", "type": {

Review comment:
   This can be obtained from the related commit file. Every t.replace has a corresponding t.[delta]commit (we decided to write two files to reduce the overhead of reading all commit files in the query path).
   
   It also helps on the query side to keep replace metadata files small, and I didn't want to repeat the same information in two places.
   
   Let me know if you think it's important to store the new file group info here.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
##
@@ -304,6 +305,22 @@ public HoodieInstant 
transitionCleanRequestedToInflight(HoodieInstant requestedI
 return inflight;
   }
 
+  /**
+   * Transition Clean State from inflight to Committed.

Review comment:
   will update. thanks









[GitHub] [hudi] nsivabalan commented on a change in pull request #1858: [WIP] [1014] Part 1: Adding Upgrade or downgrade infra

2020-07-21 Thread GitBox


nsivabalan commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r458509862



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -151,6 +154,27 @@ public HoodieTableType getTableType() {
 : Option.empty();
   }
 
+  /**
+   * @return the table version from .hoodie properties file.
+   */
+  public HoodieTableVersion getHoodieTableVersionFromPropertyFile() {
+if (props.contains(HOODIE_TABLE_VERSION_PROP_NAME)) {
+  String propValue = props.getProperty(HOODIE_TABLE_VERSION_PROP_NAME);
+  if (propValue.equals(HoodieTableVersion.ZERO_SIX_ZERO.version)) {
+return HoodieTableVersion.ZERO_SIX_ZERO;
+  }
+}
+return DEFAULT_TABLE_VERSION;
+  }
+
+  /**
+   * @return the current hoodie table version.
+   */
+  public HoodieTableVersion getCurrentHoodieTableVersion() {
+// TODO: fetch current version dynamically

Review comment:
   @vinothchandar: sorry, I forgot to ask this question earlier. May I know how to fetch the current hoodie version in use?









[GitHub] [hudi] n3nash commented on a change in pull request #1859: [HUDI-1072] Use replace metadata file to filter excluded files in views

2020-07-21 Thread GitBox


n3nash commented on a change in pull request #1859:
URL: https://github.com/apache/hudi/pull/1859#discussion_r458490961



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
##
@@ -103,14 +105,19 @@ protected void init(HoodieTableMetaClient metaClient, 
HoodieTimeline visibleActi
* @param visibleActiveTimeline Visible Active Timeline
*/
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
-this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getCommitsAndCompactionTimeline();
+this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getCommitsReplaceAndCompactionTimeline();
   }
 
   /**
* Adds the provided statuses into the file system view, and also caches it 
inside this object.
*/
   protected List addFilesToView(FileStatus[] statuses) {
 HoodieTimer timer = new HoodieTimer().startTimer();
+final Map> partitionFileIdsToExclude = 
getFileIdsToExclude(visibleCommitsAndCompactionTimeline);

Review comment:
   Does this mean that we can never go back to querying the older file groups once they have been replaced? Can you still do time travel for insert-overwrite use cases?
   









[GitHub] [hudi] leesf commented on a change in pull request #1851: [HUDI-1113] Add user define metrics reporter

2020-07-21 Thread GitBox


leesf commented on a change in pull request #1851:
URL: https://github.com/apache/hudi/pull/1851#discussion_r458478829



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieMetricsConfig.java
##
@@ -58,6 +59,12 @@
 
   public static final String GRAPHITE_METRIC_PREFIX = GRAPHITE_PREFIX + 
".metric.prefix";
 
+  // User defined
+  public static final String USER_DEFINED_REPORTER_PREFIX = METRIC_PREFIX + 
".user.defined";
+  public static final String USER_DEFINED_REPORTER_CLASS = 
USER_DEFINED_REPORTER_PREFIX + ".class";

Review comment:
   Would rename to METRICS_REPORTER_CLASS = "metrics.reporter.class"?









[GitHub] [hudi] n3nash commented on a change in pull request #1859: [HUDI-1072] Use replace metadata file to filter excluded files in views

2020-07-21 Thread GitBox


n3nash commented on a change in pull request #1859:
URL: https://github.com/apache/hudi/pull/1859#discussion_r458478588



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
##
@@ -103,14 +105,19 @@ protected void init(HoodieTableMetaClient metaClient, 
HoodieTimeline visibleActi
* @param visibleActiveTimeline Visible Active Timeline
*/
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
-this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getCommitsAndCompactionTimeline();
+this.visibleCommitsAndCompactionTimeline = 
visibleActiveTimeline.getCommitsReplaceAndCompactionTimeline();

Review comment:
   This name irks me: `getCommitsReplaceAndCompactionTimeline`... We should introduce another hierarchy to group our actions: {commit, delta, compaction, replace} introduce new file groups, while {rollback, restore, clean} remove file groups, etc. Need to think more.









[GitHub] [hudi] leesf commented on a change in pull request #1851: [HUDI-1113] Add user define metrics reporter

2020-07-21 Thread GitBox


leesf commented on a change in pull request #1851:
URL: https://github.com/apache/hudi/pull/1851#discussion_r458478400



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieMetricsConfig.java
##
@@ -58,6 +59,12 @@
 
   public static final String GRAPHITE_METRIC_PREFIX = GRAPHITE_PREFIX + 
".metric.prefix";
 
+  // User defined
+  public static final String USER_DEFINED_REPORTER_PREFIX = METRIC_PREFIX + 
".user.defined";
+  public static final String USER_DEFINED_REPORTER_CLASS = 
USER_DEFINED_REPORTER_PREFIX + ".class";
+
+  public static final String DEFAULT_USER_DEFINED_REPORTER_CLASS = 
DefaultUserDefinedMetricsReporter.class.getName();

Review comment:
   DEFAULT would be an empty string. Also please refer to 
https://github.com/apache/hudi/blob/5e7ab11e2ead23428ed5089421f2abd6433fe8e5/hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java#L42









[GitHub] [hudi] leesf commented on a change in pull request #1851: [HUDI-1113] Add user define metrics reporter

2020-07-21 Thread GitBox


leesf commented on a change in pull request #1851:
URL: https://github.com/apache/hudi/pull/1851#discussion_r458477826



##
File path: 
hudi-client/src/main/java/org/apache/hudi/metrics/userdefined/DefaultUserDefinedMetricsReporter.java
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics.userdefined;
+
+import com.codahale.metrics.MetricRegistry;
+
+import java.io.Closeable;
+import java.util.Properties;
+
+/**
+ * Used for testing.

Review comment:
   If it is just used for testing, we should move it into a test class.









[GitHub] [hudi] leesf commented on a change in pull request #1851: [HUDI-1113] Add user define metrics reporter

2020-07-21 Thread GitBox


leesf commented on a change in pull request #1851:
URL: https://github.com/apache/hudi/pull/1851#discussion_r458477465



##
File path: 
hudi-client/src/main/java/org/apache/hudi/metrics/MetricsReporterType.java
##
@@ -22,5 +22,5 @@
  * Types of the reporter. Right now we only support Graphite. We can include 
JMX and CSV in the future.
  */
 public enum MetricsReporterType {
-  GRAPHITE, INMEMORY, JMX, DATADOG
+  GRAPHITE, INMEMORY, JMX, DATADOG, USER_DEFINED

Review comment:
   would remove USER_DEFINED.









[GitHub] [hudi] leesf commented on a change in pull request #1851: [HUDI-1113] Add user define metrics reporter

2020-07-21 Thread GitBox


leesf commented on a change in pull request #1851:
URL: https://github.com/apache/hudi/pull/1851#discussion_r458477384



##
File path: 
hudi-client/src/main/java/org/apache/hudi/metrics/MetricsReporterFactory.java
##
@@ -48,6 +51,10 @@ public static MetricsReporter 
createReporter(HoodieWriteConfig config, MetricReg
   case DATADOG:
 reporter = new DatadogMetricsReporter(config, registry);
 break;
+  case USER_DEFINED:

Review comment:
   we would remove this, please refer to 
https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/HoodieIndex.java#L56









[GitHub] [hudi] n3nash commented on a change in pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-21 Thread GitBox


n3nash commented on a change in pull request #1853:
URL: https://github.com/apache/hudi/pull/1853#discussion_r458476891



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
##
@@ -126,6 +129,13 @@
*/
   HoodieTimeline getCommitsAndCompactionTimeline();
 
+  /**
+   * Timeline to just include replace instants that have valid 
(commit/deltacommit) actions.
+   *
+   * @return
+   */
+  HoodieTimeline getCompletedAndReplaceTimeline();

Review comment:
   Maybe you have used this API in a different PR, but where exactly are we going to use this specific method?









[GitHub] [hudi] n3nash commented on a change in pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-21 Thread GitBox


n3nash commented on a change in pull request #1853:
URL: https://github.com/apache/hudi/pull/1853#discussion_r458476453



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
##
@@ -65,7 +65,8 @@
   COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, REQUESTED_COMMIT_EXTENSION, 
DELTA_COMMIT_EXTENSION,
   INFLIGHT_DELTA_COMMIT_EXTENSION, REQUESTED_DELTA_COMMIT_EXTENSION, 
SAVEPOINT_EXTENSION,
   INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION, 
REQUESTED_CLEAN_EXTENSION, INFLIGHT_CLEAN_EXTENSION,
-  INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, 
INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION));
+  INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, 
INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION,

Review comment:
   Since the intention of this PR is to simply introduce a new action, do we need to change the VALID_ACTION in this PR? Do the other methods you implemented in the ActiveTimeline require this? I'm wondering if we can avoid this change in case we want to land this PR incrementally.









[GitHub] [hudi] n3nash commented on a change in pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-21 Thread GitBox


n3nash commented on a change in pull request #1853:
URL: https://github.com/apache/hudi/pull/1853#discussion_r458476085



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
##
@@ -304,6 +305,22 @@ public HoodieInstant 
transitionCleanRequestedToInflight(HoodieInstant requestedI
 return inflight;
   }
 
+  /**
+   * Transition Clean State from inflight to Committed.

Review comment:
   s/Clean/Replace









[GitHub] [hudi] n3nash commented on a change in pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-21 Thread GitBox


n3nash commented on a change in pull request #1853:
URL: https://github.com/apache/hudi/pull/1853#discussion_r458475548



##
File path: hudi-common/src/main/avro/HoodieReplaceMetadata.avsc
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+ /*
+  * Note that all 'replace' instants are read for every query
+  * So it is important to keep this small. Please be careful
+  * before tracking additional information in this file.
+  * This will be used for 'insert_overwrite' (RFC-18) and also 'clustering' 
(RFC-19)
+  */
+{"namespace": "org.apache.hudi.avro.model",
+ "type": "record",
+ "name": "HoodieReplaceMetadata",
+ "fields": [
+ {"name": "totalFilesReplaced", "type": "int"},
+ {"name": "command", "type": "string"},
+ {"name": "partitionMetadata", "type": {

Review comment:
   Does the partitionMetadata contain partitionPath -> new file groups?









[GitHub] [hudi] yanghua commented on a change in pull request #1770: [HUDI-708]Add temps show and unit test for TempViewCommand

2020-07-21 Thread GitBox


yanghua commented on a change in pull request #1770:
URL: https://github.com/apache/hudi/pull/1770#discussion_r458466053



##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/TempViewCommand.java
##
@@ -20,36 +20,55 @@
 
 import org.apache.hudi.cli.HoodieCLI;
 
+import org.apache.hudi.exception.HoodieException;
 import org.springframework.shell.core.CommandMarker;
 import org.springframework.shell.core.annotation.CliCommand;
 import org.springframework.shell.core.annotation.CliOption;
 import org.springframework.stereotype.Component;
 
-import java.io.IOException;
-
 /**
  * CLI command to query/delete temp views.
  */
 @Component
 public class TempViewCommand implements CommandMarker {
 
-  private static final String EMPTY_STRING = "";
+  public static final String QUERY_SUCCESS = "Query ran successfully!";
+  public static final String QUERY_FAIL = "Query ran failed!";
+  public static final String SHOW_SUCCESS = "Show all views name 
successfully!";
 
-  @CliCommand(value = "temp_query", help = "query against created temp view")
+  @CliCommand(value = {"temp_query", "temp query"}, help = "query against 
created temp view")
   public String query(
-  @CliOption(key = {"sql"}, mandatory = true, help = "select query to 
run against view") final String sql)
-  throws IOException {
+  @CliOption(key = {"sql"}, mandatory = true, help = "select query to 
run against view") final String sql) {
+
+try {
+  HoodieCLI.getTempViewProvider().runQuery(sql);
+  return QUERY_SUCCESS;
+} catch (HoodieException ex) {
+  return QUERY_FAIL;
+}
+
+  }
+
+  @CliCommand(value = {"temps_show", "temps show"}, help = "Show all views 
name")
+  public String showAll() {
 
-HoodieCLI.getTempViewProvider().runQuery(sql);
-return EMPTY_STRING;
+try {
+  HoodieCLI.getTempViewProvider().showAllViews();
+  return SHOW_SUCCESS;
+} catch (HoodieException ex) {
+  return "Show all views name failed!";
+}
   }
 
-  @CliCommand(value = "temp_delete", help = "Delete view name")
+  @CliCommand(value = {"temp_delete", "temp delete"}, help = "Delete view 
name")
   public String delete(
-  @CliOption(key = {"view"}, mandatory = true, help = "view name") 
final String tableName)
-  throws IOException {
+  @CliOption(key = {"view"}, mandatory = true, help = "view name") 
final String tableName) {
 
-HoodieCLI.getTempViewProvider().deleteTable(tableName);
-return EMPTY_STRING;
+try {
+  HoodieCLI.getTempViewProvider().deleteTable(tableName);
+  return String.format("Delete view %s successfully!", tableName);
+} catch (HoodieException ex) {
+  return String.format("Delete view %s failed!", tableName);

Review comment:
   ditto

##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/TempViewCommand.java
##
@@ -20,36 +20,55 @@
 
 import org.apache.hudi.cli.HoodieCLI;
 
+import org.apache.hudi.exception.HoodieException;
 import org.springframework.shell.core.CommandMarker;
 import org.springframework.shell.core.annotation.CliCommand;
 import org.springframework.shell.core.annotation.CliOption;
 import org.springframework.stereotype.Component;
 
-import java.io.IOException;
-
 /**
  * CLI command to query/delete temp views.
  */
 @Component
 public class TempViewCommand implements CommandMarker {
 
-  private static final String EMPTY_STRING = "";
+  public static final String QUERY_SUCCESS = "Query ran successfully!";
+  public static final String QUERY_FAIL = "Query ran failed!";
+  public static final String SHOW_SUCCESS = "Show all views name 
successfully!";
 
-  @CliCommand(value = "temp_query", help = "query against created temp view")
+  @CliCommand(value = {"temp_query", "temp query"}, help = "query against 
created temp view")
   public String query(
-  @CliOption(key = {"sql"}, mandatory = true, help = "select query to 
run against view") final String sql)
-  throws IOException {
+  @CliOption(key = {"sql"}, mandatory = true, help = "select query to 
run against view") final String sql) {
+
+try {
+  HoodieCLI.getTempViewProvider().runQuery(sql);
+  return QUERY_SUCCESS;
+} catch (HoodieException ex) {
+  return QUERY_FAIL;
+}
+
+  }
+
+  @CliCommand(value = {"temps_show", "temps show"}, help = "Show all views 
name")
+  public String showAll() {
 
-HoodieCLI.getTempViewProvider().runQuery(sql);
-return EMPTY_STRING;
+try {
+  HoodieCLI.getTempViewProvider().showAllViews();
+  return SHOW_SUCCESS;
+} catch (HoodieException ex) {
+  return "Show all views name failed!";

Review comment:
   ditto

##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/TempViewCommand.java
##
@@ -20,36 +20,55 @@
 
 import org.apache.hudi.cli.HoodieCLI;
 
+import org.apache.hudi.exception.HoodieException;
 import org.springframework.shell.core.CommandMarker;
 i

[GitHub] [hudi] bvaradar commented on issue #1852: [SUPPORT]

2020-07-21 Thread GitBox


bvaradar commented on issue #1852:
URL: https://github.com/apache/hudi/issues/1852#issuecomment-662177092


   MacBook-Pro:hudi balaji.varadarajan$ grep -c '\.clean.requested'  
~/Downloads/dot_hoodie_folder.txt 
   16
   MacBook-Pro:hudi balaji.varadarajan$ grep -c '\.deltacommit.requested'  
~/Downloads/dot_hoodie_folder.txt 
   266
   MacBook-Pro:hudi balaji.varadarajan$ grep -c '\.compaction.requested'  
~/Downloads/dot_hoodie_folder.txt 
   44
   MacBook-Pro:hudi balaji.varadarajan$ grep -c '\.compaction.inflight'  
~/Downloads/dot_hoodie_folder.txt 
   44
   MacBook-Pro:hudi balaji.varadarajan$ grep -c '\.commit'  
~/Downloads/dot_hoodie_folder.txt 
   19
   
   1. I can see that there are many compactions in inflight status that have not 
completed. Can you apply this patch: 
https://github.com/apache/hudi/pull/1857 to retry failed compactions 
automatically?
   2. Those 2 warnings are fine, and one of them implies that the embedded 
timeline-server is on. Can you provide the stack trace from which you deduced that 
most of the time is spent on listing?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha opened a new pull request #1859: [HUDI-1072] Use replace metadata file to filter excluded files in views

2020-07-21 Thread GitBox


satishkotha opened a new pull request #1859:
URL: https://github.com/apache/hudi/pull/1859


   
   
   ## What is the purpose of the pull request
   Follow up on #1853.
   Use the replace metadata to filter excluded files from views.
   
   Changed the base views. If the general approach looks good, I can update the 
RocksDB and spillable view implementations.
   
   ## Brief change log
   Add new methods in Abstract view to filter files excluded by replace commits
   
   ## Verify this pull request
   Added unit tests
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-21 Thread GitBox


bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-662170951


   With 0.5.[1/2], Hudi stopped using renames for state transition. Hence, you 
are seeing separate state files for each action. All these files (except 
rollback) will be cleaned up as part of archiving. 
   
   For rollback, here is the tracking ticket: 
https://issues.apache.org/jira/browse/HUDI-1118.
   
   Would you mind listing .hoodie completely and providing it as a link? Note that 
if there are any pending compactions, archiving will not archive any commits 
before the earliest pending compaction. Also, note that the min/max values are 
per action (commit, clean, ...). 
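
   To make that archiving rule concrete, here is a rough standalone sketch (types and 
helper names are hypothetical, not Hudi APIs) of filtering out instants that are not 
older than the earliest pending compaction:

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   public class ArchivableCommitsExample {

     // Hypothetical instant: a timestamp plus whether it is a pending compaction.
     static class Instant {
       final String timestamp;
       final boolean pendingCompaction;
       Instant(String timestamp, boolean pendingCompaction) {
         this.timestamp = timestamp;
         this.pendingCompaction = pendingCompaction;
       }
     }

     // Only instants strictly before the earliest pending compaction may be archived.
     static List<String> archivable(List<Instant> timeline) {
       String earliestPending = null;
       for (Instant i : timeline) {
         if (i.pendingCompaction && (earliestPending == null || i.timestamp.compareTo(earliestPending) < 0)) {
           earliestPending = i.timestamp;
         }
       }
       List<String> result = new ArrayList<>();
       for (Instant i : timeline) {
         if (!i.pendingCompaction && (earliestPending == null || i.timestamp.compareTo(earliestPending) < 0)) {
           result.add(i.timestamp);
         }
       }
       return result;
     }

     public static void main(String[] args) {
       List<Instant> timeline = Arrays.asList(
           new Instant("001", false), new Instant("002", true), new Instant("003", false));
       System.out.println(archivable(timeline)); // prints [001]; 003 is blocked by the pending compaction at 002
     }
   }
   ```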



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1118) Cleanup rollback files residing in .hoodie folder

2020-07-21 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1118:
-
Fix Version/s: (was: 0.6.1)
   0.6.0

> Cleanup rollback files residing in .hoodie folder
> -
>
> Key: HUDI-1118
> URL: https://issues.apache.org/jira/browse/HUDI-1118
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Hudi Archiving takes care of archiving all metadata files in .hoodie folder 
> except rollback files. Rollback metadata also needs to cleanup in the same 
> way as others.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1118) Cleanup rollback files residing in .hoodie folder

2020-07-21 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1118:
-
Status: Open  (was: New)

> Cleanup rollback files residing in .hoodie folder
> -
>
> Key: HUDI-1118
> URL: https://issues.apache.org/jira/browse/HUDI-1118
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Hudi Archiving takes care of archiving all metadata files in .hoodie folder 
> except rollback files. Rollback metadata also needs to cleanup in the same 
> way as others.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1118) Cleanup rollback files residing in .hoodie folder

2020-07-21 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1118:


 Summary: Cleanup rollback files residing in .hoodie folder
 Key: HUDI-1118
 URL: https://issues.apache.org/jira/browse/HUDI-1118
 Project: Apache Hudi
  Issue Type: Task
  Components: Writer Core
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


Hudi Archiving takes care of archiving all metadata files in .hoodie folder 
except rollback files. Rollback metadata also needs to cleanup in the same way 
as others.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1117) Add tdunning json library to spark and utilities bundle

2020-07-21 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162402#comment-17162402
 ] 

Balaji Varadarajan edited comment on HUDI-1117 at 7/22/20, 12:07 AM:
-

This can also potentially be solved by including the hive-exec package inside 
the bundle, but that is more risky and involves more testing. 


was (Author: vbalaji):
This can also potentially be solved by including the hive-exec package inside 
the bundle, but that involves more testing. 

> Add tdunning json library to spark and utilities bundle
> ---
>
> Key: HUDI-1117
> URL: https://issues.apache.org/jira/browse/HUDI-1117
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Exception during Hive Sync:
> ```
> An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: 
> org/json/JSONException\n\tat 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat
>  org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat
>  org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat
> ```
> This is from using hudi-spark-bundle. 
> [https://github.com/apache/hudi/issues/1787]
> JSONException class is coming from 
> https://mvnrepository.com/artifact/org.json/json There is licensing issue and 
> hence not part of hudi bundle packages. The underlying issue is due to Hive 
> 1.x vs 2.x ( See 
> https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)
> Spark Hive integration still brings in hive 1.x jars which depends on 
> org.json. I believe this was provided in user's environment and hence we have 
> not seen folks complaining about this issue.
> Even though this is not Hudi issue per se, let me check a jar with compatible 
> license : https://mvnrepository.com/artifact/com.tdunning/json/1.8 and if it 
> works, we will add to 0.6 bundles after discussing with community. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1117) Add tdunning json library to spark and utilities bundle

2020-07-21 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162402#comment-17162402
 ] 

Balaji Varadarajan commented on HUDI-1117:
--

This can also potentially be solved by including the hive-exec package inside 
the bundle, but that involves more testing. 

> Add tdunning json library to spark and utilities bundle
> ---
>
> Key: HUDI-1117
> URL: https://issues.apache.org/jira/browse/HUDI-1117
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Exception during Hive Sync:
> ```
> An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: 
> org/json/JSONException\n\tat 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat
>  org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat
>  org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat
> ```
> This is from using hudi-spark-bundle. 
> [https://github.com/apache/hudi/issues/1787]
> JSONException class is coming from 
> https://mvnrepository.com/artifact/org.json/json There is licensing issue and 
> hence not part of hudi bundle packages. The underlying issue is due to Hive 
> 1.x vs 2.x ( See 
> https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)
> Spark Hive integration still brings in hive 1.x jars which depends on 
> org.json. I believe this was provided in user's environment and hence we have 
> not seen folks complaining about this issue.
> Even though this is not Hudi issue per se, let me check a jar with compatible 
> license : https://mvnrepository.com/artifact/com.tdunning/json/1.8 and if it 
> works, we will add to 0.6 bundles after discussing with community. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1117) Add tdunning json library to spark and utilities bundle

2020-07-21 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1117:


Assignee: Balaji Varadarajan

> Add tdunning json library to spark and utilities bundle
> ---
>
> Key: HUDI-1117
> URL: https://issues.apache.org/jira/browse/HUDI-1117
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Exception during Hive Sync:
> ```
> An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: 
> org/json/JSONException\n\tat 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat
>  org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat
>  org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat
> ```
> This is from using hudi-spark-bundle. 
> [https://github.com/apache/hudi/issues/1787]
> JSONException class is coming from 
> https://mvnrepository.com/artifact/org.json/json There is licensing issue and 
> hence not part of hudi bundle packages. The underlying issue is due to Hive 
> 1.x vs 2.x ( See 
> https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)
> Spark Hive integration still brings in hive 1.x jars which depends on 
> org.json. I believe this was provided in user's environment and hence we have 
> not seen folks complaining about this issue.
> Even though this is not Hudi issue per se, let me check a jar with compatible 
> license : https://mvnrepository.com/artifact/com.tdunning/json/1.8 and if it 
> works, we will add to 0.6 bundles after discussing with community. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar commented on issue #1787: Exception During Insert

2020-07-21 Thread GitBox


bvaradar commented on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-662166742


   @asheeshgarg : The JSONException class is coming from 
https://mvnrepository.com/artifact/org.json/json. There is a licensing issue, and 
hence it is not part of the Hudi bundle packages. The underlying issue is due to Hive 
1.x vs 2.x (see 
https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)
  
   
   Spark's Hive integration still brings in Hive 1.x jars, which depend on 
org.json. I believe this was provided in the user's environment, and hence we have 
not seen folks complaining about this issue. 
   
   Even though this is not a Hudi issue per se, let me check a jar with a 
compatible license: https://mvnrepository.com/artifact/com.tdunning/json/1.8 
and, if it works, we will add it to the 0.6 bundles after discussing with the community. 
Meanwhile, can you add the json jar ( 
https://mvnrepository.com/artifact/com.tdunning/json/1.8 or 
https://mvnrepository.com/artifact/org.json/json) to your classpath? This 
should resolve the issue.
   
   Tracking Jira: https://issues.apache.org/jira/browse/HUDI-1117
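   
   A quick way to confirm whether the org.json classes are actually visible on 
your driver classpath (an illustrative standalone check, not part of Hudi):
   
   ```java
   public class JsonOnClasspathCheck {
     public static void main(String[] args) {
       try {
         // Resolves only if an org.json implementation (org.json:json or com.tdunning:json) is on the classpath.
         Class.forName("org.json.JSONException");
         System.out.println("org.json.JSONException found on the classpath");
       } catch (ClassNotFoundException e) {
         System.out.println("org.json.JSONException NOT found - add one of the json jars to the classpath");
       }
     }
   }
   ```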



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1117) Add tdunning json library to spark and utilities bundle

2020-07-21 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1117:
-
Status: Open  (was: New)

> Add tdunning json library to spark and utilities bundle
> ---
>
> Key: HUDI-1117
> URL: https://issues.apache.org/jira/browse/HUDI-1117
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Exception during Hive Sync:
> ```
> An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: 
> org/json/JSONException\n\tat 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat
>  org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat
>  org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat
> ```
> This is from using hudi-spark-bundle. 
> [https://github.com/apache/hudi/issues/1787]
> JSONException class is coming from 
> https://mvnrepository.com/artifact/org.json/json There is licensing issue and 
> hence not part of hudi bundle packages. The underlying issue is due to Hive 
> 1.x vs 2.x ( See 
> https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)
> Spark Hive integration still brings in hive 1.x jars which depends on 
> org.json. I believe this was provided in user's environment and hence we have 
> not seen folks complaining about this issue.
> Even though this is not Hudi issue per se, let me check a jar with compatible 
> license : https://mvnrepository.com/artifact/com.tdunning/json/1.8 and if it 
> works, we will add to 0.6 bundles after discussing with community. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1117) Add tdunning json library to spark and utilities bundle

2020-07-21 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1117:


 Summary: Add tdunning json library to spark and utilities bundle
 Key: HUDI-1117
 URL: https://issues.apache.org/jira/browse/HUDI-1117
 Project: Apache Hudi
  Issue Type: Task
  Components: Spark Integration
Reporter: Balaji Varadarajan
 Fix For: 0.6.0


Exception during Hive Sync:

```

An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: 
org/json/JSONException\n\tat 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat
 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat
 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat
 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat
 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat
 org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat 
org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat 
org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat 
org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat 
org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat 
org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat 
org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat
 
org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat
 
org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat
 
org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat
 org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat

```

This is from using hudi-spark-bundle. 
[https://github.com/apache/hudi/issues/1787]

JSONException class is coming from 
https://mvnrepository.com/artifact/org.json/json There is licensing issue and 
hence not part of hudi bundle packages. The underlying issue is due to Hive 1.x 
vs 2.x ( See 
https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)

Spark Hive integration still brings in hive 1.x jars which depends on org.json. 
I believe this was provided in user's environment and hence we have not seen 
folks complaining about this issue.

Even though this is not Hudi issue per se, let me check a jar with compatible 
license : https://mvnrepository.com/artifact/com.tdunning/json/1.8 and if it 
works, we will add to 0.6 bundles after discussing with community. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yihua commented on a change in pull request #1149: [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-07-21 Thread GitBox


yihua commented on a change in pull request #1149:
URL: https://github.com/apache/hudi/pull/1149#discussion_r458381933



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -245,6 +250,16 @@ public int getMaxConsistencyCheckIntervalMs() {
 return 
Integer.parseInt(props.getProperty(MAX_CONSISTENCY_CHECK_INTERVAL_MS_PROP));
   }
 
+  public BulkInsertSortMode getBulkInsertSortMode() {
+String sortMode = props.getProperty(BULKINSERT_SORT_MODE);
+try {
+  return BulkInsertSortMode.valueOf(sortMode.toUpperCase());
+} catch (IllegalArgumentException e) {

Review comment:
   Makes sense.
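
   For context, a minimal standalone sketch of the enum-parsing-with-default pattern 
used in the snippet above (the enum values and default below are illustrative 
assumptions, not the exact Hudi configuration):

   ```java
   import java.util.Locale;

   public class SortModeExample {
     // Illustrative enum; the real Hudi BulkInsertSortMode values may differ.
     enum SortMode { NONE, GLOBAL_SORT, PARTITION_SORT }

     // Parse a user-supplied mode name case-insensitively, falling back to a default
     // (NONE here, purely as an assumption) when the value is missing or unrecognized.
     static SortMode parseSortMode(String value, SortMode defaultMode) {
       if (value == null) {
         return defaultMode;
       }
       try {
         return SortMode.valueOf(value.toUpperCase(Locale.ROOT));
       } catch (IllegalArgumentException e) {
         return defaultMode;
       }
     }

     public static void main(String[] args) {
       System.out.println(parseSortMode("global_sort", SortMode.NONE)); // GLOBAL_SORT
       System.out.println(parseSortMode("bogus", SortMode.NONE));       // NONE
     }
   }
   ```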

##
File path: 
hudi-client/src/main/java/org/apache/hudi/execution/CopyOnWriteInsertHandler.java
##
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.execution.LazyInsertIterable.HoodieInsertValueGenResult;
+import org.apache.hudi.io.HoodieWriteHandle;
+import org.apache.hudi.io.WriteHandleFactory;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Consumes stream of hoodie records from in-memory queue and writes to one or 
more create-handles.
+ */
+public class CopyOnWriteInsertHandler
+extends

Review comment:
   Fixed.  It's due to my IDE's hard wrap at 100 chars.  

##
File path: 
hudi-client/src/main/java/org/apache/hudi/execution/CopyOnWriteInsertHandler.java
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.execution.LazyInsertIterable.HoodieInsertValueGenResult;
+import org.apache.hudi.io.HoodieWriteHandle;
+import org.apache.hudi.io.WriteHandleFactory;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Consumes stream of hoodie records from in-memory queue and writes to one or 
more create-handles.
+ */
+public class CopyOnWriteInsertHandler
+extends
+BoundedInMemoryQueueConsumer, 
List> {
+
+  protected HoodieWriteConfig config;
+  protected String instantTime;
+  protected HoodieTable hoodieTable;
+  protected String idPrefix;
+  protected int numFilesWritten;
+  protected SparkTaskContextSupplier sparkTaskContextSupplier;
+  protected WriteHandleFactory writeHandleFactory;
+
+  protected final List statuses = new ArrayList<>();
+  protected Map handles = new HashMap<>();
+
+  public CopyOnWriteInsertHandler(
+  HoodieWriteConfig config, String instantTime, HoodieTable 
hoodieTable, String idPrefix,
+  SparkTaskContextSupplier sparkTa

[GitHub] [hudi] vinothchandar commented on a change in pull request #1858: [WIP] [1014] Part 1: Adding Upgrade or downgrade infra

2020-07-21 Thread GitBox


vinothchandar commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r458412092



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/UpgradeDowngradeHelper.java
##
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTableVersion;
+import org.apache.hudi.exception.HoodieException;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+
+import java.io.IOException;
+import java.util.Properties;
+
+/**
+ * Helper class to assist in upgrading/downgrading Hoodie when there is a 
version change.
+ */
+public class UpgradeDowngradeHelper {
+
+  public static final String HOODIE_ORIG_PROPERTY_FILE = 
"hoodie.properties.orig";
+
+  /**
+   * Perform Upgrade or Downgrade steps if required and updated table version 
if need be.
+   * 
+   * Starting from version 0.6.0, this upgrade/downgrade step will be added in 
all write paths.
+   * Essentially, if a dataset was created using any pre 0.6.0(for eg 0.5.3), 
and Hoodie verion was upgraded to 0.6.0, there are some upgrade steps need
+   * to be executed before doing any writes.
+   * Similarly, if a dataset was created using 0.6.0 and then hoodie was 
downgraded, some downgrade steps need to be executed before proceeding w/ any 
writes.
+   * On a high level, these are the steps performed
+   * Step1 : Understand current hoodie version and table version from 
hoodie.properties file
+   * Step2 : Fix any residues from previous upgrade/downgrade
+   * Step3 : Check for version upgrade/downgrade.
+   * Step4 : If upgrade/downgrade is required, perform the steps required for 
the same.
+   * Step5 : Copy hoodie.properties -> hoodie.properties.orig
+   * Step6 : Update hoodie.properties file with current table version
+   * Step7 : Delete hoodie.properties.orig
+   * 
+   * @param metaClient instance of {@link HoodieTableMetaClient} to use
+   * @throws IOException
+   */
+  public static void doUpgradeOrDowngrade(HoodieTableMetaClient metaClient) 
throws IOException {
+// Fetch version from property file and current version
+HoodieTableVersion versionFromPropertyFile = 
metaClient.getTableConfig().getHoodieTableVersionFromPropertyFile();
+HoodieTableVersion currentVersion = 
metaClient.getTableConfig().getCurrentHoodieTableVersion();
+
+Path metaPath = new Path(metaClient.getMetaPath());
+Path originalHoodiePropertyPath = 
getOrigHoodiePropertyFilePath(metaPath.toString());
+
+boolean updateTableVersionInPropertyFile = false;
+
+if (metaClient.getFs().exists(originalHoodiePropertyPath)) {
+  // if hoodie.properties.orig exists, rename to hoodie.properties and 
skip upgrade/downgrade step
+  metaClient.getFs().rename(originalHoodiePropertyPath, 
getHoodiePropertyFilePath(metaPath.toString()));
+  updateTableVersionInPropertyFile = true;
+} else {
+  // upgrade or downgrade if there is a version mismatch
+  if (versionFromPropertyFile != currentVersion) {
+updateTableVersionInPropertyFile = true;
+if (versionFromPropertyFile == HoodieTableVersion.PRE_ZERO_SIZE_ZERO 
&& currentVersion == HoodieTableVersion.ZERO_SIX_ZERO) {
+  upgradeFromOlderToZeroSixZero();
+} else if (versionFromPropertyFile == HoodieTableVersion.ZERO_SIX_ZERO 
&& currentVersion == HoodieTableVersion.PRE_ZERO_SIZE_ZERO) {
+  downgradeFromZeroSixZeroToPreZeroSixZero();
+} else {
+  throw new HoodieException("Illegal state wrt table versions. Version 
from proerpty file " + versionFromPropertyFile + " and current version " + 
currentVersion);
+}
+  }
+}
+
+/**
+ * If table version needs to be updated in hoodie.properties file.
+ * Step1: Copy hoodie.properties to hoodie.properties.orig
+ * Step2: add table.version to hoodie.properties
+ * Step3: delete hoodie.properties.orig
+ */
+if (updateTableVer
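
For reference, the backup-and-swap sequence described in steps 5-7 of the javadoc 
above can be sketched standalone as follows (the paths and the property name are 
assumptions for illustration; this is not the Hudi implementation):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

public class PropertiesUpgradeSketch {
  public static void main(String[] args) throws Exception {
    Path props = Paths.get("/tmp/.hoodie/hoodie.properties");      // assumed location
    Path orig = Paths.get("/tmp/.hoodie/hoodie.properties.orig");  // backup copy

    // Step 5: copy hoodie.properties -> hoodie.properties.orig
    Files.copy(props, orig, StandardCopyOption.REPLACE_EXISTING);

    // Step 6: rewrite hoodie.properties with the new table version
    Properties p = new Properties();
    try (InputStream in = Files.newInputStream(props)) {
      p.load(in);
    }
    p.setProperty("hoodie.table.version", "1"); // property name/value are assumptions
    try (OutputStream out = Files.newOutputStream(props)) {
      p.store(out, "updated table version");
    }

    // Step 7: delete the backup once the rewrite succeeded
    Files.delete(orig);
  }
}
```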

[GitHub] [hudi] vinothchandar commented on a change in pull request #1858: [WIP] [1014] Part 1: Adding Upgrade or downgrade infra

2020-07-21 Thread GitBox


vinothchandar commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r458411863



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##
@@ -190,6 +192,7 @@ public HoodieMetrics getMetrics() {
*/
   protected HoodieTable getTableAndInitCtx(WriteOperationType operationType) {
 HoodieTableMetaClient metaClient = createMetaClient(true);
+mayBeUpradeOrDowngrade(metaClient);

Review comment:
   looks ok. Mostly sure





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1858: [WIP] [1014] Part 1: Adding Upgrade or downgrade infra

2020-07-21 Thread GitBox


nsivabalan commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r458365915



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##
@@ -190,6 +192,7 @@ public HoodieMetrics getMetrics() {
*/
   protected HoodieTable getTableAndInitCtx(WriteOperationType operationType) {
 HoodieTableMetaClient metaClient = createMetaClient(true);
+mayBeUpradeOrDowngrade(metaClient);

Review comment:
   @vinothchandar : is this the right place to call upgrade/downgrade? If 
not, please advise. 

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/UpgradeDowngradeHelper.java
##
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTableVersion;
+import org.apache.hudi.exception.HoodieException;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+
+import java.io.IOException;
+import java.util.Properties;
+
+/**
+ * Helper class to assist in upgrading/downgrading Hoodie when there is a 
version change.
+ */
+public class UpgradeDowngradeHelper {
+
+  public static final String HOODIE_ORIG_PROPERTY_FILE = 
"hoodie.properties.orig";
+
+  /**
+   * Perform Upgrade or Downgrade steps if required and updated table version 
if need be.
+   * 
+   * Starting from version 0.6.0, this upgrade/downgrade step will be added in 
all write paths.
+   * Essentially, if a dataset was created using any pre 0.6.0(for eg 0.5.3), 
and Hoodie verion was upgraded to 0.6.0, there are some upgrade steps need
+   * to be executed before doing any writes.
+   * Similarly, if a dataset was created using 0.6.0 and then hoodie was 
downgraded, some downgrade steps need to be executed before proceeding w/ any 
writes.
+   * On a high level, these are the steps performed
+   * Step1 : Understand current hoodie version and table version from 
hoodie.properties file
+   * Step2 : Fix any residues from previous upgrade/downgrade
+   * Step3 : Check for version upgrade/downgrade.
+   * Step4 : If upgrade/downgrade is required, perform the steps required for 
the same.
+   * Step5 : Copy hoodie.properties -> hoodie.properties.orig
+   * Step6 : Update hoodie.properties file with current table version
+   * Step7 : Delete hoodie.properties.orig
+   * 
+   * @param metaClient instance of {@link HoodieTableMetaClient} to use
+   * @throws IOException
+   */
+  public static void doUpgradeOrDowngrade(HoodieTableMetaClient metaClient) 
throws IOException {
+// Fetch version from property file and current version
+HoodieTableVersion versionFromPropertyFile = 
metaClient.getTableConfig().getHoodieTableVersionFromPropertyFile();
+HoodieTableVersion currentVersion = 
metaClient.getTableConfig().getCurrentHoodieTableVersion();
+
+Path metaPath = new Path(metaClient.getMetaPath());
+Path originalHoodiePropertyPath = 
getOrigHoodiePropertyFilePath(metaPath.toString());
+
+boolean updateTableVersionInPropertyFile = false;
+
+if (metaClient.getFs().exists(originalHoodiePropertyPath)) {
+  // if hoodie.properties.orig exists, rename to hoodie.properties and 
skip upgrade/downgrade step
+  metaClient.getFs().rename(originalHoodiePropertyPath, 
getHoodiePropertyFilePath(metaPath.toString()));
+  updateTableVersionInPropertyFile = true;
+} else {
+  // upgrade or downgrade if there is a version mismatch
+  if (versionFromPropertyFile != currentVersion) {
+updateTableVersionInPropertyFile = true;
+if (versionFromPropertyFile == HoodieTableVersion.PRE_ZERO_SIZE_ZERO 
&& currentVersion == HoodieTableVersion.ZERO_SIX_ZERO) {
+  upgradeFromOlderToZeroSixZero();
+} else if (versionFromPropertyFile == HoodieTableVersion.ZERO_SIX_ZERO 
&& currentVersion == HoodieTableVersion.PRE_ZERO_SIZE_ZERO) {
+  downgradeFromZeroSixZeroToPreZeroSixZero();
+} else {
+ 

[GitHub] [hudi] vinothchandar commented on pull request #1765: [HUDI-1049] 0.5.3 Patch - In inline compaction mode, previously failed compactions needs to be retried before new compactions

2020-07-21 Thread GitBox


vinothchandar commented on pull request #1765:
URL: https://github.com/apache/hudi/pull/1765#issuecomment-662094985


   Closing this in favor of #1857 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar closed pull request #1765: [HUDI-1049] 0.5.3 Patch - In inline compaction mode, previously failed compactions needs to be retried before new compactions

2020-07-21 Thread GitBox


vinothchandar closed pull request #1765:
URL: https://github.com/apache/hudi/pull/1765


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan opened a new pull request #1858: [WIP] [1014] Part 1: Adding Upgrade or downgrade infra

2020-07-21 Thread GitBox


nsivabalan opened a new pull request #1858:
URL: https://github.com/apache/hudi/pull/1858


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-92) Include custom names for spark HUDI spark DAG stages for easier understanding

2020-07-21 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-92?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason resolved HUDI-92.

Resolution: Fixed

> Include custom names for spark HUDI spark DAG stages for easier understanding
> -
>
> Key: HUDI-92
> URL: https://issues.apache.org/jira/browse/HUDI-92
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: newbie, Usability
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-07-21 Thread GitBox


prashantwason commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r458355792



##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java
##
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+@SuppressWarnings("Duplicates")
+public class HoodieSortedMergeHandle extends 
HoodieMergeHandle {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSortedMergeHandle.class);
+
+  private Queue newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,
+   Iterator> recordItr, String partitionPath, String 
fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, 
sparkTaskContextSupplier);
+newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,
+  Map> keyToNewRecordsOrig, String partitionPath, 
String fileId,
+  HoodieBaseFile dataFileToBeMerged, SparkTaskContextSupplier 
sparkTaskContextSupplier) {
+super(config, instantTime, hoodieTable, keyToNewRecordsOrig, 
partitionPath, fileId, dataFileToBeMerged,
+sparkTaskContextSupplier);
+
+newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Go through an old record. Here if we detect a newer version shows up, we 
write the new one to the file.
+   */
+  @Override
+  public void write(GenericRecord oldRecord) {
+String key = 
oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+
+// To maintain overall sorted order across updates and inserts, write any 
new inserts whose keys are less than
+// the oldRecord's key.
+while (!newRecordKeysSorted.isEmpty() && 
newRecordKeysSorted.peek().compareTo(key) <= 0) {
+  String keyToPreWrite = newRecordKeysSorted.remove();
+  if (keyToPreWrite.equals(key)) {
+// will be handled as an update later
+break;
+  }
+
+  // This is a new insert
+  HoodieRecord hoodieRecord = new 
HoodieRecord<>(keyToNewRecords.get(keyToPreWrite));
+  if (writtenRecordKeys.contains(keyToPreWrite)) {
+throw new HoodieUpsertException("Insert/Update not in sorted order");
+  }
+  try {
+if (useWriterSchema) {
+  writeRecord(hoodieRecord, 
hoodieRecord.getData().getInsertValue(writerSchema));
+} else {
+  writeRecord(hoodieRecord, 
hoodieRecord.getData().getInsertValue(originalSchema));
+}
+insertRecordsWritten++;
+writtenRecordKeys.add(keyToPreWrite);
+  } catch (IOException e) {
+throw new HoodieUpsertException("Failed to write records", e);
+  }
+}
+
+super.write(oldRecord);
+  }
+
+  @Override
+  public WriteStatus close() {
+// write out any pending records (this can happen when inserts are turned 
into updates)
+newRecordKeysSorted.stream().sorted().forEach(key -> {

Review comment:
   Yes, sorting is not required here. It is leftover from an earlier version where I was 
not using a PriorityQueue for newRecordKeysSorted.
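
   For readers following the merge logic above, a minimal standalone sketch of the 
pattern (merging an already-sorted stream of existing keys with a PriorityQueue of 
new keys so the output stays globally sorted); class and variable names are 
illustrative, not Hudi code:

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;
   import java.util.PriorityQueue;
   import java.util.Queue;

   public class SortedMergeSketch {
     public static void main(String[] args) {
       // Existing (old) keys arrive already sorted; new keys sit in a priority queue.
       List<String> oldKeys = Arrays.asList("b", "d", "f");
       Queue<String> newKeys = new PriorityQueue<>(Arrays.asList("a", "d", "e", "z"));

       List<String> output = new ArrayList<>();
       for (String oldKey : oldKeys) {
         // Flush any pending new keys that sort before the current old key.
         while (!newKeys.isEmpty() && newKeys.peek().compareTo(oldKey) <= 0) {
           String next = newKeys.remove();
           if (next.equals(oldKey)) {
             break; // same key: handle as an update of the old record instead of an insert
           }
           output.add(next + " (insert)");
         }
         output.add(oldKey + " (existing/updated)");
       }
       // Any remaining new keys come after all existing keys; the queue already keeps them ordered.
       while (!newKeys.isEmpty()) {
         output.add(newKeys.remove() + " (insert)");
       }
       System.out.println(output);
       // [a (insert), b (existing/updated), d (existing/updated), e (insert), f (existing/updated), z (insert)]
     }
   }
   ```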





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the spe

[GitHub] [hudi] prashantwason commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-07-21 Thread GitBox


prashantwason commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r458355367



##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##
@@ -0,0 +1,301 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.PositionedReadable;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.FSDataInputStreamWrapper;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Pair;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.bloom.BloomFilterFactory;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+public class HoodieHFileReader implements 
HoodieFileReader {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieHFileReader.class);
+  private Path path;
+  private Configuration conf;
+  private HFile.Reader reader;
+  private Schema schema;
+
+  public static final String KEY_SCHEMA = "schema";
+  public static final String KEY_BLOOM_FILTER_META_BLOCK = "bloomFilter";
+  public static final String KEY_BLOOM_FILTER_TYPE_CODE = 
"bloomFilterTypeCode";
+  public static final String KEY_MIN_RECORD = "minRecordKey";
+  public static final String KEY_MAX_RECORD = "maxRecordKey";
+
+  public HoodieHFileReader(Configuration configuration, Path path, CacheConfig 
cacheConfig) throws IOException {
+this.conf = configuration;
+this.path = path;
+this.reader = HFile.createReader(FSUtils.getFs(path.toString(), 
configuration), path, cacheConfig, conf);
+  }
+
+  public HoodieHFileReader(byte[] content) throws IOException {
+Configuration conf = new Configuration();
+Path path = new Path("hoodie");
+SeekableByteArrayInputStream bis = new 
SeekableByteArrayInputStream(content);
+FSDataInputStream fsdis = new FSDataInputStream(bis);
+this.reader = HFile.createReader(FSUtils.getFs("hoodie", conf), path, new 
FSDataInputStreamWrapper(fsdis),

Review comment:
   Exception will be thrown here.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##
@@ -0,0 +1,301 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Link

[GitHub] [hudi] prashantwason commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-07-21 Thread GitBox


prashantwason commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r458355188



##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieFileReader.java
##
@@ -34,7 +35,17 @@
 
  public Set<String> filterRowKeys(Set<String> candidateRowKeys);
 
-  public Iterator<R> getRecordIterator(Schema schema) throws IOException;
+  public Iterator<R> getRecordIterator(Schema readerSchema) throws IOException;
+
+  default Iterator<R> getRecordIterator() throws IOException {
+    return getRecordIterator(getSchema());
+  }
+
+  public Option<R> getRecordByKey(String key, Schema readerSchema) throws IOException;

Review comment:
   We can introduce that config but how will it be used? 
   
   I feel the lookup by key is only useful for internal features (like RFC-15 
or RFC-08) rather than a generic API for HUDI. HUDI record keys tend to be 
UUIDs, which are large, and looking them up is not a common use case.
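   
   Just to illustrate the shape being discussed (a sketch only, not the PR's actual code; the generic type and `Option` come from the interface quoted above), an internal-facing keyed lookup could mirror the existing default overload:
   ```java
   import java.io.IOException;
   
   import org.apache.avro.Schema;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.hudi.common.util.Option;
   
   // Sketch only: mirrors the default getRecordIterator() overload above, so an
   // internal caller (e.g. RFC-15/RFC-08 style lookups) can fetch by key without
   // passing a reader schema explicitly.
   interface KeyedFileReaderSketch<R extends IndexedRecord> {
   
     Schema getSchema();
   
     Option<R> getRecordByKey(String key, Schema readerSchema) throws IOException;
   
     // Default overload that falls back to the schema stored in the file itself.
     default Option<R> getRecordByKey(String key) throws IOException {
       return getRecordByKey(key, getSchema());
     }
   }
   ```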
   
   


##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##
@@ -0,0 +1,301 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.PositionedReadable;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.FSDataInputStreamWrapper;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Pair;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.bloom.BloomFilterFactory;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+public class HoodieHFileReader implements 
HoodieFileReader {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieHFileReader.class);
+  private Path path;
+  private Configuration conf;
+  private HFile.Reader reader;
+  private Schema schema;
+
+  public static final String KEY_SCHEMA = "schema";
+  public static final String KEY_BLOOM_FILTER_META_BLOCK = "bloomFilter";
+  public static final String KEY_BLOOM_FILTER_TYPE_CODE = 
"bloomFilterTypeCode";
+  public static final String KEY_MIN_RECORD = "minRecordKey";
+  public static final String KEY_MAX_RECORD = "maxRecordKey";
+
+  public HoodieHFileReader(Configuration configuration, Path path, CacheConfig 
cacheConfig) throws IOException {
+this.conf = configuration;
+this.path = path;
+this.reader = HFile.createReader(FSUtils.getFs(path.toString(), 
configuration), path, cacheConfig, conf);
+  }
+
+  public HoodieHFileReader(byte[] content) throws IOException {
+Configuration conf = new Configuration();
+Path path = new Path("hoodie");
+SeekableByteArrayInputStream bis = new 
SeekableByteArrayInputStream(content);
+FSDataInputStream fsdis = new FSDataInputStream(bis);
+this.reader = HFile.createReader(FSUtils.getFs("hoodie", conf), path, new 
FSDataInputStreamWrapper(fsdis),
+content.length, new CacheConfig(conf), conf);
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+Map fileInfo;

Review comment:
   Updated.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##
@@ -0,0 +1,301 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor licen

[GitHub] [hudi] prashantwason commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-07-21 Thread GitBox


prashantwason commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r458355532



##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##
@@ -0,0 +1,301 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.PositionedReadable;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.FSDataInputStreamWrapper;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Pair;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.bloom.BloomFilterFactory;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+public class HoodieHFileReader implements 
HoodieFileReader {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieHFileReader.class);
+  private Path path;
+  private Configuration conf;
+  private HFile.Reader reader;
+  private Schema schema;
+
+  public static final String KEY_SCHEMA = "schema";
+  public static final String KEY_BLOOM_FILTER_META_BLOCK = "bloomFilter";
+  public static final String KEY_BLOOM_FILTER_TYPE_CODE = 
"bloomFilterTypeCode";
+  public static final String KEY_MIN_RECORD = "minRecordKey";
+  public static final String KEY_MAX_RECORD = "maxRecordKey";
+
+  public HoodieHFileReader(Configuration configuration, Path path, CacheConfig 
cacheConfig) throws IOException {
+this.conf = configuration;
+this.path = path;
+this.reader = HFile.createReader(FSUtils.getFs(path.toString(), 
configuration), path, cacheConfig, conf);
+  }
+
+  public HoodieHFileReader(byte[] content) throws IOException {
+Configuration conf = new Configuration();
+Path path = new Path("hoodie");
+SeekableByteArrayInputStream bis = new 
SeekableByteArrayInputStream(content);

Review comment:
   This constructor is creating an HFile.Reader from a byte array (the bytes 
of an HFile data block saved in a log file). The HFile.createReader constructor 
requires an FSDataInputStreamWrapper, which in turn requires an IO stream implementing 
the "Seekable" interface. 
   
   In other words, this is required for creating an HFile.Reader out of an 
in-memory byte array and is not related to the internals of the HFile reading 
logic.
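   
   For illustration, a minimal sketch of that kind of Seekable byte-array stream (the class name comes from the constructor quoted above; the method bodies here are an assumption, not necessarily the PR's exact code):
   ```java
   import java.io.ByteArrayInputStream;
   import java.io.IOException;
   
   import org.apache.hadoop.fs.PositionedReadable;
   import org.apache.hadoop.fs.Seekable;
   
   // Byte-array backed stream that can be wrapped in FSDataInputStream /
   // FSDataInputStreamWrapper because it implements Seekable and PositionedReadable.
   class SeekableByteArrayInputStream extends ByteArrayInputStream
       implements Seekable, PositionedReadable {
   
     SeekableByteArrayInputStream(byte[] bytes) {
       super(bytes);
     }
   
     @Override
     public long getPos() {
       return pos;
     }
   
     @Override
     public void seek(long newPos) throws IOException {
       if (newPos < 0 || newPos > count) {
         throw new IOException("Attempt to seek outside the in-memory buffer");
       }
       this.pos = (int) newPos;
     }
   
     @Override
     public boolean seekToNewSource(long targetPos) {
       return false; // single in-memory copy, no alternate source
     }
   
     @Override
     public int read(long position, byte[] buffer, int offset, int length) throws IOException {
       if (position >= count) {
         return -1;
       }
       int toRead = Math.min(length, count - (int) position);
       System.arraycopy(buf, (int) position, buffer, offset, toRead);
       return toRead;
     }
   
     @Override
     public void readFully(long position, byte[] buffer, int offset, int length) throws IOException {
       if (read(position, buffer, offset, length) < length) {
         throw new IOException("Could not read fully from the in-memory buffer");
       }
     }
   
     @Override
     public void readFully(long position, byte[] buffer) throws IOException {
       readFully(position, buffer, 0, buffer.length);
     }
   }
   ```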

##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##
@@ -0,0 +1,301 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io

[GitHub] [hudi] prashantwason commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-07-21 Thread GitBox


prashantwason commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r458354886



##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java
##
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+@SuppressWarnings("Duplicates")

Review comment:
   Added

##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
##
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.io.Writable;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/**
+ * HoodieHFileWriter writes IndexedRecords into an HFile. The record's key is 
used as the key and the
+ * AVRO encoded record bytes are saved as the value.
+ *
+ * Limitations (compared to columnar formats like Parquet or ORC):
+ *  1. Records should be added in order of keys
+ *  2. There are no column stats
+ */
+public class HoodieHFileWriter
+implements HoodieFileWriter {
+  private static AtomicLong recordIndex = new AtomicLong(1);
+  private static final Logger LOG = 
LogManager.getLogger(HoodieHFileWriter.class);
+
+  public static final String KEY_SCHEMA = "schema";
+  public static final String KEY_BLOOM_FILTER_META_BLOCK = "bloomFilter";
+  public static final String KEY_BLOOM_FILTER_TYPE_CODE = 
"bloomFilterTypeCode";
+  public static final String KEY_MIN_RECORD = "minRecordKey";
+  public static final String KEY_MAX_RECORD = "maxRecordKey";
+
+  private final Path file;
+  private HoodieHFileConfig hfileConfig;
+  private final HoodieWrapperFileSystem fs;
+  private final long maxFileSize;
+  private final String instantTime;
+  private 

[GitHub] [hudi] xushiyan commented on issue #1846: [SUPPORT] HoodieSnapshotCopier example

2020-07-21 Thread GitBox


xushiyan commented on issue #1846:
URL: https://github.com/apache/hudi/issues/1846#issuecomment-662059975


   @tooptoop4 actually you should be able to achieve that with 
`HoodieDeltaStreamer`: just point the source to the existing hudi table and 
write to another dir, and make sure to set the file size properties properly, which 
include but are not limited to 
https://hudi.apache.org/docs/configurations.html#limitFileSize (a rough sketch of 
such properties is shown below).
   
   So on the other side, we should also keep the scope of the Exporter's 
features from exploding...maybe it should be limited to back-up scenarios.
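   
   As a rough sketch of the file-sizing knobs referred to above (property names from the linked configurations page; the values are placeholders, not recommendations):
   ```java
   import java.util.Properties;
   
   // Hypothetical helper showing the kind of file-sizing properties one might
   // pass to DeltaStreamer, e.g. via its properties file.
   public class FileSizingPropsSketch {
     public static Properties fileSizingProps() {
       Properties props = new Properties();
       // Upper bound on base (parquet) file size -- the "limitFileSize" config.
       props.setProperty("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024));
       // Files below this size are treated as "small" and padded with new inserts.
       props.setProperty("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024));
       return props;
     }
   }
   ```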



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-07-21 Thread GitBox


prashantwason commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r458319202



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java
##
@@ -110,7 +110,7 @@ public long getLogBlockLength() {
* Type of the log block WARNING: This enum is serialized as the ordinal. 
Only add new enums at the end.
*/
   public enum HoodieLogBlockType {
-COMMAND_BLOCK, DELETE_BLOCK, CORRUPT_BLOCK, AVRO_DATA_BLOCK
+COMMAND_BLOCK, DELETE_BLOCK, CORRUPT_BLOCK, AVRO_DATA_BLOCK, 
HFILE_DATA_BLOCK

Review comment:
   Yes, a separate DELETE block is not required for HFile. The delete 
functionality is implemented independently of the data blocks, which only save 
record updates.
   
   DELETE_BLOCK saves record keys which have been deleted since. While reading 
the log blocks (HoodieMergedLogRecordScanner), if a DELETE block is encountered 
then we save an EmptyPayload, which represents a delete marker for the record. 
Such records won't be written out (compaction) or processed 
(RealtimeRecordReader), thereby representing a delete.
   
   >> So, we might have to fetch all values and resolve to the latest one to 
find if the value represents delete or active.
   Deleted records are never saved. Only deleted keys are saved within the 
DELETE block. 
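   
   Purely to illustrate those semantics (this is a toy model, not the scanner's actual code), the merge can be thought of as a key-to-latest-payload map where a DELETE block overwrites the entry with an empty marker:
   ```java
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.Optional;
   
   // Toy model of the merge: later blocks win for a key, and a DELETE block
   // leaves an empty marker so the key is skipped at compaction/read time.
   class LogMergeSketch {
     private final Map<String, Optional<byte[]>> latest = new HashMap<>();
   
     void applyDataBlock(Map<String, byte[]> records) {
       records.forEach((key, payload) -> latest.put(key, Optional.of(payload)));
     }
   
     void applyDeleteBlock(List<String> deletedKeys) {
       // Analogous to storing an EmptyHoodieRecordPayload for each deleted key.
       deletedKeys.forEach(key -> latest.put(key, Optional.empty()));
     }
   
     boolean isVisible(String key) {
       Optional<byte[]> payload = latest.get(key);
       return payload != null && payload.isPresent();
     }
   }
   ```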





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-767) Support transformation when export to Hudi

2020-07-21 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-767:
---

Assignee: (was: Raymond Xu)

> Support transformation when export to Hudi
> --
>
> Key: HUDI-767
> URL: https://issues.apache.org/jira/browse/HUDI-767
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.6.1
>
>
> Main logic described in 
> https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
> In HoodieSnapshotExporter, we could extend the feature to include 
> transformation when --output-format hudi, using a custom Transformer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xushiyan commented on issue #1846: [SUPPORT] HoodieSnapshotCopier example

2020-07-21 Thread GitBox


xushiyan commented on issue #1846:
URL: https://github.com/apache/hudi/issues/1846#issuecomment-662038478


   > @xushiyan I want to replace contents of existing table. ie read existing 
10k small files from tableA and replace tableA with 20 big files
   
   @tooptoop4 as I mentioned, HoodieSnapshotCopier is meant for a fresh back-up 
via plain copying. What you want is rewriting data, which involves a hoodie 
writer job, not available in HoodieSnapshotCopier. Nevertheless, it might make 
sense for `HoodieSnapshotExporter` to incorporate such a feature in the future.
   
   cc @vinothchandar @bvaradar Seems like ad-hoc transformation/re-writing is 
coming up more often. This resizing could be another use case? A related feature 
is noted here: https://issues.apache.org/jira/browse/HUDI-767



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 commented on issue #1846: [SUPPORT] HoodieSnapshotCopier example

2020-07-21 Thread GitBox


tooptoop4 commented on issue #1846:
URL: https://github.com/apache/hudi/issues/1846#issuecomment-662028709


   @xushiyan I want to replace the contents of an existing table, i.e. read the existing 10k 
small files from tableA and replace tableA with 20 big files



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar opened a new pull request #1857: [HUDI-1029] In inline compaction mode, previously failed compactions …

2020-07-21 Thread GitBox


vinothchandar opened a new pull request #1857:
URL: https://github.com/apache/hudi/pull/1857


   …needs to be retried before new compactions
   
 - Prevents failed compactions from causing issues with future commits
 - Need to add tests
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch hudi_test_suite_refactor updated (247d923 -> ea2c616)

2020-07-21 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard 247d923  [HUDI-394] Provide a basic implementation of test suite
 add ea2c616  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (247d923)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (ea2c616)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [hudi] xushiyan commented on issue #1846: [SUPPORT] HoodieSnapshotCopier example

2020-07-21 Thread GitBox


xushiyan commented on issue #1846:
URL: https://github.com/apache/hudi/issues/1846#issuecomment-662007289


   > Can I use it to read all 0.4.6 COW hoodie data from one path and write 
back into less files in 0.5.3 format on same path?
   
   IIUC, this is to perform a write operation from one table to another? That does not 
seem like a use case for HoodieSnapshotCopier, which is meant for back-up, 
i.e., copying an existing table to another empty dir for a fresh copy.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #1850: [HUDI-994] Move TestHoodieIndex test cases to unit tests

2020-07-21 Thread GitBox


vinothchandar merged pull request #1850:
URL: https://github.com/apache/hudi/pull/1850


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-994] Move TestHoodieIndex test cases to unit tests (#1850)

2020-07-21 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 5e7ab11  [HUDI-994] Move TestHoodieIndex test cases to unit tests 
(#1850)
5e7ab11 is described below

commit 5e7ab11e2ead23428ed5089421f2abd6433fe8e5
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Tue Jul 21 10:23:43 2020 -0700

[HUDI-994] Move TestHoodieIndex test cases to unit tests (#1850)
---
 .../org/apache/hudi/index/TestHoodieIndex.java | 155 +-
 .../apache/hudi/index/TestHoodieIndexConfigs.java  | 181 +
 2 files changed, 187 insertions(+), 149 deletions(-)

diff --git 
a/hudi-client/src/test/java/org/apache/hudi/index/TestHoodieIndex.java 
b/hudi-client/src/test/java/org/apache/hudi/index/TestHoodieIndex.java
index 66b8ae6..ab3e888 100644
--- a/hudi-client/src/test/java/org/apache/hudi/index/TestHoodieIndex.java
+++ b/hudi-client/src/test/java/org/apache/hudi/index/TestHoodieIndex.java
@@ -26,7 +26,6 @@ import org.apache.hudi.common.model.EmptyHoodieRecordPayload;
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodiePartitionMetadata;
 import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.common.model.HoodieRecordPayload;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
 import org.apache.hudi.common.table.view.FileSystemViewStorageType;
@@ -34,17 +33,10 @@ import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieCompactionConfig;
-import org.apache.hudi.config.HoodieHBaseIndexConfig;
 import org.apache.hudi.config.HoodieIndexConfig;
 import org.apache.hudi.config.HoodieStorageConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.exception.HoodieIndexException;
 import org.apache.hudi.index.HoodieIndex.IndexType;
-import org.apache.hudi.index.bloom.HoodieBloomIndex;
-import org.apache.hudi.index.bloom.HoodieGlobalBloomIndex;
-import org.apache.hudi.index.hbase.HBaseIndex;
-import org.apache.hudi.index.simple.HoodieSimpleIndex;
 import org.apache.hudi.table.HoodieTable;
 import org.apache.hudi.testutils.HoodieClientTestHarness;
 import org.apache.hudi.testutils.HoodieClientTestUtils;
@@ -55,9 +47,7 @@ import org.apache.avro.Schema;
 import org.apache.hadoop.fs.Path;
 import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.api.java.JavaSparkContext;
 import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.Test;
 import org.junit.jupiter.params.ParameterizedTest;
 import org.junit.jupiter.params.provider.EnumSource;
 
@@ -76,7 +66,6 @@ import scala.Tuple2;
 import static org.apache.hudi.testutils.Assertions.assertNoWriteErrors;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertFalse;
-import static org.junit.jupiter.api.Assertions.assertThrows;
 import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class TestHoodieIndex extends HoodieClientTestHarness {
@@ -85,22 +74,19 @@ public class TestHoodieIndex extends 
HoodieClientTestHarness {
   private IndexType indexType;
   private HoodieIndex index;
   private HoodieWriteConfig config;
-  private String schemaStr;
   private Schema schema;
 
   private void setUp(IndexType indexType) throws Exception {
-setUp(indexType, true);
-  }
-
-  private void setUp(IndexType indexType, boolean initializeIndex) throws 
Exception {
 this.indexType = indexType;
 initResources();
 // We have some records to be tagged (two different partitions)
-schemaStr = 
FileIOUtils.readAsUTFString(getClass().getResourceAsStream("/exampleSchema.txt"));
+String schemaStr = 
FileIOUtils.readAsUTFString(getClass().getResourceAsStream("/exampleSchema.txt"));
 schema = HoodieAvroUtils.addMetadataFields(new 
Schema.Parser().parse(schemaStr));
-if (initializeIndex) {
-  instantiateIndex();
-}
+config = getConfigBuilder()
+
.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(indexType)
+.build()).withAutoCommit(false).build();
+writeClient = getHoodieWriteClient(config);
+this.index = writeClient.getIndex();
   }
 
   @AfterEach
@@ -109,75 +95,6 @@ public class TestHoodieIndex extends 
HoodieClientTestHarness {
   }
 
   @ParameterizedTest
-  @EnumSource(value = IndexType.class, names = {"BLOOM", "GLOBAL_BLOOM", 
"SIMPLE", "GLOBAL_SIMPLE", "HBASE"})
-  public void testCreateIndex(IndexType indexType) throws Exception {
-setUp(indexType, false);
-HoodieWriteConfig.Builder clientConfigBuilder = 
Hood

[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458247881



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HudiMergeOnReadRDD.scala
##
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.log.{HoodieMergedLogRecordScanner, 
LogReaderUtils}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hadoop.config.{HadoopSerializableConfiguration, 
HoodieRealtimeConfig}
+import 
org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.HOODIE_RECORD_KEY_COL_POS
+
+import org.apache.avro.Schema
+import org.apache.hadoop.conf.Configuration
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.vectorized.ColumnarBatch
+
+import scala.collection.JavaConverters._
+import scala.util.Try
+
+case class HudiMergeOnReadPartition(index: Int, split: 
HudiMergeOnReadFileSplit) extends Partition
+
+class HudiMergeOnReadRDD(sc: SparkContext,
+ broadcastedConf: 
Broadcast[HadoopSerializableConfiguration],
+ dataReadFunction: PartitionedFile => Iterator[Any],
+ dataSchema: StructType,
+ hudiRealtimeFileSplits: 
List[HudiMergeOnReadFileSplit])
+  extends RDD[InternalRow](sc, Nil) {
+
+  // Broadcast the hadoop Configuration to executors.
+  def this(sc: SparkContext,
+   config: Configuration,
+   dataReadFunction: PartitionedFile => Iterator[Any],
+   dataSchema: StructType,
+   hudiRealtimeFileSplits: List[HudiMergeOnReadFileSplit]) = {
+this(
+  sc,
+  sc.broadcast(new HadoopSerializableConfiguration(config))
+  .asInstanceOf[Broadcast[HadoopSerializableConfiguration]],
+  dataReadFunction,
+  dataSchema,
+  hudiRealtimeFileSplits)
+  }
+
+  override def compute(split: Partition, context: TaskContext): 
Iterator[InternalRow] = {
+val mergeParquetPartition = split.asInstanceOf[HudiMergeOnReadPartition]
+mergeParquetPartition.split match {
+  case dataFileOnlySplit if dataFileOnlySplit.logPaths.isEmpty =>
+read(mergeParquetPartition.split.dataFile, dataReadFunction)
+  case unMergeSplit if unMergeSplit.skipMerge =>
+unMergeFileIterator(mergeParquetPartition.split, dataReadFunction)
+  case mergeSplit if !mergeSplit.skipMerge =>
+mergeFileIterator(mergeParquetPartition.split, dataReadFunction)
+  case _ => throw new HoodieException("Unable to select an Iterator to 
read the Hudi MOR File Split")
+}
+  }
+
+  override protected def getPartitions: Array[Partition] = {
+hudiRealtimeFileSplits.zipWithIndex.map(file => 
HudiMergeOnReadPartition(file._2, file._1)).toArray
+  }
+
+  private def getConfig(): Configuration = {
+broadcastedConf.value.config
+  }
+
+  private def read(partitionedFile: PartitionedFile,
+   readFileFunction: PartitionedFile => Iterator[Any]): 
Iterator[InternalRow] = {
+val fileIterator = readFileFunction(partitionedFile)
+val rows = fileIterator.flatMap(_ match {
+  case r: InternalRow => Seq(r)
+  case b: ColumnarBatch => b.rowIterator().asScala
+})
+rows
+  }
+
+  private def unMergeFileIterator(split: HudiMergeOnReadFileSplit,
+  readFileFunction: PartitionedFile => 
Iterator[Any]): Iterator[InternalRow] =
+new Iterator[InternalRow] {
+  private val dataFileIterator = read(split.dataFile, readFileFunction)
+  private val logSchema = getLogAvroSchema(split)
+  private val sparkTypes = 
SchemaConverters.toSqlType(logSchema).dataType.asInstanceOf[StructType]
+  private val converter = new AvroDeserializer(logSchema, sparkTypes)
+  private val hudiLogRecords = scanLog(split, 

[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458245661



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HudiMergeOnReadRDD.scala
##
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.log.{HoodieMergedLogRecordScanner, 
LogReaderUtils}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hadoop.config.{HadoopSerializableConfiguration, 
HoodieRealtimeConfig}
+import 
org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.HOODIE_RECORD_KEY_COL_POS
+
+import org.apache.avro.Schema
+import org.apache.hadoop.conf.Configuration
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.vectorized.ColumnarBatch
+
+import scala.collection.JavaConverters._
+import scala.util.Try
+
+case class HudiMergeOnReadPartition(index: Int, split: 
HudiMergeOnReadFileSplit) extends Partition
+
+class HudiMergeOnReadRDD(sc: SparkContext,
+ broadcastedConf: 
Broadcast[HadoopSerializableConfiguration],
+ dataReadFunction: PartitionedFile => Iterator[Any],
+ dataSchema: StructType,
+ hudiRealtimeFileSplits: 
List[HudiMergeOnReadFileSplit])
+  extends RDD[InternalRow](sc, Nil) {
+
+  // Broadcast the hadoop Configuration to executors.
+  def this(sc: SparkContext,
+   config: Configuration,
+   dataReadFunction: PartitionedFile => Iterator[Any],
+   dataSchema: StructType,
+   hudiRealtimeFileSplits: List[HudiMergeOnReadFileSplit]) = {
+this(
+  sc,
+  sc.broadcast(new HadoopSerializableConfiguration(config))
+  .asInstanceOf[Broadcast[HadoopSerializableConfiguration]],
+  dataReadFunction,
+  dataSchema,
+  hudiRealtimeFileSplits)
+  }
+
+  override def compute(split: Partition, context: TaskContext): 
Iterator[InternalRow] = {
+val mergeParquetPartition = split.asInstanceOf[HudiMergeOnReadPartition]
+mergeParquetPartition.split match {
+  case dataFileOnlySplit if dataFileOnlySplit.logPaths.isEmpty =>
+read(mergeParquetPartition.split.dataFile, dataReadFunction)
+  case unMergeSplit if unMergeSplit.skipMerge =>
+unMergeFileIterator(mergeParquetPartition.split, dataReadFunction)
+  case mergeSplit if !mergeSplit.skipMerge =>
+mergeFileIterator(mergeParquetPartition.split, dataReadFunction)
+  case _ => throw new HoodieException("Unable to select an Iterator to 
read the Hudi MOR File Split")
+}
+  }
+
+  override protected def getPartitions: Array[Partition] = {
+hudiRealtimeFileSplits.zipWithIndex.map(file => 
HudiMergeOnReadPartition(file._2, file._1)).toArray
+  }
+
+  private def getConfig(): Configuration = {
+broadcastedConf.value.config
+  }
+
+  private def read(partitionedFile: PartitionedFile,
+   readFileFunction: PartitionedFile => Iterator[Any]): 
Iterator[InternalRow] = {
+val fileIterator = readFileFunction(partitionedFile)
+val rows = fileIterator.flatMap(_ match {
+  case r: InternalRow => Seq(r)
+  case b: ColumnarBatch => b.rowIterator().asScala
+})
+rows
+  }
+
+  private def unMergeFileIterator(split: HudiMergeOnReadFileSplit,
+  readFileFunction: PartitionedFile => 
Iterator[Any]): Iterator[InternalRow] =
+new Iterator[InternalRow] {
+  private val dataFileIterator = read(split.dataFile, readFileFunction)
+  private val logSchema = getLogAvroSchema(split)
+  private val sparkTypes = 
SchemaConverters.toSqlType(logSchema).dataType.asInstanceOf[StructType]
+  private val converter = new AvroDeserializer(logSchema, sparkTypes)
+  private val hudiLogRecords = scanLog(split, 

[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458242933



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HudiMergeOnReadRDD.scala
##
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.log.{HoodieMergedLogRecordScanner, 
LogReaderUtils}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hadoop.config.{HadoopSerializableConfiguration, 
HoodieRealtimeConfig}
+import 
org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.HOODIE_RECORD_KEY_COL_POS
+
+import org.apache.avro.Schema
+import org.apache.hadoop.conf.Configuration
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.vectorized.ColumnarBatch
+
+import scala.collection.JavaConverters._
+import scala.util.Try
+
+case class HudiMergeOnReadPartition(index: Int, split: 
HudiMergeOnReadFileSplit) extends Partition
+
+class HudiMergeOnReadRDD(sc: SparkContext,
+ broadcastedConf: 
Broadcast[HadoopSerializableConfiguration],
+ dataReadFunction: PartitionedFile => Iterator[Any],
+ dataSchema: StructType,
+ hudiRealtimeFileSplits: 
List[HudiMergeOnReadFileSplit])
+  extends RDD[InternalRow](sc, Nil) {
+
+  // Broadcast the hadoop Configuration to executors.
+  def this(sc: SparkContext,
+   config: Configuration,
+   dataReadFunction: PartitionedFile => Iterator[Any],
+   dataSchema: StructType,
+   hudiRealtimeFileSplits: List[HudiMergeOnReadFileSplit]) = {
+this(
+  sc,
+  sc.broadcast(new HadoopSerializableConfiguration(config))
+  .asInstanceOf[Broadcast[HadoopSerializableConfiguration]],
+  dataReadFunction,
+  dataSchema,
+  hudiRealtimeFileSplits)
+  }
+
+  override def compute(split: Partition, context: TaskContext): 
Iterator[InternalRow] = {
+val mergeParquetPartition = split.asInstanceOf[HudiMergeOnReadPartition]
+mergeParquetPartition.split match {
+  case dataFileOnlySplit if dataFileOnlySplit.logPaths.isEmpty =>
+read(mergeParquetPartition.split.dataFile, dataReadFunction)
+  case unMergeSplit if unMergeSplit.skipMerge =>
+unMergeFileIterator(mergeParquetPartition.split, dataReadFunction)
+  case mergeSplit if !mergeSplit.skipMerge =>
+mergeFileIterator(mergeParquetPartition.split, dataReadFunction)
+  case _ => throw new HoodieException("Unable to select an Iterator to 
read the Hudi MOR File Split")
+}
+  }
+
+  override protected def getPartitions: Array[Partition] = {
+hudiRealtimeFileSplits.zipWithIndex.map(file => 
HudiMergeOnReadPartition(file._2, file._1)).toArray
+  }
+
+  private def getConfig(): Configuration = {
+broadcastedConf.value.config
+  }
+
+  private def read(partitionedFile: PartitionedFile,
+   readFileFunction: PartitionedFile => Iterator[Any]): 
Iterator[InternalRow] = {
+val fileIterator = readFileFunction(partitionedFile)
+val rows = fileIterator.flatMap(_ match {
+  case r: InternalRow => Seq(r)
+  case b: ColumnarBatch => b.rowIterator().asScala
+})
+rows
+  }
+
+  private def unMergeFileIterator(split: HudiMergeOnReadFileSplit,
+  readFileFunction: PartitionedFile => 
Iterator[Any]): Iterator[InternalRow] =
+new Iterator[InternalRow] {
+  private val dataFileIterator = read(split.dataFile, readFileFunction)
+  private val logSchema = getLogAvroSchema(split)
+  private val sparkTypes = 
SchemaConverters.toSqlType(logSchema).dataType.asInstanceOf[StructType]
+  private val converter = new AvroDeserializer(logSchema, sparkTypes)
+  private val hudiLogRecords = scanLog(split, 

[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458241891



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HudiMergeOnReadRDD.scala
##
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.log.{HoodieMergedLogRecordScanner, 
LogReaderUtils}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hadoop.config.{HadoopSerializableConfiguration, 
HoodieRealtimeConfig}
+import 
org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.HOODIE_RECORD_KEY_COL_POS
+
+import org.apache.avro.Schema
+import org.apache.hadoop.conf.Configuration
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.vectorized.ColumnarBatch
+
+import scala.collection.JavaConverters._
+import scala.util.Try
+
+case class HudiMergeOnReadPartition(index: Int, split: 
HudiMergeOnReadFileSplit) extends Partition
+
+class HudiMergeOnReadRDD(sc: SparkContext,
+ broadcastedConf: 
Broadcast[HadoopSerializableConfiguration],
+ dataReadFunction: PartitionedFile => Iterator[Any],
+ dataSchema: StructType,
+ hudiRealtimeFileSplits: 
List[HudiMergeOnReadFileSplit])
+  extends RDD[InternalRow](sc, Nil) {
+
+  // Broadcast the hadoop Configuration to executors.
+  def this(sc: SparkContext,
+   config: Configuration,
+   dataReadFunction: PartitionedFile => Iterator[Any],
+   dataSchema: StructType,
+   hudiRealtimeFileSplits: List[HudiMergeOnReadFileSplit]) = {
+this(
+  sc,
+  sc.broadcast(new HadoopSerializableConfiguration(config))
+  .asInstanceOf[Broadcast[HadoopSerializableConfiguration]],
+  dataReadFunction,
+  dataSchema,
+  hudiRealtimeFileSplits)
+  }
+
+  override def compute(split: Partition, context: TaskContext): 
Iterator[InternalRow] = {
+val mergeParquetPartition = split.asInstanceOf[HudiMergeOnReadPartition]
+mergeParquetPartition.split match {
+  case dataFileOnlySplit if dataFileOnlySplit.logPaths.isEmpty =>
+read(mergeParquetPartition.split.dataFile, dataReadFunction)
+  case unMergeSplit if unMergeSplit.skipMerge =>
+unMergeFileIterator(mergeParquetPartition.split, dataReadFunction)
+  case mergeSplit if !mergeSplit.skipMerge =>
+mergeFileIterator(mergeParquetPartition.split, dataReadFunction)
+  case _ => throw new HoodieException("Unable to select an Iterator to 
read the Hudi MOR File Split")
+}
+  }
+
+  override protected def getPartitions: Array[Partition] = {
+hudiRealtimeFileSplits.zipWithIndex.map(file => 
HudiMergeOnReadPartition(file._2, file._1)).toArray
+  }
+
+  private def getConfig(): Configuration = {
+broadcastedConf.value.config
+  }
+
+  private def read(partitionedFile: PartitionedFile,
+   readFileFunction: PartitionedFile => Iterator[Any]): 
Iterator[InternalRow] = {
+val fileIterator = readFileFunction(partitionedFile)
+val rows = fileIterator.flatMap(_ match {
+  case r: InternalRow => Seq(r)
+  case b: ColumnarBatch => b.rowIterator().asScala
+})
+rows
+  }
+
+  private def unMergeFileIterator(split: HudiMergeOnReadFileSplit,
+  readFileFunction: PartitionedFile => 
Iterator[Any]): Iterator[InternalRow] =
+new Iterator[InternalRow] {
+  private val dataFileIterator = read(split.dataFile, readFileFunction)
+  private val logSchema = getLogAvroSchema(split)
+  private val sparkTypes = 
SchemaConverters.toSqlType(logSchema).dataType.asInstanceOf[StructType]
+  private val converter = new AvroDeserializer(logSchema, sparkTypes)
+  private val hudiLogRecords = scanLog(split, 

[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458237536



##
File path: hudi-spark/src/main/scala/org/apache/hudi/SnapshotRelation.scala
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils
+import 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes
+
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.JobConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.sources.{BaseRelation, TableScan}
+import org.apache.spark.sql.types.StructType
+
+import scala.collection.JavaConverters._
+
+case class HudiMergeOnReadFileSplit(dataFile: PartitionedFile,
+logPaths: Option[List[String]],
+latestCommit: String,
+tablePath: String,
+maxCompactionMemoryInBytes: Long,
+skipMerge: Boolean)
+
+class SnapshotRelation (val sqlContext: SQLContext,
+val optParams: Map[String, String],
+val userSchema: StructType,
+val globPaths: Seq[Path],
+val metaClient: HoodieTableMetaClient)
+  extends BaseRelation with TableScan with Logging{
+
+  private val conf = sqlContext.sparkContext.hadoopConfiguration
+
+  // use schema from latest metadata, if not present, read schema from the 
data file
+  private val latestSchema = {
+val schemaUtil = new TableSchemaResolver(metaClient)
+val tableSchema = 
HoodieAvroUtils.createHoodieWriteSchema(schemaUtil.getTableAvroSchemaWithoutMetadataFields)
+AvroConversionUtils.convertAvroSchemaToStructType(tableSchema)
+  }
+
+  private val skipMerge = optParams.getOrElse(
+DataSourceReadOptions.REALTIME_SKIP_MERGE_KEY,
+DataSourceReadOptions.DEFAULT_REALTIME_SKIP_MERGE_VAL).toBoolean
+  private val maxCompactionMemoryInBytes = getMaxCompactionMemoryInBytes(new 
JobConf(conf))
+  private val fileIndex = buildFileIndex()
+
+  override def schema: StructType = latestSchema
+
+  override def needConversion: Boolean = false
+
+  override def buildScan(): RDD[Row] = {
+val parquetReaderFunction = new 
ParquetFileFormat().buildReaderWithPartitionValues(

Review comment:
   I think we can add a `HoodieLogFileFormat` to read log files in the 
future. Wrapping the `FileFormat` here gives us a lot of flexibility to adopt 
other formats like ORC. Wrapping inside the `FileFormat` would also be possible 
by overriding `buildReaderWithPartitionValues` and calling 
`super.buildReaderWithPartitionValues` to get the iterator. The downside would 
be that we probably need two separate classes to handle ORC and Parquet.
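   A minimal sketch of the wrapping idea, assuming Spark 2.4's `ParquetFileFormat` 
API; the class name and the merge hook below are illustrative, not code from this PR:
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.execution.datasources.PartitionedFile
   import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
   import org.apache.spark.sql.sources.Filter
   import org.apache.spark.sql.types.StructType

   // Hypothetical wrapper: reuse the stock Parquet reader and keep a hook for log merging.
   class HoodieSnapshotFileFormat extends ParquetFileFormat {
     override def buildReaderWithPartitionValues(
         sparkSession: SparkSession,
         dataSchema: StructType,
         partitionSchema: StructType,
         requiredSchema: StructType,
         filters: Seq[Filter],
         options: Map[String, String],
         hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
       // Build the base-file reader exactly as ParquetFileFormat would.
       val parquetReader = super.buildReaderWithPartitionValues(
         sparkSession, dataSchema, partitionSchema, requiredSchema, filters, options, hadoopConf)
       // Wrap the returned iterator; this is where log records would be merged in.
       (file: PartitionedFile) => parquetReader(file)
     }
   }
   ```
   An ORC variant would then be a second class wrapping `OrcFileFormat` the same 
way, which is the downside mentioned above.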





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ssomuah edited a comment on issue #1852: [SUPPORT]

2020-07-21 Thread GitBox


ssomuah edited a comment on issue #1852:
URL: https://github.com/apache/hudi/issues/1852#issuecomment-661970919


   I don't see any exceptions in the driver logs or executor logs. 
   
   I see these two warnings in driver logs
   ```
   20/07/21 13:12:28 WARN IncrementalTimelineSyncFileSystemView: Incremental 
Sync of timeline is turned off or deemed unsafe. Will revert to full syncing
   ```
   ```
   20/07/21 13:12:29 WARN CleanPlanner: Incremental Cleaning mode is enabled. 
Looking up partition-paths that have since changed since last cleaned at 
20200721032203. New Instant to retain : 
Option{val=[20200721032203__commit__COMPLETED]}
   ```
   
   These are the contents of the timeline:
   
[dot_hoodie_folder.txt](https://github.com/apache/hudi/files/4954820/dot_hoodie_folder.txt)
   
   The timeline only has files from the current day, but I see log files in the 
data folder from over a week ago. Do you have any idea what might be causing so 
many log files?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ssomuah commented on issue #1852: [SUPPORT]

2020-07-21 Thread GitBox


ssomuah commented on issue #1852:
URL: https://github.com/apache/hudi/issues/1852#issuecomment-661970919


   I don't see any exceptions in the driver logs or executor logs. 
   
   I see these two warnings in driver logs
   ```
   20/07/21 13:12:28 WARN IncrementalTimelineSyncFileSystemView: Incremental 
Sync of timeline is turned off or deemed unsafe. Will revert to full syncing
   ```
   ```
   20/07/21 13:12:29 WARN CleanPlanner: Incremental Cleaning mode is enabled. 
Looking up partition-paths that have since changed since last cleaned at 
20200721032203. New Instant to retain : 
Option{val=[20200721032203__commit__COMPLETED]}
   ```
   
   These are the contents of the timeline:
   
[dot_hoodie_folder.txt](https://github.com/apache/hudi/files/4954820/dot_hoodie_folder.txt)
   
   The timeline only has files from the current day, but I see log files in the 
data folder from over a week ago. 
   
   
   Do you have any idea what might be causing so many log files?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458232450



##
File path: hudi-spark/src/main/scala/org/apache/hudi/SnapshotRelation.scala
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils
+import 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes
+
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.JobConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.sources.{BaseRelation, TableScan}
+import org.apache.spark.sql.types.StructType
+
+import scala.collection.JavaConverters._
+
+case class HudiMergeOnReadFileSplit(dataFile: PartitionedFile,
+logPaths: Option[List[String]],
+latestCommit: String,
+tablePath: String,
+maxCompactionMemoryInBytes: Long,
+skipMerge: Boolean)
+
+class SnapshotRelation (val sqlContext: SQLContext,
+val optParams: Map[String, String],
+val userSchema: StructType,
+val globPaths: Seq[Path],
+val metaClient: HoodieTableMetaClient)
+  extends BaseRelation with TableScan with Logging{
+
+  private val conf = sqlContext.sparkContext.hadoopConfiguration
+
+  // use schema from latest metadata, if not present, read schema from the 
data file
+  private val latestSchema = {
+val schemaUtil = new TableSchemaResolver(metaClient)
+val tableSchema = 
HoodieAvroUtils.createHoodieWriteSchema(schemaUtil.getTableAvroSchemaWithoutMetadataFields)
+AvroConversionUtils.convertAvroSchemaToStructType(tableSchema)
+  }
+
+  private val skipMerge = optParams.getOrElse(
+DataSourceReadOptions.REALTIME_SKIP_MERGE_KEY,
+DataSourceReadOptions.DEFAULT_REALTIME_SKIP_MERGE_VAL).toBoolean
+  private val maxCompactionMemoryInBytes = getMaxCompactionMemoryInBytes(new 
JobConf(conf))
+  private val fileIndex = buildFileIndex()
+
+  override def schema: StructType = latestSchema
+
+  override def needConversion: Boolean = false
+
+  override def buildScan(): RDD[Row] = {
+val parquetReaderFunction = new 
ParquetFileFormat().buildReaderWithPartitionValues(
+  sparkSession = sqlContext.sparkSession,
+  dataSchema = latestSchema,
+  partitionSchema = StructType(Nil),
+  requiredSchema = latestSchema,
+  filters = Seq.empty,

Review comment:
   This field requires a `Seq[Filter]`. With `PrunedFilteredScan` we can just 
pass whatever Spark passes to `buildScan(requiredColumns, filters)` through to 
here. This field could be an empty `Seq`.
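   For reference, a bare-bones sketch of the `PrunedFilteredScan` signature (the 
class and body are placeholders, not this PR's code):
   ```scala
   import org.apache.spark.rdd.RDD
   import org.apache.spark.sql.{Row, SQLContext}
   import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
   import org.apache.spark.sql.types.StructType

   // Placeholder relation showing only the buildScan signature change.
   class PrunedSnapshotRelationSketch(val sqlContext: SQLContext, tableSchema: StructType)
     extends BaseRelation with PrunedFilteredScan {

     override def schema: StructType = tableSchema

     // Spark passes the pruned columns and the push-down filters here; the filters
     // can be forwarded to buildReaderWithPartitionValues instead of Seq.empty.
     override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
       // A real implementation would build the parquet/log readers using `filters.toSeq`.
       sqlContext.sparkContext.emptyRDD[Row]
     }
   }
   ```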





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458230756



##
File path: hudi-spark/src/main/scala/org/apache/hudi/SnapshotRelation.scala
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils
+import 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes
+
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.JobConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.sources.{BaseRelation, TableScan}
+import org.apache.spark.sql.types.StructType
+
+import scala.collection.JavaConverters._
+
+case class HudiMergeOnReadFileSplit(dataFile: PartitionedFile,
+logPaths: Option[List[String]],
+latestCommit: String,
+tablePath: String,
+maxCompactionMemoryInBytes: Long,
+skipMerge: Boolean)
+
+class SnapshotRelation (val sqlContext: SQLContext,
+val optParams: Map[String, String],
+val userSchema: StructType,
+val globPaths: Seq[Path],
+val metaClient: HoodieTableMetaClient)
+  extends BaseRelation with TableScan with Logging{
+
+  private val conf = sqlContext.sparkContext.hadoopConfiguration
+
+  // use schema from latest metadata, if not present, read schema from the 
data file
+  private val latestSchema = {
+val schemaUtil = new TableSchemaResolver(metaClient)
+val tableSchema = 
HoodieAvroUtils.createHoodieWriteSchema(schemaUtil.getTableAvroSchemaWithoutMetadataFields)
+AvroConversionUtils.convertAvroSchemaToStructType(tableSchema)
+  }
+
+  private val skipMerge = optParams.getOrElse(
+DataSourceReadOptions.REALTIME_SKIP_MERGE_KEY,
+DataSourceReadOptions.DEFAULT_REALTIME_SKIP_MERGE_VAL).toBoolean
+  private val maxCompactionMemoryInBytes = getMaxCompactionMemoryInBytes(new 
JobConf(conf))
+  private val fileIndex = buildFileIndex()
+
+  override def schema: StructType = latestSchema
+
+  override def needConversion: Boolean = false
+
+  override def buildScan(): RDD[Row] = {
+val parquetReaderFunction = new 
ParquetFileFormat().buildReaderWithPartitionValues(
+  sparkSession = sqlContext.sparkSession,
+  dataSchema = latestSchema,
+  partitionSchema = StructType(Nil),
+  requiredSchema = latestSchema,

Review comment:
   This is related to `PrunedFilteredScan`. We need to support merging two 
records with different schemas if we don't read all fields here. This will be 
targeted for the 0.6.0 release.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458224048



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -110,6 +112,10 @@ object DataSourceReadOptions {
*/
   val INCR_PATH_GLOB_OPT_KEY = "hoodie.datasource.read.incr.path.glob"
   val DEFAULT_INCR_PATH_GLOB_OPT_VAL = ""
+
+
+  val REALTIME_SKIP_MERGE_KEY = REALTIME_SKIP_MERGE_PROP

Review comment:
   Agree. Maybe something like `SNAPSHOT_READ_STRATEGY`? So we can control 
the logic for `merge`, `unmerge`, `mergeWithBootstrap`, etc.
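   A rough sketch of what such an option could look like in 
`DataSourceOptions.scala`; the key name and values are placeholders, nothing 
agreed on yet:
   ```scala
   object SnapshotReadStrategySketch {
     // Placeholder key and values for the strategy being discussed.
     val SNAPSHOT_READ_STRATEGY_OPT_KEY = "hoodie.datasource.read.snapshot.strategy"
     val DEFAULT_SNAPSHOT_READ_STRATEGY_OPT_VAL = "merge" // alternatives: "unmerge", "mergeWithBootstrap"

     def resolve(optParams: Map[String, String]): String =
       optParams.getOrElse(SNAPSHOT_READ_STRATEGY_OPT_KEY, DEFAULT_SNAPSHOT_READ_STRATEGY_OPT_VAL)
   }
   ```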





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-21 Thread GitBox


tooptoop4 commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-661955977


   I'm seeing the same entries under .hoodie.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458215093



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
##
@@ -147,12 +146,4 @@ public Schema getWriterSchema() {
   public Schema getHiveSchema() {
 return hiveSchema;
   }
-
-  public long getMaxCompactionMemoryInBytes() {

Review comment:
   Moved to a utils class. Need to call this method from 
`HoodieMergeOnReadRDD`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-21 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r458213831



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/config/HadoopSerializableConfiguration.java
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.config;
+
+import org.apache.hadoop.conf.Configuration;
+
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.io.Serializable;
+
+public class HadoopSerializableConfiguration implements Serializable {

Review comment:
   Didn't know this. Will switch over.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-21 Thread GitBox


shenh062326 commented on a change in pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#discussion_r458181162



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
##
@@ -66,8 +74,9 @@ public OverwriteWithLatestAvroPayload 
preCombine(OverwriteWithLatestAvroPayload
 }
 
 GenericRecord genericRecord = (GenericRecord) recordOption.get();
-// combining strategy here trivially ignores currentValue on disk and 
writes this record
-Object deleteMarker = genericRecord.get("_hoodie_is_deleted");
+// combining strategy here trivially ignores currentValue on disk and 
writes this record吗
+String deleteField = isDeletedField == null ? "_hoodie_is_deleted" : 
isDeletedField;

Review comment:
   > I don't see any tests being added as part of the patch. Would be nice 
to have some tests covering the new code that was added at all levels.
   > 
   > * WriteClient
   > * Datasource if there is an existing suite of tests for other write 
operations
   > * Deltastreamer
   
   Sorry for the late reply, I will add some test cases and address the comments.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-21 Thread GitBox


shenh062326 commented on a change in pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#discussion_r458181162



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
##
@@ -66,8 +74,9 @@ public OverwriteWithLatestAvroPayload 
preCombine(OverwriteWithLatestAvroPayload
 }
 
 GenericRecord genericRecord = (GenericRecord) recordOption.get();
-// combining strategy here trivially ignores currentValue on disk and 
writes this record
-Object deleteMarker = genericRecord.get("_hoodie_is_deleted");
+// combining strategy here trivially ignores currentValue on disk and 
writes this record吗
+String deleteField = isDeletedField == null ? "_hoodie_is_deleted" : 
isDeletedField;

Review comment:
   > I don't see any tests being added as part of the patch. Would be nice 
to have some tests covering the new code that was added at all levels.
   > 
   > * WriteClient
   > * Datasource if there is an existing suite of tests for other write 
operations
   > * Deltastreamer
   
   Sorry for the late reply, I will add some test cases and address the comments.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] GurRonenExplorium edited a comment on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

2020-07-21 Thread GitBox


GurRonenExplorium edited a comment on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-661909397


   For additional context, the Hudi configuration:
   ```
   val hudiOptions = Map[String, String](
 HoodieWriteConfig.TABLE_NAME -> tableName,
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> uuidColumn,
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> partitionColumn,
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> precombineField,
 DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
 DataSourceWriteOptions.OPERATION_OPT_KEY -> "insert",
 DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
 DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> databaseName,
 DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
 DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> 
partitionColumn,
 DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
classOf[MultiPartKeysValueExtractor].getName,
 DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
 HoodieStorageConfig.PARQUET_COMPRESSION_CODEC -> "snappy",
 HoodieCompactionConfig.COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE -> 
String.valueOf(500),
 HoodieCompactionConfig.COPY_ON_WRITE_TABLE_AUTO_SPLIT_INSERTS -> 
String.valueOf(true),
 HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE -> 
String.valueOf(200)
   )
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] GurRonenExplorium commented on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

2020-07-21 Thread GitBox


GurRonenExplorium commented on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-661909397


   For additional context, the Hudi configuration:
   ```
   val hudiOptions = Map[String, String](
 HoodieWriteConfig.TABLE_NAME -> tableName,
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> uuidColumn,
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> partitionColumn,
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> precombineField,
 DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
 DataSourceWriteOptions.OPERATION_OPT_KEY -> 
ingestMode.getOrElse(DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL),
 DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
 DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> databaseName,
 DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
 DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> 
partitionColumn,
 DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
classOf[MultiPartKeysValueExtractor].getName,
 DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
 HoodieStorageConfig.PARQUET_COMPRESSION_CODEC -> "snappy",
 HoodieCompactionConfig.COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE -> 
String.valueOf(500),
 HoodieCompactionConfig.COPY_ON_WRITE_TABLE_AUTO_SPLIT_INSERTS -> 
String.valueOf(true),
 HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE -> 
String.valueOf(200)
   )
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] GurRonenExplorium opened a new issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

2020-07-21 Thread GitBox


GurRonenExplorium opened a new issue #1856:
URL: https://github.com/apache/hudi/issues/1856


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   Hey,
   tl;dr: Hive Sync is failing on `alter table cascade`
   
   I am running a PoC with Hudi and started working with a time-series dataset 
we have. The input is partitioned by insertion_time, with late data arriving at 
most 48 hours behind. The output is the same dataset, with event_time partitions 
and some additional fields (all of them computed row by row, with no aggregations).
   
   Setup: AWS EMR, using transient clusters (Spark for the job itself, Hive for 
access to the Glue metastore for the HiveSync tool; by the way, if there is a 
better way I'm happy to hear it).
   
   Steps I did:
   1. Loaded 1 day of data (worked well).
   2. Loaded a few extra days in 1-partition batches (so each run was a single 
insertion_time partition); everything synced well too.
   3. Ran on a full month of data in a single job.
   4. Data loaded to Hudi successfully, but HiveSync failed with an alter table 
error.
   
   
   
   **Expected behavior**
   
   Hive Sync shouldn't crash when syncing to the Glue catalog.
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4.5
   
   * Hive version : 2.3.6
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   EMR: 5.30.1
   
   **Stacktrace**
   The stacktrace is a bit redacted; if anything more is needed I can get it.
   ```
   20/07/19 19:27:47 ERROR HiveSyncTool: Got runtime exception when hive syncing
   org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL ALTER 
TABLE `#DB_NAME#`.`#TABLE_NAME#` REPLACE COLUMNS(`_hoodie_commit_time` string, 
`_hoodie_commit_seqno` string, `_hoodie_record_key` string, 
`_hoodie_partition_path` string, `_hoodie_file_name` string, `utc_timestamp` 
string, `local_timestamp_with_timezone` string, `utc_timestamp_with_timezone` 
string, `#COL1#` string, `#COL2#` string, `#COL3#` double, `#COL4#` double, 
`#COL5#` string, `#COL6#` string, `#COL7#` double, `#COL8#` double, `#COL9#` 
string, `#COL10#` bigint, `#COL11#` string, `#COL12#` string, `#COL13#` string, 
`#COL14#` string, `#COL15#` string, `#COL16#` string, `#COL17#` string, 
`#COL18#` int, `hash_id` string, `#REDACTED#_6` string, `#REDACTED#_7` string, 
`#REDACTED#_8` string, `#REDACTED#_9` string, `#REDACTED#_10` string, 
`#REDACTED#_11` string, `offset_year` int, `offset_month` int, 
`offset_dayofmonth` int, `offset_dayofweek` int, `offset_hourofday` int ) 
cascade
   at 
org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:482)
   at 
org.apache.hudi.hive.HoodieHiveClient.updateTableDefinition(HoodieHiveClient.java:261)
   at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:164)
   at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:114)
   at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:87)
   at 
org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:229)
   at 
org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:279)
   at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
   at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
   at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
   at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112)
   at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFr

[GitHub] [hudi] asheeshgarg commented on issue #1787: Exception During Insert

2020-07-21 Thread GitBox


asheeshgarg commented on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-661874955


   @bvaradar any recommendations on this, please?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-21 Thread GitBox


asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-661874341


   @bvaradar so the inserts are looking fine now, and the COW compaction is 
generating 2 parquet files for each date.
   I also set the following properties:
   
   "hoodie.keep.min.commits": 2,
   "hoodie.keep.max.commits": 4,
   But I still see a lot of entries accumulating under .hoodie, like:
   2020-07-21 12:20:44    0 Bytes    20200721122042.rollback.inflight
   2020-07-21 12:21:45    1.2 KiB    20200721122143.rollback
   2020-07-21 12:21:45    0 Bytes    20200721122143.rollback.inflight
   2020-07-21 12:31:05    1.0 KiB    20200721123102.rollback
   2020-07-21 12:31:05    0 Bytes    20200721123102.rollback.inflight
   ...
   2020-07-21 13:43:15    950 Bytes  20200721134301.clean.inflight
   2020-07-21 13:43:15    950 Bytes  20200721134301.clean.requested
   2020-07-21 13:43:12    3.9 KiB    20200721134301.commit
   2020-07-21 13:43:02    0 Bytes    20200721134301.commit.requested
   2020-07-21 13:43:06    1.0 KiB    20200721134301.inflight
   
   Do I need to set anything to clean up?
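   One hedged guess rather than a confirmed answer: the entries under `.hoodie` are 
trimmed by archiving, which is driven by `hoodie.keep.min.commits` / 
`hoodie.keep.max.commits`, and if I recall correctly archiving only kicks in when 
`hoodie.cleaner.commits.retained` (default 10) is smaller than 
`hoodie.keep.min.commits`. Expressed as a Scala map purely for illustration:
   ```scala
   // Hedged sketch only; values are illustrative.
   val housekeepingOptions = Map[String, String](
     "hoodie.keep.min.commits" -> "2",
     "hoodie.keep.max.commits" -> "4",
     "hoodie.cleaner.commits.retained" -> "1" // must stay below hoodie.keep.min.commits
   )
   ```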
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-21 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r458060234



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -632,6 +632,21 @@ public FileSystemViewStorageConfig 
getClientSpecifiedViewStorageConfig() {
 return clientSpecifiedViewStorageConfig;
   }
 
+  /**
+   * Commit call back configs.
+   */
+  public boolean writeCommitCallbackOn() {
+return 
Boolean.parseBoolean(props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_ON));
+  }
+
+  public String getCallbackType() {

Review comment:
   > IMO, this property is not necessary. We can only depend on 
`getCallbackClass ` to specify the detailed implementation. This is a simple 
**SPI** mode. Otherwise, it causes two entry points in the factory you 
implemented. WDYT?
   
   done
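   A tiny sketch of the class-name-only (SPI) approach discussed above; the trait 
and property name are illustrative, not the final Hudi API:
   ```scala
   // Illustrative only: resolve the callback purely from a class-name property.
   trait CommitCallbackSketch {
     def onCommit(instantTime: String): Unit
   }

   object CommitCallbackFactorySketch {
     def create(props: java.util.Properties): CommitCallbackSketch = {
       val clazz = props.getProperty("hoodie.write.commit.callback.class")
       // No separate "callback type" key is needed; the class itself is the contract.
       Class.forName(clazz).getDeclaredConstructor().newInstance().asInstanceOf[CommitCallbackSketch]
     }
   }
   ```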





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1116) Support time travel using timestamp type

2020-07-21 Thread linshan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161997#comment-17161997
 ] 

linshan commented on HUDI-1116:
---

Hi [~vbalaji],

    would you describe the problem in detail? I want to get involved.

> Support time travel using timestamp type
> 
>
> Key: HUDI-1116
> URL: https://issues.apache.org/jira/browse/HUDI-1116
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: linshan
>Priority: Major
>
>  
> {{Currently, we use commit time to mimic time-travel queries. We need ability 
> to handle time-travel with a proper timestamp provided.}}
> {{}}
> {{For e:g: }}
> {{spark.read  .format(“hudi”).option(“timestampAsOf”, 
> “2019-01-01”).load(“/path/to/my/table”)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] leesf commented on pull request #1855: [HUDI-871] Add support for Tencent Cloud Object Storage(COS)

2020-07-21 Thread GitBox


leesf commented on pull request #1855:
URL: https://github.com/apache/hudi/pull/1855#issuecomment-661790185


   Closing to retrigger CI.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf closed pull request #1855: [HUDI-871] Add support for Tencent Cloud Object Storage(COS)

2020-07-21 Thread GitBox


leesf closed pull request #1855:
URL: https://github.com/apache/hudi/pull/1855


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1109) Support Spark Structured Streaming read from Hudi table

2020-07-21 Thread linshan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

linshan reassigned HUDI-1109:
-

Assignee: linshan

> Support Spark Structured Streaming read from Hudi table
> ---
>
> Key: HUDI-1109
> URL: https://issues.apache.org/jira/browse/HUDI-1109
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: leesf
>Assignee: linshan
>Priority: Major
>
> Now hudi do not support spark structured streaming reading from hudi dataset, 
> we would support it so that hudi support both streaming write and read.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1116) Support time travel using timestamp type

2020-07-21 Thread linshan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

linshan reassigned HUDI-1116:
-

Assignee: linshan

> Support time travel using timestamp type
> 
>
> Key: HUDI-1116
> URL: https://issues.apache.org/jira/browse/HUDI-1116
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: linshan
>Priority: Major
>
>  
> {{Currently, we use commit time to mimic time-travel queries. We need ability 
> to handle time-travel with a proper timestamp provided.}}
> {{}}
> {{For e:g: }}
> {{spark.read  .format(“hudi”).option(“timestampAsOf”, 
> “2019-01-01”).load(“/path/to/my/table”)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-871) Add support for Tencent cloud COS

2020-07-21 Thread deyzhong (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161900#comment-17161900
 ] 

deyzhong commented on HUDI-871:
---

I have submitted a PR ([https://github.com/apache/hudi/pull/1855]); please help 
review it. I hope to get your opinions and suggestions.

Thank you.

> Add support for Tencent cloud COS
> -
>
> Key: HUDI-871
> URL: https://issues.apache.org/jira/browse/HUDI-871
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Canbin Zheng
>Assignee: deyzhong
>Priority: Major
>  Labels: newbie, pull-request-available, starter
>
> Tencent cloud COS is becoming a widely used Object Storage Service, more and 
> more users use COS as the backend storage system, therefore this ticket 
> proposes to add support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-871) Add support for Tencent cloud COS

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-871:

Labels: newbie pull-request-available starter  (was: newbie starter)

> Add support for Tencent cloud COS
> -
>
> Key: HUDI-871
> URL: https://issues.apache.org/jira/browse/HUDI-871
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Canbin Zheng
>Assignee: deyzhong
>Priority: Major
>  Labels: newbie, pull-request-available, starter
>
> Tencent cloud COS is becoming a widely used Object Storage Service, more and 
> more users use COS as the backend storage system, therefore this ticket 
> proposes to add support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] DeyinZhong opened a new pull request #1855: [HUDI-871] Add support for Tencent Cloud Object Storage(COS)

2020-07-21 Thread GitBox


DeyinZhong opened a new pull request #1855:
URL: https://github.com/apache/hudi/pull/1855


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   - Add Hudi support for Tencent Cloud Object Storage (COS)
   
   ## Brief change log
   
   - add the cosn scheme in StorageSchemes.java
   
   - compile Hudi after modifying the code
   ```
   mvn clean package -DskipTests -DskipITs -Dhadoop.version=2.8.5 
-Dhive.version=2.3.5 -Dspark.version=2.4.3
   ```
   
![image](https://user-images.githubusercontent.com/44561252/88037478-8a4bed80-cb77-11ea-8bac-e2c09528ec1c.png)
   
   
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows.
   You can refer to the documentation:
   http://hudi.apache.org/docs/docker_demo.html
   
   Also, we have implemented this feature in the Tencent Cloud EMR product; please 
see: https://cloud.tencent.com/document/product/589/42955
   
   Environments:
   
   - hadoop: 2.8.5
   
   - hive: 2.3.5
   
   - spark:  2.4.3
   
   - hudi: release-0.5.1-incubating
   
   The general steps for Hudi on Tencent Cloud Object Storage (COS) are as follows:
   
   - Step 1: Upload config to COS
   
   ```
   hdfs dfs -mkdir -p cosn://[bucket]/hudi/config
   hdfs dfs -copyFromLocal demo/config/*  cosn://[bucket]/hudi/config/
   ```
   
   - Step 2: Incrementally ingest data from Kafka, and write to COS
   
   ```
   spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --master yarn 
./hudi-utilities-bundle_2.11-0.5.1-incubating.jar   --table-type COPY_ON_WRITE 
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource 
--source-ordering-field ts  --target-base-path 
cosn://[bucket]/usr/hive/warehouse/stock_ticks_cow --target-table 
stock_ticks_cow --props cosn://[bucket]/hudi/config/kafka-source.properties 
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
   
   
   spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  --master yarn 
./hudi-utilities-bundle_2.11-0.5.1-incubating.jar  --table-type MERGE_ON_READ 
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource 
--source-ordering-field ts  --target-base-path 
cosn://[bucket]/usr/hive/warehouse/stock_ticks_mor --target-table 
stock_ticks_mor --props cosn://[bucket]/hudi/config/kafka-source.properties 
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider 
--disable-compaction
   ```
   
   
   - Step 3: Sync with Hive when the data is on COS
   ```
   bin/run_sync_tool.sh  --jdbc-url 
jdbc:hive2://[hiveserver2_ip:hiveserver2_port] --user hadoop --pass isd@cloud 
--partitioned-by dt --base-path 
cosn://[bucket]/usr/hive/warehouse/stock_ticks_cow --database default --table 
stock_ticks_cow
   
   bin/run_sync_tool.sh  --jdbc-url 
jdbc:hive2://[hiveserver2_ip:hiveserver2_port] --user hadoop --pass hive 
--partitioned-by dt --base-path 
cosn://[bucket]/usr/hive/warehouse/stock_ticks_mor --database default --table 
stock_ticks_mor --skip-ro-suffix
   
   ```
   
   - Step 4: Query the Hudi table with the Hive or Spark SQL engine
   
   ```
   
   beeline -u jdbc:hive2://[hiveserver2_ip:hiveserver2_port] -n hadoop 
--hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat 
--hiveconf hive.stats.autogather=false
   
   spark-sql --master yarn --conf spark.sql.hive.convertMetastoreParquet=false
   
   hivesqls:
   select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 
'GOOG';
   select `_hoodie_commit_time`, symbol, ts, volume, open, close  from 
stock_ticks_cow where  symbol = 'GOOG';
   select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 
'GOOG';
   select `_hoodie_commit_time`, symbol, ts, volume, open, close  from 
stock_ticks_mor where  symbol = 'GOOG';
   select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol 
= 'GOOG';
   select `_hoodie_commit_time`, symbol, ts, volume, open, close  from 
stock_ticks_mor_rt where  symbol = 'GOOG';
   ```
   
   - Step 5: Run compaction when the data is on COS
   
   ```
   cli/bin/hudi-cli.sh
   connect --path cosn://[bucket]/usr/hive/warehouse/stock_ticks_mor
   compactions show all
   compaction schedule
   compaction run --compactionInstant [requestid]  --parallelism 2 
--sparkMemory 1G  --schemaFilePath cosn://[bucket]/hudi/config/schema.avsc 
--retry 1
   ```
   
   
   
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git S

[jira] [Commented] (HUDI-1116) Support time travel using timestamp type

2020-07-21 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161883#comment-17161883
 ] 

Balaji Varadarajan commented on HUDI-1116:
--

One option is to provide a mapping utility which can help translate a timestamp 
to the nearest commit time for the user to run the query. 

 

A UDF-like implementation would be ideal, but I'm not sure if it is possible.
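
A hedged sketch of such a mapping utility, assuming the usual yyyyMMddHHmmss 
instant format; the HoodieTableMetaClient calls below should be double-checked 
against the current client API:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hudi.common.table.HoodieTableMetaClient

object TimestampToInstantSketch {
  // Returns the latest completed commit at or before the given wall-clock time.
  def nearestCommitTime(conf: Configuration, basePath: String, asOf: LocalDateTime): Option[String] = {
    val target = asOf.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"))
    val metaClient = new HoodieTableMetaClient(conf, basePath)
    metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants()
      .getInstants.iterator().asScala
      .map(_.getTimestamp)
      .filter(_ <= target) // instant strings sort chronologically
      .toSeq.sorted.lastOption
  }
}
```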

> Support time travel using timestamp type
> 
>
> Key: HUDI-1116
> URL: https://issues.apache.org/jira/browse/HUDI-1116
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>
>  
> {{Currently, we use commit time to mimic time-travel queries. We need ability 
> to handle time-travel with a proper timestamp provided.}}
> {{}}
> {{For e:g: }}
> {{spark.read  .format(“hudi”).option(“timestampAsOf”, 
> “2019-01-01”).load(“/path/to/my/table”)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

