[GitHub] [hudi] wangxianghu commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492493739



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   > when the user passes a regular lambda into `rdd.map()`, this is what it gets converted into
   
   The serializable issue can be solved by introducing a SerializableFunction to replace `java.util.function.Function`
   
   ```
   public interface SerializableFunction<I, O> extends Serializable {
     O call(I v1) throws Exception;
   }
   ```
   `HoodieEngineContext` can be
   ```
   public abstract class HoodieEngineContext {
     public abstract <I, O> List<O> map(List<I> data, SerializableFunction<I, O> func, int parallelism);
   }
   ```
   
   `HoodieSparkEngineContext` can be
   ```
   public class HoodieSparkEngineContext extends HoodieEngineContext {
     private static JavaSparkContext jsc;
   
     // tmp
     static {
       SparkConf conf = new SparkConf()
           .setMaster("local[4]")
           .set("spark.driver.host", "localhost")
           .setAppName("HoodieSparkEngineContext");
   
       jsc = new JavaSparkContext(conf);
     }
   
     @Override
     public <I, O> List<O> map(List<I> data, SerializableFunction<I, O> func, int parallelism) {
       return jsc.parallelize(data, parallelism).map(func::call).collect();
     }
   }
   ```
   this works :)
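   For context, a hypothetical call site (the sample data and lambda below are made up for illustration): because `SerializableFunction` extends `Serializable`, a plain lambda can be passed straight in and shipped to executors without serialization errors:
   ```
   // Hypothetical usage sketch against the classes sketched above
   // (imports of java.util.Arrays / java.util.List omitted for brevity).
   HoodieEngineContext context = new HoodieSparkEngineContext();
   List<Integer> lengths = context.map(Arrays.asList("a", "bb", "ccc"), s -> s.length(), 2);
   // lengths -> [1, 2, 3]
   ```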





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Karl-WangSK commented on pull request #2096: [HUDI-1284] preCombine all HoodieRecords and update all fields(which is not DefaultValue) according to orderingVal

2020-09-21 Thread GitBox


Karl-WangSK commented on pull request #2096:
URL: https://github.com/apache/hudi/pull/2096#issuecomment-696089004


   @vinothchandar hi. Can you look at this PR when you are free?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on issue #2098: [SUPPORT] File does not exisit(parquet) while reading Hudi Table from Spark

2020-09-21 Thread GitBox


n3nash commented on issue #2098:
URL: https://github.com/apache/hudi/issues/2098#issuecomment-696209149


   @RajasekarSribalan A FileNotFound error indicates that you are reading a version of the parquet file that has been deleted or no longer exists. This can happen for the following reason: say your job is writing to the Hudi table every 15 mins and you have chosen to keep only the latest version of the parquet file. Now, your snapshot job runs every 1 hr and takes around 1 hr to finish. What can happen is that the snapshot job ends up reading an older version of the parquet file while the new version is being created by the ingestion job, and the cleaner deletes the older version.
   Since your job was running for many days, it seems like either a) the frequency of the ingestion job to Hudi or of the snapshot job to Hive changed, b) the snapshot job now runs for a longer period of time, causing the file-not-found, or c) it was just working by chance.
   To fix this issue, please make sure you keep enough file versions so that a long-running job (like the snapshot job) can find the file it started to read in the first place.
   Please take a look at your configurations for the cleaner policy and then tune them using this config -> 
https://hudi.apache.org/docs/configurations.html#withCleanerPolicy
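   For illustration, a minimal sketch of such a tuning. The option keys below are the standard cleaner configs from the page linked above; the retention numbers are placeholders to be sized so they cover your longest-running reader, not recommendations:
   ```
   import java.util.Properties;

   public class CleanerRetentionExample {
     // Hedged sketch: only the cleaner-related options are shown here; they would
     // be passed along with the rest of the Hudi write configuration.
     public static Properties cleanerProps() {
       Properties props = new Properties();
       // Keep several file versions instead of only the latest one, so a
       // long-running snapshot job can still find the version it started reading.
       props.setProperty("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS");
       props.setProperty("hoodie.cleaner.fileversions.retained", "5");
       // Alternatively, retain enough commits to span the longest reader:
       // props.setProperty("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS");
       // props.setProperty("hoodie.cleaner.commits.retained", "12");
       return props;
     }
   }
   ```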



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 commented on a change in pull request #1968: [HUDI-1192] Make create hive database automatically configurable

2020-09-21 Thread GitBox


liujinhui1994 commented on a change in pull request #1968:
URL: https://github.com/apache/hudi/pull/1968#discussion_r492448849



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -290,6 +290,7 @@ object DataSourceWriteOptions {
   val HIVE_ASSUME_DATE_PARTITION_OPT_KEY = 
"hoodie.datasource.hive_sync.assume_date_partitioning"
   val HIVE_USE_PRE_APACHE_INPUT_FORMAT_OPT_KEY = 
"hoodie.datasource.hive_sync.use_pre_apache_input_format"
   val HIVE_USE_JDBC_OPT_KEY = "hoodie.datasource.hive_sync.use_jdbc"
+  val HIVE_AUTO_CREATE_DATABASE_OPT_KEY = 
"hoodie.datasource.hive_sync.auto.create.database"

Review comment:
   ok, thanks for the suggestion





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492468540



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   > Is it possible to take a `java.util.function.Function` and then within 
`HoodieSparkEngineContext#map` wrap that into a 
`org.apache.spark.api.java.function.Function` ?
   
   let me try





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-21 Thread GitBox


prashantwason edited a comment on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


vinothchandar commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492474495



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   when the user passes a regular lambda into `rdd.map()`, this is what it gets converted into





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


vinothchandar commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492474108



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   In Spark, there is a functional interface defined like this
   
   ```
   package org.apache.spark.api.java.function;
   
   import java.io.Serializable;
   
   /**
    * Base interface for functions whose return types do not create special RDDs. PairFunction and
    * DoubleFunction are handled separately, to allow PairRDDs and DoubleRDDs to be constructed
    * when mapping RDDs of other types.
    */
   @FunctionalInterface
   public interface Function<T1, R> extends Serializable {
     R call(T1 v1) throws Exception;
   }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-09-21 Thread GitBox


vinothchandar commented on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-696264459


   @bschell our spark install may be 2.11 on these images. As for the hudi_spark_2.12 bundle, if we run the integ-test with 2.12, I think it would happen automatically?
   @bvaradar should we switch to 2.12 and repush the images? 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2096: [HUDI-1284] preCombine all HoodieRecords and update all fields(which is not DefaultValue) according to orderingVal

2020-09-21 Thread GitBox


vinothchandar commented on pull request #2096:
URL: https://github.com/apache/hudi/pull/2096#issuecomment-696182257


   @Karl-WangSK Will do! 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 closed pull request #1984: [HUDI-1200] Fix NullPointerException, CustomKeyGenerator does not work

2020-09-21 Thread GitBox


liujinhui1994 closed pull request #1984:
URL: https://github.com/apache/hudi/pull/1984


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-21 Thread GitBox


vinothchandar commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r492405639



##
File path: 
hudi-client/src/main/java/org/apache/hudi/metadata/HoodieMetadata.java
##
@@ -0,0 +1,272 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieRestoreMetadata;
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieMetadataException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+/**
+ * Defines the interface through which file listing metadata is created, 
updated and accessed. This metadata is
+ * saved within an internal Metadata Table.
+ *
+ * If file listing metadata is disabled, the functions default to using 
listStatus(...) RPC calls to retrieve
+ * file listings from the file system.
+ */
+public class HoodieMetadata {
+  private static final Logger LOG = LogManager.getLogger(HoodieMetadata.class);
+
+  // Instances of metadata for each basePath
+  private static Map<String, HoodieMetadata> instances = new HashMap<>();

Review comment:
   I don't know if sharing this object leads to any real perf gains. For the use-case you mention, those are two different Hudi tables anyway, right?
   
   Yes, we need to pass this around the layers. The best way is to stick this into `HoodieTable`, just like it has a member variable for `index`. Then you can easily access this across layers. This also has to be made `Serializable`. In general, a global static object, when there is clearly scope for separating the instances, is kind of an anti-pattern in Java land. It's okay if you want to defer this to me. Just saying that we will probably have to change this code eventually for these reasons. 
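   To illustrate the shape being suggested (a sketch only; everything beyond `HoodieTable`, `HoodieIndex` and `HoodieMetadata`, which appear in this thread, is hypothetical), the metadata handle becomes a serializable per-table member instead of a global static map:
   ```
   public abstract class HoodieTable<T extends HoodieRecordPayload> implements Serializable {
     private final HoodieIndex<T> index;      // existing member mentioned above
     private final HoodieMetadata metadata;   // proposed member, one instance per table

     protected HoodieTable(HoodieIndex<T> index, HoodieMetadata metadata) {
       this.index = index;
       this.metadata = metadata;
     }

     // Layers that already hold the HoodieTable can reach the metadata without statics.
     public HoodieMetadata metadata() {
       return metadata;
     }
   }
   ```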

##
File path: 
hudi-client/src/main/java/org/apache/hudi/metadata/HoodieMetadataImpl.java
##
@@ -0,0 +1,1064 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.

[GitHub] [hudi] vishalpathak1986 commented on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


vishalpathak1986 commented on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696242666







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha edited a comment on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-21 Thread GitBox


satishkotha edited a comment on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-696327041


   > @satishkotha : Please ping me in the PR when you have updates and I can give incremental comments if needed.
   
   @bvaradar Incremental FileSystem restore is the only big pending item. I'll get to it later this week.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-09-21 Thread GitBox


bvaradar commented on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-696339666


   @bschell : For running integration tests with hudi packages built with scala 
2.12, we just need to change scripts/run_travis_tests.sh. The docker container 
should automatically load those jars for running integration tests.
   
   `diff --git a/scripts/run_travis_tests.sh b/scripts/run_travis_tests.sh
   index 63fb959c..b77b4f64 100755
   --- a/scripts/run_travis_tests.sh
   +++ b/scripts/run_travis_tests.sh
   @@ -35,7 +35,7 @@ elif [ "$mode" = "integration" ]; then
  export SPARK_HOME=$PWD/spark-${sparkVersion}-bin-hadoop${hadoopVersion}
  mkdir /tmp/spark-events/
  echo "Running Integration Tests"
   -  mvn verify -Pintegration-tests -B
   +  mvn verify -Pintegration-tests -Dscala-2.12 -B
else
  echo "Unknown mode $mode"
  exit 1
   `



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


vinothchandar commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492460719



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   Is it possible to take a `java.util.function.Function` and then within 
`HoodieSparkEngineContext#map` wrap that into a 
`org.apache.spark.api.java.function.Function` ? 
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-21 Thread GitBox


vinothchandar commented on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-696429882







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-21 Thread GitBox


satishkotha commented on a change in pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#discussion_r492303122



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/IncrementalTimelineSyncFileSystemView.java
##
@@ -251,6 +262,28 @@ private void addRollbackInstant(HoodieTimeline timeline, 
HoodieInstant instant)
 LOG.info("Done Syncing rollback instant (" + instant + ")");
   }
 
+  /**
+   * Add newly found REPLACE instant.
+   *
+   * @param timeline Hoodie Timeline
+   * @param instant REPLACE Instant
+   */
+  private void addReplaceInstant(HoodieTimeline timeline, HoodieInstant 
instant) throws IOException {

Review comment:
   I need to understand this flow a bit more, but I have a question on why we need to track commit-action-type and timestamp. Today, HoodieRollbackMetadata tracks successFiles, deletedFiles etc. Do you think we can add replacedFileIds there too? It will be empty for regular commits, but for replace commits it will have some content. If this value is present, we can remove the corresponding fileIds from View#replacedFileGroups. Let me know if I'm missing anything with this approach.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vishalpathak1986 edited a comment on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


vishalpathak1986 edited a comment on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696242666


   @n3nash Thanks for your comment. Can you also please elaborate on how an index will help here? Also, please let me know if you think it is possible to turn off writing inserts to parquet by default using an option, just as we have an option for small files (hoodie.parquet.small.file.limit).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


n3nash commented on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696212548







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on a change in pull request #2012: [HUDI-1129] Deltastreamer Add support for schema evolution

2020-09-21 Thread GitBox


bvaradar commented on a change in pull request #2012:
URL: https://github.com/apache/hudi/pull/2012#discussion_r492198773



##
File path: hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala
##
@@ -364,4 +366,40 @@ object AvroConversionHelper {
 }
 }
   }
+
+  /**
+   * Remove namespace from fixed field.
+   * org.apache.spark.sql.avro.SchemaConverters.toAvroType method adds 
namespace to fixed avro field
+   * 
https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L177
+   * So, we need to remove that namespace so that a reader schema without namespace does not throw an error like this one
+   * org.apache.avro.AvroTypeException: Found 
hoodie.source.hoodie_source.height.fixed, expecting fixed
+   *
+   * @param schema Schema from which namespace needs to be removed for fixed 
fields
+   * @return input schema with namespace removed for fixed fields, if any
+   */
+  def removeNamespaceFromFixedFields(schema: Schema): Schema  ={

Review comment:
   @sathyaprakashg : What happens to existing records in hudi dataset which 
have namespace for fixed fields ? 
   
https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/MergeHelper.java#L70





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu edited a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-696120725


   @vinothchandar I have refactored `org.apache.hudi.table.SparkMarkerFiles` with the `parallelDo` function; it works OK locally (`org.apache.hudi.table.TestMarkerFiles` passed), and CI has passed in my repository by now.
   
   do you have any other concerns about `parallelDo`? can I start to refactor this PR with the `parallelDo` function now?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #2099: [HUDI-1268] fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss

2020-09-21 Thread GitBox


leesf commented on pull request #2099:
URL: https://github.com/apache/hudi/pull/2099#issuecomment-696461170


   @nsivabalan would you please review this PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-695785338







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar merged pull request #2097: [MINOR] Add description to remind users that Hudis docker images have mounted the projects workspace

2020-09-21 Thread GitBox


bvaradar merged pull request #2097:
URL: https://github.com/apache/hudi/pull/2097


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492434405



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   > I think we should leave this abstract and let the engines implement 
this? even for Java. Its better to have a `HoodieJavaEngineContext`. From what 
I can see, this is not overridden in `HoodieSparkEngineContext` and thus we 
lose the parallel execution that we currently have with Spark with this change.
   
   as we discussed before, the parallelDo model needs a function as an input parameter. Unfortunately, different engines need different function types, and it is hard to align them in an abstract parallelDo method, so we agreed to use `java.util.function.Function` as the unified input function. In this way, there is no need to distinguish Spark and Flink, and the parallelism is not needed either.

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return t

[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-21 Thread GitBox


prashantwason edited a comment on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968


   Remaining work items:
   
   - [ ] 1. Support for rollbacks in MOR Table
   - [ ] 2. Rollback of metadata if commit eventually fails on dataset 
   - [x] 3. HUDI-CLI extensions for metadata debugging
   - [ ] 4. Ensure partial rollbacks do not use metadata table as it does not 
contain partial info 
   - [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have 
older timestamp than INIT timestamp on metadata table
   - [ ] 6. Check if MergedBlockReader will neglect log blocks based on 
uncommitted commits.
   - [ ] 7. Unit test for rollback of partial commits
   - [ ] 8. Schema evolution strategy for metadata table
   - [ ] 9. Unit test for marker based rollback
   - [ ] 10. Can all compaction strategies work off of metadata table itself? 
Does it have all the data
   - [ ] 11. Async Clean and Async Compaction - how will they work with 
metadata table updates - check multi writer
   - [ ] 12. Query-side use of metadata table
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-696181800


   @wangxianghu let me check that out and circle back by your morning time/EOD 
PST. we can go from there. Thanks! 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


vinothchandar commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492262964



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/HoodieSparkEngineContext.java
##
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SQLContext;
+
+/**
+ * A Spark engine implementation of HoodieEngineContext.
+ */
+public class HoodieSparkEngineContext extends HoodieEngineContext {

Review comment:
   can we implement versions of `map`, `flatMap`, `forEach` here which use `javaSparkContext.parallelize()`? It would be good to keep this PR free of any changes in terms of whether we are executing the deletes/lists in parallel or serially. 

##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/SparkMarkerFiles.java
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.LocatedFileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.RemoteIterator;
+import org.apache.hudi.bifunction.wrapper.ThrowingFunction;
+import org.apache.hudi.common.HoodieEngineContext;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static 
org.apache.hudi.bifunction.wrapper.BiFunctionWrapper.throwingConsumerWrapper;
+import static 
org.apache.hudi.bifunction.wrapper.BiFunctionWrapper.throwingFlatMapWrapper;
+
+public class SparkMarkerFiles extends BaseMarkerFiles {
+
+  private static final Logger LOG = 
LogManager.getLogger(SparkMarkerFiles.class);
+
+  public SparkMarkerFiles(HoodieTable table, String instantTime) {
+super(table, instantTime);
+  }
+
+  public SparkMarkerFiles(FileSystem fs, String basePath, String 
markerFolderPath, String instantTime) {
+super(fs, basePath, markerFolderPath, instantTime);
+  }
+
+  @Override
+  public boolean deleteMarkerDir(HoodieEngineContext context, int parallelism) 
{
+try {
+  if (fs.exists(markerDirPath)) {
+FileStatus[] fileStatuses = fs.listStatus(markerDirPath);
+List<String> markerDirSubPaths = Arrays.stream(fileStatuses)
+.map(fileStatus -> fileStatus.getPath().toString())
+.collect(Collectors.toList());
+
+if (markerDirSubPaths.size() > 0) {
+  SerializableConfiguration conf = new 
SerializableConfiguration(fs.getConf());
+  context.foreach(markerDirSubPaths, 
throwingConsumerWrapper(subPathStr -> {
+Path subPath = new Path(subPathStr);
+FileSystem fileSystem = subPath.getFileSystem(conf.get());
+fileSystem.delete(subPath, true

[GitHub] [hudi] bvaradar commented on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-21 Thread GitBox


bvaradar commented on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-696208094


   @satishkotha : Please ping me in the PR  when you have updates and I can 
give incremental comments if needed.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vishalpathak1986 closed issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


vishalpathak1986 closed issue #2095:
URL: https://github.com/apache/hudi/issues/2095


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-21 Thread GitBox


satishkotha commented on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-696327041


   > @satishkotha : Please ping me in the PR when you have updates and I can 
give incremental comments if needed.
   
   @bvaradar IncrementalTimeline restore is the only big pending item. I'll get to it later this week.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar edited a comment on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-09-21 Thread GitBox


bvaradar edited a comment on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-696339666


   @bschell : For running integration tests with hudi packages built with scala 
2.12, we just need to change scripts/run_travis_tests.sh. The docker container 
should automatically load those jars for running integration tests.
   
   ```diff --git a/scripts/run_travis_tests.sh b/scripts/run_travis_tests.sh
   index 63fb959c..b77b4f64 100755
   --- a/scripts/run_travis_tests.sh
   +++ b/scripts/run_travis_tests.sh
   @@ -35,7 +35,7 @@ elif [ "$mode" = "integration" ]; then
  export SPARK_HOME=$PWD/spark-${sparkVersion}-bin-hadoop${hadoopVersion}
  mkdir /tmp/spark-events/
  echo "Running Integration Tests"
   -  mvn verify -Pintegration-tests -B
   +  mvn verify -Pintegration-tests -Dscala-2.12 -B
else
  echo "Unknown mode $mode"
  exit 1
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on a change in pull request #2097: [MINOR] Add description to remind users that Hudis docker images have mounted the projects workspace

2020-09-21 Thread GitBox


bvaradar commented on a change in pull request #2097:
URL: https://github.com/apache/hudi/pull/2097#discussion_r492303063



##
File path: docs/_docs/0_4_docker_demo.cn.md
##
@@ -1106,6 +1106,9 @@ and compose scripts are carefully implemented so that 
they serve dual-purpose
 2. For running integration-tests, we need the jars generated locally to be 
used for running services within docker. The
docker-compose scripts (see 
`docker/compose/docker-compose_hadoop284_hive233_spark231.yml`) ensures local 
jars override
inbuilt jars by mounting local HUDI workspace over the docker location
+3. Due to the docker images have mounted local HUDI workspace, so any changes 
happen in the workspace would reflect into the 

Review comment:
   Reword To :  "As these docker containers have mounted local HUDI 
workspace, any changes that happen in the workspace would automatically reflect 
in the  containers. This is a convenient way for developing and verifying Hudi 
for 
developers who do not own a distributed environment. Note that this is how 
integration tests are run."





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #1968: [HUDI-1192] Make create hive database automatically configurable

2020-09-21 Thread GitBox


yanghua commented on a change in pull request #1968:
URL: https://github.com/apache/hudi/pull/1968#discussion_r491761859



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -290,6 +290,7 @@ object DataSourceWriteOptions {
   val HIVE_ASSUME_DATE_PARTITION_OPT_KEY = 
"hoodie.datasource.hive_sync.assume_date_partitioning"
   val HIVE_USE_PRE_APACHE_INPUT_FORMAT_OPT_KEY = 
"hoodie.datasource.hive_sync.use_pre_apache_input_format"
   val HIVE_USE_JDBC_OPT_KEY = "hoodie.datasource.hive_sync.use_jdbc"
+  val HIVE_AUTO_CREATE_DATABASE_OPT_KEY = 
"hoodie.datasource.hive_sync.auto.create.database"

Review comment:
   Following the existing style, it would be better 
`hoodie.datasource.hive_sync.auto_create_database`. WDYT?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hddong commented on pull request #1242: [HUDI-544] Archived commits command code cleanup

2020-09-21 Thread GitBox


hddong commented on pull request #1242:
URL: https://github.com/apache/hudi/pull/1242#issuecomment-695916250


   @n3nash: Had rebased this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1291) integration of replace with consolidated metadata

2020-09-21 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1291:
-
Status: Open  (was: New)

> integration of replace with consolidated metadata
> -
>
> Key: HUDI-1291
> URL: https://issues.apache.org/jira/browse/HUDI-1291
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>
> Consolidated metadata supports two operations:
> 1) add files
> 2) remove files
> If file group is replaced, we don't want to immediately remove files from 
> metadata (to support restore/rollback). So we may have to change consolidated 
> metadata to track additional information for replaced fileIds
> Today, Replaced FileIds are tracked as part of .replacecommit file. We can 
> continue this for short term, but may be more efficient to combine with 
> consolidated metadata



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


vinothchandar commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492449543



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   @wangxianghu Functionality-wise, you are correct: it can be implemented 
just using Java. But we parallelize different pieces of code, e.g. deletion 
of files in parallel using Spark, for a reason. It significantly speeds 
these up for large tables. 
   
   All I am saying is to implement `HoodieSparkEngineContext#map` like below
   
   ```
public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
   return javaSparkContext.parallelize(data, parallelism).map(func).collect();
}
   ```
   
   Similarly for the other two methods. I don't see any issues with this, do you?
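   
   To make that concrete, here is a minimal, hedged sketch of a Spark-backed context along these lines. Only the class name and the `parallelism` argument come from this thread; the constructor, the `flatMap`/`foreach` shapes, and the serializability caveat are assumptions rather than the code that was eventually merged.
   
   ```
   import java.util.List;
   import java.util.function.Consumer;
   import java.util.function.Function;
   import java.util.stream.Stream;
   
   import org.apache.spark.api.java.JavaSparkContext;
   
   // Hedged sketch only: the lambdas passed in must be serializable, since
   // java.util.function.Function/Consumer do not extend Serializable.
   public class HoodieSparkEngineContext {
   
     private final JavaSparkContext javaSparkContext;
   
     public HoodieSparkEngineContext(JavaSparkContext javaSparkContext) {
       this.javaSparkContext = javaSparkContext;
     }
   
     public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
       // func::apply is adapted to Spark's serializable Function interface
       return javaSparkContext.parallelize(data, parallelism).map(func::apply).collect();
     }
   
     public <I, O> List<O> flatMap(List<I> data, Function<I, Stream<O>> func, int parallelism) {
       // the Java RDD API expects an Iterator from flatMap
       return javaSparkContext.parallelize(data, parallelism)
           .flatMap(i -> func.apply(i).iterator())
           .collect();
     }
   
     public <I> void foreach(List<I> data, Consumer<I> consumer, int parallelism) {
       javaSparkContext.parallelize(data, parallelism).foreach(consumer::accept);
     }
   }
   ```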





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 commented on a change in pull request #1968: [HUDI-1192] Make create hive database automatically configurable

2020-09-21 Thread GitBox


liujinhui1994 commented on a change in pull request #1968:
URL: https://github.com/apache/hudi/pull/1968#discussion_r492448849



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -290,6 +290,7 @@ object DataSourceWriteOptions {
   val HIVE_ASSUME_DATE_PARTITION_OPT_KEY = 
"hoodie.datasource.hive_sync.assume_date_partitioning"
   val HIVE_USE_PRE_APACHE_INPUT_FORMAT_OPT_KEY = 
"hoodie.datasource.hive_sync.use_pre_apache_input_format"
   val HIVE_USE_JDBC_OPT_KEY = "hoodie.datasource.hive_sync.use_jdbc"
+  val HIVE_AUTO_CREATE_DATABASE_OPT_KEY = 
"hoodie.datasource.hive_sync.auto.create.database"

Review comment:
   OK, thanks for the suggestion.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: Travis CI build asf-site

2020-09-21 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 6cc1c37  Travis CI build asf-site
6cc1c37 is described below

commit 6cc1c3711ed8959e186a57520ec348299af3e8df
Author: CI 
AuthorDate: Tue Sep 22 02:14:28 2020 +

Travis CI build asf-site
---
 content/cn/docs/docker_demo.html | 3 +++
 content/docs/docker_demo.html| 3 +++
 2 files changed, 6 insertions(+)

diff --git a/content/cn/docs/docker_demo.html b/content/cn/docs/docker_demo.html
index 19902fc..05f3395 100644
--- a/content/cn/docs/docker_demo.html
+++ b/content/cn/docs/docker_demo.html
@@ -1467,6 +1467,9 @@ and compose scripts are carefully implemented so that 
they serve dual-purpose. For running integration-tests, we need the jars generated locally to be 
used for running services within docker. The
 docker-compose scripts (see docker/compose/docker-compose_hadoop284_hive233_spark231.yml)
 ensures local jars override
 inbuilt jars by mounting local HUDI workspace over the docker location
+  As these docker containers have mounted local HUDI workspace, any 
changes that happen in the workspace would automatically 
+reflect in the containers. This is a convenient way for developing and 
verifying Hudi for
+developers who do not own a distributed environment. Note that this is how 
integration tests are run.
 
 
 This helps avoid maintaining separate docker images and avoids the costly 
step of building HUDI docker images locally.
diff --git a/content/docs/docker_demo.html b/content/docs/docker_demo.html
index 0d36fae..11c955c 100644
--- a/content/docs/docker_demo.html
+++ b/content/docs/docker_demo.html
@@ -1565,6 +1565,9 @@ and compose scripts are carefully implemented so that 
they serve dual-purpose. For running integration-tests, we need the jars generated locally to be 
used for running services within docker. The
 docker-compose scripts (see docker/compose/docker-compose_hadoop284_hive233_spark231.yml)
 ensures local jars override
 inbuilt jars by mounting local HUDI workspace over the docker location
+  As these docker containers have mounted local HUDI workspace, any 
changes that happen in the workspace would automatically 
+reflect in the containers. This is a convenient way for developing and 
verifying Hudi for
+developers who do not own a distributed environment. Note that this is how 
integration tests are run.
 
 
 This helps avoid maintaining separate docker images and avoids the costly 
step of building HUDI docker images locally.



[GitHub] [hudi] wangxianghu commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492435531



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/SparkMarkerFiles.java
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.LocatedFileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.RemoteIterator;
+import org.apache.hudi.bifunction.wrapper.ThrowingFunction;
+import org.apache.hudi.common.HoodieEngineContext;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static 
org.apache.hudi.bifunction.wrapper.BiFunctionWrapper.throwingConsumerWrapper;
+import static 
org.apache.hudi.bifunction.wrapper.BiFunctionWrapper.throwingFlatMapWrapper;
+
+public class SparkMarkerFiles extends BaseMarkerFiles {

Review comment:
   > Given this file is now free of Spark, we dont have the need of 
breaking these into base and child classes right.
   
   Yes, this is an example to show you the bi-function approach; if you agree with this 
implementation, I'll roll them back into one class.
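   
   For readers skimming the thread, here is a minimal, hedged sketch of the wrapper pattern being referred to; only the `BiFunctionWrapper`/`throwingConsumerWrapper` names come from the imports in the diff above, while `ThrowingConsumer` and the unchecked rethrow are illustrative assumptions, not the PR's exact code.
   
   ```
   import java.util.function.Consumer;
   
   public final class BiFunctionWrapper {
   
     private BiFunctionWrapper() {
     }
   
     // A Consumer-like interface whose body may throw a checked exception
     // (hypothetical name; the PR defines its own throwing functional interfaces).
     @FunctionalInterface
     public interface ThrowingConsumer<T> {
       void accept(T t) throws Exception;
     }
   
     // Adapts a throwing lambda to java.util.function.Consumer by rethrowing
     // checked exceptions as unchecked ones.
     public static <T> Consumer<T> throwingConsumerWrapper(ThrowingConsumer<T> throwing) {
       return t -> {
         try {
           throwing.accept(t);
         } catch (Exception e) {
           throw new RuntimeException(e);
         }
       };
     }
   }
   ```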





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492434405



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   > I think we should leave this abstract and let the engines implement 
this? even for Java. Its better to have a `HoodieJavaEngineContext`. From what 
I can see, this is not overridden in `HoodieSparkEngineContext` and thus we 
lose the parallel execution that we currently have with Spark with this change.
   
   As we discussed before, the parallelDo model needs a function as an input parameter. 
Unfortunately, different engines need different function types, and it is hard to 
align them in an abstract parallelDo method, so we agreed to use 
`java.util.function.Function` as the unified input function. In this way, there 
is no need to distinguish Spark from Flink, no need to make the method abstract, and 
the parallelism argument is not needed either; it is just Java and can be implemented directly.
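   
   A minimal sketch of the plain-Java implementation being described (the method body is elided in the quoted diff, so its exact shape is an assumption; `List`, `Function`, and `Collectors` are the imports visible above):
   
   ```
   public <I, O> List<O> map(List<I> data, Function<I, O> func) {
     // parallel streams stand in for engine-level parallelism in this sketch
     return data.stream().parallel().map(func).collect(Collectors.toList());
   }
   ```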





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492434405



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   > I think we should leave this abstract and let the engines implement 
this? even for Java. Its better to have a `HoodieJavaEngineContext`. From what 
I can see, this is not overridden in `HoodieSparkEngineContext` and thus we 
lose the parallel execution that we currently have with Spark with this change.
   
   As we discussed before, the parallelDo model needs a function as an input parameter. 
Unfortunately, different engines need different function types, and it is hard to 
align them in an abstract parallelDo method, so we agreed to use 
`java.util.function.Function` as the unified input function. In this way, there 
is no need to distinguish Spark from Flink, and the parallelism argument is not 
needed either.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492434405



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;
+
+  private TaskContextSupplier taskContextSupplier;
+
+  public HoodieEngineContext(SerializableConfiguration hadoopConf, 
TaskContextSupplier taskContextSupplier) {
+this.hadoopConf = hadoopConf;
+this.taskContextSupplier = taskContextSupplier;
+  }
+
+  public SerializableConfiguration getHadoopConf() {
+return hadoopConf;
+  }
+
+  public TaskContextSupplier getTaskContextSupplier() {
+return taskContextSupplier;
+  }
+
+  public <I, O> List<O> map(List<I> data, Function<I, O> func) {

Review comment:
   > I think we should leave this abstract and let the engines implement 
this? even for Java. Its better to have a `HoodieJavaEngineContext`. From what 
I can see, this is not overridden in `HoodieSparkEngineContext` and thus we 
lose the parallel execution that we currently have with Spark with this change.
   
   As we discussed before, the parallelDo model needs a function as an input parameter. 
Unfortunately, different engines need different function types, and it is hard to 
align them in an abstract parallelDo method, so we agreed to use 
`java.util.function.Function` as the unified input function. In this way, there 
is no need to distinguish Spark from Flink, and the parallelism argument is not 
needed either.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-696467101


   > My primary motive of suggesting parallelDo model, is to avoid splitting 
the classes and still reap benefits of parallel execution, provided by each 
engine. I don't think we are realizing them, as this stage yet. Please let me 
know your thoughts.
   
   @vinothchandar `org.apache.hudi.table.SparkMarkerFiles` is used in many 
places, and the refactoring effort would be huge if everything were rolled back into 
one class, so I refactored the functions using the bi-function approach first (without 
packaging them in one class) just to show you the functional changes; if you agree with 
this refactoring, I can then roll them back without splitting the classes.
   It is just an example.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: [MINOR] Add description to remind users that Hudis docker images have mounted the projects workspace (#2097)

2020-09-21 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new e786ae4  [MINOR] Add description to remind users that Hudis docker 
images have mounted the projects workspace (#2097)
e786ae4 is described below

commit e786ae4ef0666cc96d2f9ae44e60459ed5ac243c
Author: vinoyang 
AuthorDate: Tue Sep 22 09:29:01 2020 +0800

[MINOR] Add description to remind users that Hudis docker images have 
mounted the projects workspace (#2097)
---
 docs/_docs/0_4_docker_demo.cn.md | 3 +++
 docs/_docs/0_4_docker_demo.md| 3 +++
 2 files changed, 6 insertions(+)

diff --git a/docs/_docs/0_4_docker_demo.cn.md b/docs/_docs/0_4_docker_demo.cn.md
index e313f52..d0005b2 100644
--- a/docs/_docs/0_4_docker_demo.cn.md
+++ b/docs/_docs/0_4_docker_demo.cn.md
@@ -1106,6 +1106,9 @@ and compose scripts are carefully implemented so that 
they serve dual-purpose
 2. For running integration-tests, we need the jars generated locally to be 
used for running services within docker. The
docker-compose scripts (see 
`docker/compose/docker-compose_hadoop284_hive233_spark231.yml`) ensures local 
jars override
inbuilt jars by mounting local HUDI workspace over the docker location
+3. As these docker containers have mounted local HUDI workspace, any changes 
that happen in the workspace would automatically 
+   reflect in the containers. This is a convenient way for developing and 
verifying Hudi for
+   developers who do not own a distributed environment. Note that this is how 
integration tests are run.
 
 This helps avoid maintaining separate docker images and avoids the costly step 
of building HUDI docker images locally.
 But if users want to test hudi from locations with lower network bandwidth, 
they can still build local images
diff --git a/docs/_docs/0_4_docker_demo.md b/docs/_docs/0_4_docker_demo.md
index 22efbe9..6c108eb 100644
--- a/docs/_docs/0_4_docker_demo.md
+++ b/docs/_docs/0_4_docker_demo.md
@@ -1186,6 +1186,9 @@ and compose scripts are carefully implemented so that 
they serve dual-purpose
 2. For running integration-tests, we need the jars generated locally to be 
used for running services within docker. The
docker-compose scripts (see 
`docker/compose/docker-compose_hadoop284_hive233_spark231.yml`) ensures local 
jars override
inbuilt jars by mounting local HUDI workspace over the docker location
+3. As these docker containers have mounted local HUDI workspace, any changes 
that happen in the workspace would automatically 
+   reflect in the containers. This is a convenient way for developing and 
verifying Hudi for
+   developers who do not own a distributed environment. Note that this is how 
integration tests are run.
 
 This helps avoid maintaining separate docker images and avoids the costly step 
of building HUDI docker images locally.
 But if users want to test hudi from locations with lower network bandwidth, 
they can still build local images



[GitHub] [hudi] bvaradar merged pull request #2097: [MINOR] Add description to remind users that Hudis docker images have mounted the projects workspace

2020-09-21 Thread GitBox


bvaradar merged pull request #2097:
URL: https://github.com/apache/hudi/pull/2097


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #2099: [HUDI-1268] fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss

2020-09-21 Thread GitBox


leesf commented on pull request #2099:
URL: https://github.com/apache/hudi/pull/2099#issuecomment-696461170


   @nsivabalan would you please review this PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1291) integration of replace with consolidated metadata

2020-09-21 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1291:
-
Fix Version/s: 0.7.0

> integration of replace with consolidated metadata
> -
>
> Key: HUDI-1291
> URL: https://issues.apache.org/jira/browse/HUDI-1291
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>
> Consolidated metadata supports two operations:
> 1) add files
> 2) remove files
> If a file group is replaced, we don't want to immediately remove its files from 
> metadata (to support restore/rollback), so we may have to change consolidated 
> metadata to track additional information for replaced fileIds.
> Today, replaced fileIds are tracked as part of the .replacecommit file. We can 
> continue this in the short term, but it may be more efficient to combine it with 
> consolidated metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1291) integration of replace with consolidated metadata

2020-09-21 Thread satish (Jira)
satish created HUDI-1291:


 Summary: integration of replace with consolidated metadata
 Key: HUDI-1291
 URL: https://issues.apache.org/jira/browse/HUDI-1291
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: satish
Assignee: satish


Consolidated metadata supports two operations:
1) add files
2) remove files

If a file group is replaced, we don't want to immediately remove its files from 
metadata (to support restore/rollback), so we may have to change consolidated 
metadata to track additional information for replaced fileIds.

Today, replaced fileIds are tracked as part of the .replacecommit file. We can 
continue this in the short term, but it may be more efficient to combine it with 
consolidated metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-21 Thread GitBox


vinothchandar commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r492406923



##
File path: 
hudi-client/src/main/java/org/apache/hudi/metadata/HoodieMetadataImpl.java
##
@@ -0,0 +1,1064 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieRestoreMetadata;
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.utils.ClientUtils;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.fs.ConsistencyGuardConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.table.view.TableFileSystemView.SliceView;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.FileIOUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.SpillableMapUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.config.HoodieMetricsConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieMetadataException;
+import org.apache.hudi.io.storage.HoodieFileReader;
+import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.metrics.HoodieMetrics;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import com.codahale.metrics.Timer;
+
+import scala.Tuple2;
+
+/**
+ * Metadata implementation which saves partition and file listing within an 
internal MOR table
+ * called Metadata Table. This table is created by listing files and 
partitions (first time) and kept in sync
+ * using the instants on the main dataset.
+ */
+public class HoodieMetadataImpl {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieMetadataImpl.class);
+
+  // Table name suffix
+  private static final String METADATA_TABLE_NAME_SUFFIX = "_metadata";
+  // Timestamp for a commit when the base dataset had not had any commits yet.
+  private static final String SOLO_COMMIT_TIMESTAMP = "00";
+
+  //

[GitHub] [hudi] vinothchandar commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-21 Thread GitBox


vinothchandar commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r492405639



##
File path: 
hudi-client/src/main/java/org/apache/hudi/metadata/HoodieMetadata.java
##
@@ -0,0 +1,272 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieRestoreMetadata;
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieMetadataException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+/**
+ * Defines the interface through which file listing metadata is created, 
updated and accessed. This metadata is
+ * saved within an internal Metadata Table.
+ *
+ * If file listing metadata is disabled, the functions default to using 
listStatus(...) RPC calls to retrieve
+ * file listings from the file system.
+ */
+public class HoodieMetadata {
+  private static final Logger LOG = LogManager.getLogger(HoodieMetadata.class);
+
+  // Instances of metadata for each basePath
+  private static Map instances = new HashMap<>();

Review comment:
   I don't know if sharing this object leads to any real perf gains. For 
the use case you mention, those are two different Hudi tables anyway, right?  
   
   Yes, we need to pass this around the layers. The best way is to stick this into 
`HoodieTable`, just like it has a member variable for `index`. Then you can 
easily access this across layers. This also has to be made `Serializable`. In 
general, global static objects, when there is clearly scope for separating 
the instances, are an anti-pattern in Java land. It's okay if you want to 
defer this to me; just saying that we will probably have to change this code 
eventually for these reasons. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-21 Thread GitBox


vinothchandar commented on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-696430277


   Food for thought: we should also think about how we are going to add new 
metadata partitions in the background, as writers/cleaner/compactors keep 
running. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-21 Thread GitBox


vinothchandar commented on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-696429882


   cc @bvaradar @n3nash as well
   
   @prashantwason Here is a corner case with syncing completed compaction from 
data timeline to metadata timeline. Consider the following sequence of events 
   
   t0: writer schedules compaction at time instant `c`
   t1: Compactor starts processing `c`'s plan
   t2: compaction finishes with `c.commit` published on the data timeline (not 
yet synced to metadata timeline) 
   t3: Next round of writing, writer opens metadata table, which adds the base 
file produced in c.commit to metadata table.
   
   Any queries running between t2 and t3 cannot rely on metadata, since the new 
base file will not be present in the metadata table. The timeline will indicate 
that the compaction completed, and the latest file slice will be computed as 
simply the logs written to the file groups since compaction. This will lead to 
incorrect results. 
   
   If we consider just the writer alone, we may be okay, since we first sync the 
metadata table before we do anything for the delta commit at t3. But in general, 
for queries, we should advise enabling metadata-table-based listings only after 
all writers/cleaners/compactors have been enabled to use metadata and have been 
successfully using it to publish new/deleted files directly to the metadata 
table. In short, queries cannot rely on the metadata table while the syncing 
mechanism is the main thing that keeps the data and metadata timelines together. 
   
   
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-09-21 Thread GitBox


bvaradar commented on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-696339666


   @bschell : For running integration tests with hudi packages built with scala 
2.12, we just need to change scripts/run_travis_tests.sh. The docker container 
should automatically load those jars for running integration tests.
   
   `diff --git a/scripts/run_travis_tests.sh b/scripts/run_travis_tests.sh
   index 63fb959c..b77b4f64 100755
   --- a/scripts/run_travis_tests.sh
   +++ b/scripts/run_travis_tests.sh
   @@ -35,7 +35,7 @@ elif [ "$mode" = "integration" ]; then
  export SPARK_HOME=$PWD/spark-${sparkVersion}-bin-hadoop${hadoopVersion}
  mkdir /tmp/spark-events/
  echo "Running Integration Tests"
   -  mvn verify -Pintegration-tests -B
   +  mvn verify -Pintegration-tests -Dscala-2.12 -B
else
  echo "Unknown mode $mode"
  exit 1
   `



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar edited a comment on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-09-21 Thread GitBox


bvaradar edited a comment on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-696339666


   @bschell : For running integration tests with hudi packages built with scala 
2.12, we just need to change scripts/run_travis_tests.sh. The docker container 
should automatically load those jars for running integration tests.
   
   ```diff --git a/scripts/run_travis_tests.sh b/scripts/run_travis_tests.sh
   index 63fb959c..b77b4f64 100755
   --- a/scripts/run_travis_tests.sh
   +++ b/scripts/run_travis_tests.sh
   @@ -35,7 +35,7 @@ elif [ "$mode" = "integration" ]; then
  export SPARK_HOME=$PWD/spark-${sparkVersion}-bin-hadoop${hadoopVersion}
  mkdir /tmp/spark-events/
  echo "Running Integration Tests"
   -  mvn verify -Pintegration-tests -B
   +  mvn verify -Pintegration-tests -Dscala-2.12 -B
else
  echo "Unknown mode $mode"
  exit 1
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on a change in pull request #2097: [MINOR] Add description to remind users that Hudis docker images have mounted the projects workspace

2020-09-21 Thread GitBox


bvaradar commented on a change in pull request #2097:
URL: https://github.com/apache/hudi/pull/2097#discussion_r492303063



##
File path: docs/_docs/0_4_docker_demo.cn.md
##
@@ -1106,6 +1106,9 @@ and compose scripts are carefully implemented so that 
they serve dual-purpose
 2. For running integration-tests, we need the jars generated locally to be 
used for running services within docker. The
docker-compose scripts (see 
`docker/compose/docker-compose_hadoop284_hive233_spark231.yml`) ensures local 
jars override
inbuilt jars by mounting local HUDI workspace over the docker location
+3. Due to the docker images have mounted local HUDI workspace, so any changes 
happen in the workspace would reflect into the 

Review comment:
   Reword To :  "As these docker containers have mounted local HUDI 
workspace, any changes that happen in the workspace would automatically reflect 
in the  containers. This is a convenient way for developing and verifying Hudi 
for 
developers who do not own a distributed environment. Note that this is how 
integration tests are run."





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-21 Thread GitBox


satishkotha commented on a change in pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#discussion_r492303122



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/IncrementalTimelineSyncFileSystemView.java
##
@@ -251,6 +262,28 @@ private void addRollbackInstant(HoodieTimeline timeline, 
HoodieInstant instant)
 LOG.info("Done Syncing rollback instant (" + instant + ")");
   }
 
+  /**
+   * Add newly found REPLACE instant.
+   *
+   * @param timeline Hoodie Timeline
+   * @param instant REPLACE Instant
+   */
+  private void addReplaceInstant(HoodieTimeline timeline, HoodieInstant 
instant) throws IOException {

Review comment:
   I need to understand this flow a bit more, but I have a question on why 
we need to track commit-action-type and timestamp. Today, 
HoodieRollbackMetadata tracks successFiles, deletedFiles, etc. Do you think we 
can add replacedFileIds there as well? This would be empty for regular commits, but 
for replace commits it would have some content. If this value is present, we 
can remove the corresponding fileIds from View#replacedFileGroups. Let me know if 
I'm missing anything with this approach.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha edited a comment on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-21 Thread GitBox


satishkotha edited a comment on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-696327041


   > @satishkotha : Please ping me in the PR when you have updates and I can 
give incremental comments if needed.
   
   @bvaradar Incremental FileSystem restore is the only big pending item. I'll 
get to it in the later part of this week.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-21 Thread GitBox


satishkotha commented on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-696327041


   > @satishkotha : Please ping me in the PR when you have updates and I can 
give incremental comments if needed.
   
   @bvaradar IncrementalTimeline restore is the only big pending item. I'll get 
to it in the later part of this week.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-21 Thread GitBox


prashantwason edited a comment on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968


   Remaining work items:
   
   - [ ] 1. Support for rollbacks in MOR Table
   - [ ] 2. Rollback of metadata if commit eventually fails on dataset 
   - [x] 3. HUDI-CLI extensions for metadata debugging
   - [ ] 4. Ensure partial rollbacks do not use metadata table as it does not 
contain partial info 
   - [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have 
older timestamp than INIT timestamp on metadata table
   - [ ] 6. Check if MergedBlockReader will neglect log blocks based on 
uncommitted commits.
   - [ ] 7. Unit test for rollback of partial commits
   - [ ] 8. Schema evolution strategy for metadata table
   - [ ] 9. Unit test for marker based rollback
   - [ ] 10. Can all compaction strategies work off of metadata table itself? 
Does it have all the data
   - [ ] 11. Async Clean and Async Compaction - how will they work with 
metadata table updates - check multi writer
   - [ ] 12. Query-side use of metadata table
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


vinothchandar commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r492262964



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/HoodieSparkEngineContext.java
##
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SQLContext;
+
+/**
+ * A Spark engine implementation of HoodieEngineContext.
+ */
+public class HoodieSparkEngineContext extends HoodieEngineContext {

Review comment:
   can we implement versions of `map`, `flatMap`, `forEach` here which use 
`javaSparkContext.parallelize()`? It would be good to keep this PR free of any 
changes in terms of whether we are executing the deletes/lists in parallel or 
serially. 

##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/SparkMarkerFiles.java
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.LocatedFileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.RemoteIterator;
+import org.apache.hudi.bifunction.wrapper.ThrowingFunction;
+import org.apache.hudi.common.HoodieEngineContext;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.model.IOType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static 
org.apache.hudi.bifunction.wrapper.BiFunctionWrapper.throwingConsumerWrapper;
+import static 
org.apache.hudi.bifunction.wrapper.BiFunctionWrapper.throwingFlatMapWrapper;
+
+public class SparkMarkerFiles extends BaseMarkerFiles {
+
+  private static final Logger LOG = 
LogManager.getLogger(SparkMarkerFiles.class);
+
+  public SparkMarkerFiles(HoodieTable table, String instantTime) {
+super(table, instantTime);
+  }
+
+  public SparkMarkerFiles(FileSystem fs, String basePath, String 
markerFolderPath, String instantTime) {
+super(fs, basePath, markerFolderPath, instantTime);
+  }
+
+  @Override
+  public boolean deleteMarkerDir(HoodieEngineContext context, int parallelism) 
{
+try {
+  if (fs.exists(markerDirPath)) {
+FileStatus[] fileStatuses = fs.listStatus(markerDirPath);
+List<String> markerDirSubPaths = Arrays.stream(fileStatuses)
+.map(fileStatus -> fileStatus.getPath().toString())
+.collect(Collectors.toList());
+
+if (markerDirSubPaths.size() > 0) {
+  SerializableConfiguration conf = new 
SerializableConfiguration(fs.getConf());
+  context.foreach(markerDirSubPaths, 
throwingConsumerWrapper(subPathStr -> {
+Path subPath = new Path(subPathStr);
+FileSystem fileSystem = subPath.getFileSystem(conf.get());
+fileSystem.delete(subPath, true

[GitHub] [hudi] vishalpathak1986 commented on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


vishalpathak1986 commented on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696265004


   Got it. Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vishalpathak1986 closed issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


vishalpathak1986 closed issue #2095:
URL: https://github.com/apache/hudi/issues/2095


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-09-21 Thread GitBox


vinothchandar commented on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-696264459


   @bschell our Spark install may be 2.11 on these images. As for the 
hudi_spark_2.12 bundle, if we run the integ-test with 2.12, I think it would happen 
automatically?
   @bvaradar should we switch to 2.12 and repush the images? 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


n3nash commented on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696259058


   @vishalpathak1986 Hudi internally maintains an Index to tag the incoming 
records with the fileId that it maps to. If inserts are written to log files, 
we require a way from the index to know which log file a particular record was 
written to. We have indexes such as BloomIndex which are written to parquet 
files and hence we are able to figure out if a record is present in a parquet 
file or not, but we don't have such an index for log files. 
   There is no config to turn off writing inserts as parquet; you just have to 
use an index implementation that can index log files. Currently, only the 
HBaseIndex can index log files -> 
https://github.com/apache/hudi/blob/c8e19e2def0c33415bc3945ffb81f524c484c924/hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java#L483.
 In the future, the record level index I pointed out earlier will be able to 
index the log files which will eliminate the need of an external K-V store for 
this feature.
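   
   For anyone who finds this issue later, a hedged sketch of switching the write config to the HBase index (the `hoodie.index.*` keys are standard Hudi index configs; the ZooKeeper hosts and table name below are placeholders, not values from this thread):
   
   ```
   // Hedged sketch: select the HBase index so that inserts routed to log files
   // can still be tagged by the index; values below are placeholders.
   java.util.Properties props = new java.util.Properties();
   props.setProperty("hoodie.index.type", "HBASE");
   props.setProperty("hoodie.index.hbase.zkquorum", "zk-host-1,zk-host-2");
   props.setProperty("hoodie.index.hbase.zkport", "2181");
   props.setProperty("hoodie.index.hbase.table", "hudi_record_index");
   ```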



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vishalpathak1986 edited a comment on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


vishalpathak1986 edited a comment on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696242666


   @n3nash Thanks for your comment. Can you also please elaborate on how an 
index will help this? Also, please let me know if you think it is possible to 
turn off writing inserts to parquet by default using an option, just as we 
have an option for small files (hoodie.parquet.small.file.limit).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vishalpathak1986 commented on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


vishalpathak1986 commented on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696242666


   @n3nash Thanks for your comment. Can you also please elaborate on how an 
index will help this?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1290) Implement Debezium avro source for Delta Streamer

2020-09-21 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199499#comment-17199499
 ] 

liwei commented on HUDI-1290:
-

I have met some users who want Hudi to support Debezium or Canal for collecting the MySQL binlog.

> Implement Debezium avro source for Delta Streamer
> -
>
> Key: HUDI-1290
> URL: https://issues.apache.org/jira/browse/HUDI-1290
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> We need to implement transformer and payloads for seamlessly pulling change 
> logs emitted by debezium in Kafka. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1268) Fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss

2020-09-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1268:
-
Labels: pull-request-available  (was: )

> Fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss
> 
>
> Key: HUDI-1268
> URL: https://issues.apache.org/jira/browse/HUDI-1268
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: leesf
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> 1. Issue: in UpgradeDowngrade.run()
> fs.rename(updatedPropsFilePath, propsFilePath); 
> fs.rename behaves differently depending on the file system:
> // a) for HDFS: if propsFilePath already exists, fs.rename will not replace 
> propsFilePath, but just returns false
> // b) for the local fs: if propsFilePath already exists, fs.rename will replace 
> propsFilePath and return true
> // c) for Aliyun OSS fs: if propsFilePath already exists, fs.rename will throw 
> FileAlreadyExistsException
> // so we should delete the old propsFilePath first; this also keeps upgrade and 
> downgrade idempotent
>  
> 2. For the Aliyun OSS filesystem
> when using HoodieWriteClient API to write data to hudi with following config:
> ```
> Properties properties = new Properties();
>  properties.setProperty(HoodieTableConfig.HOODIE_TABLE_NAME_PROP_NAME, 
> tableName);
>  properties.setProperty(HoodieTableConfig.HOODIE_TABLE_TYPE_PROP_NAME, 
> tableType.name());
>  properties.setProperty(HoodieTableConfig.HOODIE_PAYLOAD_CLASS_PROP_NAME, 
> OverwriteWithLatestAvroPayload.class.getName());
>  properties.setProperty(HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, 
> "archived");
>  return HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath, 
> properties);
> ```
> a FileAlreadyExistsException will be thrown on Aliyun OSS; after debugging, 
> the following code is what throws the exception.
>  
> ```
> // Rename the .updated file to hoodie.properties. This is atomic in hdfs, but 
> not in cloud stores.
>  // But as long as this does not leave a partial hoodie.properties file, we 
> are okay.
>  fs.rename(updatedPropsFilePath, propsFilePath);
> ```
> however, we would ignore the FileAlreadyExistsException since 
> hoodie.properties already exists.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] lw309637554 opened a new pull request #2099: [HUDI-1268] fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss

2020-09-21 Thread GitBox


lw309637554 opened a new pull request #2099:
URL: https://github.com/apache/hudi/pull/2099


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   in UpgradeDowngrade.run()
   
   fs.rename(updatedPropsFilePath, propsFilePath); 
   
   fs.rename behaves differently depending on the file system:
   // a) for HDFS: if propsFilePath already exists, fs.rename will not replace 
propsFilePath, but just returns false
   // b) for the local fs: if propsFilePath already exists, fs.rename will replace 
propsFilePath and return true
   // c) for Aliyun OSS fs: if propsFilePath already exists, fs.rename will throw 
FileAlreadyExistsException
   // so we should delete the old propsFilePath first; this also keeps upgrade and 
downgrade idempotent (see the sketch below).
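   The sketch below is illustrative only, not this PR's actual diff; `fs`, `updatedPropsFilePath` and `propsFilePath` follow the names used in `UpgradeDowngrade.run()`:
   ```
   import java.io.IOException;

   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class PropsFileReplaceSketch {
     // Delete any existing hoodie.properties first so that the subsequent rename behaves
     // the same on HDFS (returns false), the local FS (overwrites) and Aliyun OSS (throws).
     public static void replaceProps(FileSystem fs, Path updatedPropsFilePath, Path propsFilePath)
         throws IOException {
       if (fs.exists(propsFilePath)) {
         fs.delete(propsFilePath, false);
       }
       if (!fs.rename(updatedPropsFilePath, propsFilePath)) {
         throw new IOException("Could not rename " + updatedPropsFilePath + " to " + propsFilePath);
       }
     }
   }
   ```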
   
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on a change in pull request #2012: [HUDI-1129] Deltastreamer Add support for schema evolution

2020-09-21 Thread GitBox


bvaradar commented on a change in pull request #2012:
URL: https://github.com/apache/hudi/pull/2012#discussion_r492198773



##
File path: hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala
##
@@ -364,4 +366,40 @@ object AvroConversionHelper {
 }
 }
   }
+
+  /**
+   * Remove namespace from fixed field.
+   * org.apache.spark.sql.avro.SchemaConverters.toAvroType method adds 
namespace to fixed avro field
+   * 
https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L177
+   * So, we need to remove that namespace so that a reader schema without a 
namespace does not throw an error like this one
+   * org.apache.avro.AvroTypeException: Found 
hoodie.source.hoodie_source.height.fixed, expecting fixed
+   *
+   * @param schema Schema from which namespace needs to be removed for fixed 
fields
+   * @return input schema with namespace removed for fixed fields, if any
+   */
+  def removeNamespaceFromFixedFields(schema: Schema): Schema  ={

Review comment:
   @sathyaprakashg : What happens to existing records in the hudi dataset which 
have a namespace for fixed fields?
   
https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/MergeHelper.java#L70





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1268) Fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss

2020-09-21 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1268:

Description: 
1. Issue: in UpgradeDowngrade.run()

fs.rename(updatedPropsFilePath, propsFilePath); 

fs.rename behaves differently depending on the file system:
// a) for HDFS: if propsFilePath already exists, fs.rename will not replace 
propsFilePath, but just returns false
// b) for the local fs: if propsFilePath already exists, fs.rename will replace 
propsFilePath and return true
// c) for Aliyun OSS fs: if propsFilePath already exists, fs.rename will throw 
FileAlreadyExistsException
// so we should delete the old propsFilePath first; this also keeps upgrade and 
downgrade idempotent

 

2. For the Aliyun OSS filesystem

when using HoodieWriteClient API to write data to hudi with following config:

```

Properties properties = new Properties();
 properties.setProperty(HoodieTableConfig.HOODIE_TABLE_NAME_PROP_NAME, 
tableName);
 properties.setProperty(HoodieTableConfig.HOODIE_TABLE_TYPE_PROP_NAME, 
tableType.name());
 properties.setProperty(HoodieTableConfig.HOODIE_PAYLOAD_CLASS_PROP_NAME, 
OverwriteWithLatestAvroPayload.class.getName());
 properties.setProperty(HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, 
"archived");
 return HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath, 
properties);

```

a FileAlreadyExistsException will be thrown on Aliyun OSS; after debugging, 
the following code is what throws the exception.

 

```

// Rename the .updated file to hoodie.properties. This is atomic in hdfs, but 
not in cloud stores.
 // But as long as this does not leave a partial hoodie.properties file, we are 
okay.
 fs.rename(updatedPropsFilePath, propsFilePath);

```

however, we would ignore the FileAlreadyExistsException since hoodie.properties 
already exists.

 

  was:
when using HoodieWriteClient API to write data to hudi with following config:

```

Properties properties = new Properties();
properties.setProperty(HoodieTableConfig.HOODIE_TABLE_NAME_PROP_NAME, 
tableName);
properties.setProperty(HoodieTableConfig.HOODIE_TABLE_TYPE_PROP_NAME, 
tableType.name());
properties.setProperty(HoodieTableConfig.HOODIE_PAYLOAD_CLASS_PROP_NAME, 
OverwriteWithLatestAvroPayload.class.getName());
properties.setProperty(HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, 
"archived");
return HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath, 
properties);

```

a FileAlreadyExistsException will be thrown on Aliyun OSS; after debugging, 
the following code is what throws the exception.

 

```

// Rename the .updated file to hoodie.properties. This is atomic in hdfs, but 
not in cloud stores.
// But as long as this does not leave a partial hoodie.properties file, we are 
okay.
fs.rename(updatedPropsFilePath, propsFilePath);

```

however, we would ignore the FileAlreadyExistsException since hoodie.properties 
already exists.

 


> Fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss
> 
>
> Key: HUDI-1268
> URL: https://issues.apache.org/jira/browse/HUDI-1268
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: leesf
>Assignee: liwei
>Priority: Major
> Fix For: 0.6.1
>
>
> 1. Issue: in UpgradeDowngrade.run()
> fs.rename(updatedPropsFilePath, propsFilePath); 
> fs.rename behaves differently depending on the file system:
> // a) for HDFS: if propsFilePath already exists, fs.rename will not replace 
> propsFilePath, but just returns false
> // b) for the local fs: if propsFilePath already exists, fs.rename will replace 
> propsFilePath and return true
> // c) for Aliyun OSS fs: if propsFilePath already exists, fs.rename will throw 
> FileAlreadyExistsException
> // so we should delete the old propsFilePath first; this also keeps upgrade and 
> downgrade idempotent
>  
> 2. For the Aliyun OSS filesystem
> when using HoodieWriteClient API to write data to hudi with following config:
> ```
> Properties properties = new Properties();
>  properties.setProperty(HoodieTableConfig.HOODIE_TABLE_NAME_PROP_NAME, 
> tableName);
>  properties.setProperty(HoodieTableConfig.HOODIE_TABLE_TYPE_PROP_NAME, 
> tableType.name());
>  properties.setProperty(HoodieTableConfig.HOODIE_PAYLOAD_CLASS_PROP_NAME, 
> OverwriteWithLatestAvroPayload.class.getName());
>  properties.setProperty(HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, 
> "archived");
>  return HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath, 
> properties);
> ```
> a FileAlreadyExistsException will be thrown on Aliyun OSS; after debugging, 
> the following code is what throws the exception.
>  
> ```
> // Rename the .updated file to hoodie.properties. This is atomic in hdfs, but 
> not in cloud stores.
>  // But as long as this does not leave a partial hoodie.properties file, we 
> are okay.
>

[jira] [Updated] (HUDI-1268) Fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss

2020-09-21 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1268:

Summary: Fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss  
(was: Fix UpgradeDowngrade Rename Exception in aliyun  OSS)

> Fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss
> 
>
> Key: HUDI-1268
> URL: https://issues.apache.org/jira/browse/HUDI-1268
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: leesf
>Assignee: liwei
>Priority: Major
> Fix For: 0.6.1
>
>
> when using HoodieWriteClient API to write data to hudi with following config:
> ```
> Properties properties = new Properties();
> properties.setProperty(HoodieTableConfig.HOODIE_TABLE_NAME_PROP_NAME, 
> tableName);
> properties.setProperty(HoodieTableConfig.HOODIE_TABLE_TYPE_PROP_NAME, 
> tableType.name());
> properties.setProperty(HoodieTableConfig.HOODIE_PAYLOAD_CLASS_PROP_NAME, 
> OverwriteWithLatestAvroPayload.class.getName());
> properties.setProperty(HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, 
> "archived");
> return HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath, 
> properties);
> ```
> a FileAlreadyExistsException will be thrown on Aliyun OSS; after debugging, 
> the following code is what throws the exception.
>  
> ```
> // Rename the .updated file to hoodie.properties. This is atomic in hdfs, but 
> not in cloud stores.
> // But as long as this does not leave a partial hoodie.properties file, we 
> are okay.
> fs.rename(updatedPropsFilePath, propsFilePath);
> ```
> however, we would ignore the FileAlreadyExistsException since 
> hoodie.properties already exists.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] n3nash commented on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


n3nash commented on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696212548


   @vishalpathak1986 Currently, Hudi writes inserts in columnar file 
format (parquet) for MOR tables: all inserts go to parquet, while updates go to 
Avro log files. 
   This is done for two reasons: a) if you only have inserts, you don't have to 
compact again and your data is written in columnar file format to start with; 
b) the absence of an index that can index log files.
   Writing inserts to log files will soon be supported with -> 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets
 or you can try the 
[HbaseIndex](https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java)
 in the meantime, which requires an HBase cluster. 
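   As a hedged illustration of how the two datasource query types surface this behaviour (assuming Spark datasource reads and the documented `hoodie.datasource.query.type` option; the base path and partition glob are placeholders):
   ```
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;

   public class MorQueryTypesExample {
     public static void read(SparkSession spark, String basePath) {
       // Read-optimized query scans only the base parquet files, so inserts written as
       // parquet are visible even before any compaction has run.
       Dataset<Row> readOptimized = spark.read().format("org.apache.hudi")
           .option("hoodie.datasource.query.type", "read_optimized")
           .load(basePath + "/*/*");
       // Snapshot query additionally merges the Avro log files (the updates) on read.
       Dataset<Row> snapshot = spark.read().format("org.apache.hudi")
           .option("hoodie.datasource.query.type", "snapshot")
           .load(basePath + "/*/*");
       readOptimized.show();
       snapshot.show();
     }
   }
   ```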



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on issue #2098: [SUPPORT] File does not exist (parquet) while reading Hudi Table from Spark

2020-09-21 Thread GitBox


n3nash commented on issue #2098:
URL: https://github.com/apache/hudi/issues/2098#issuecomment-696209149


   @RajasekarSribalan A FileNotFound error indicates that you are reading a 
version of the parquet file that has been deleted or no longer exists. This can 
happen for the following reason: say your job is writing to the Hudi table 
every 15 minutes and you have chosen to keep only the latest version of each 
parquet file. Now, your snapshot job runs every hour and takes around an hour to 
finish. What can happen is that the snapshot job ends up reading an older 
version of the parquet file while the new version is being created by the 
ingestion job, and the cleaner deletes the older version.
   Since your job was running for many days, it seems like either a) the 
frequency of the ingestion job to Hudi or of the snapshot job to Hive changed, b) 
the snapshot job now runs for a longer period of time, causing the file-not-found 
error, or c) it was just working by chance.
   To fix this issue, please make sure you keep enough file versions 
so that a long-running job (like the snapshot job) can find the file it started 
to read in the first place. 
   Please take a look at your configurations for the cleaner policy and then 
tune them using this config -> 
https://hudi.apache.org/docs/configurations.html#withCleanerPolicy
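   As an illustration only (the retention values are placeholders, not a recommendation), retention can be raised either by keeping more commits or more file versions via the documented cleaner configs:
   ```
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   public class CleanerRetentionExample {
     // Keep enough commits (or file versions) so a long-running snapshot job can still
     // find the parquet file version it started reading.
     public static void write(Dataset<Row> df, String basePath) {
       df.write().format("org.apache.hudi")
           .option("hoodie.table.name", "my_table")                       // placeholder
           .option("hoodie.datasource.write.recordkey.field", "id")       // placeholder
           .option("hoodie.datasource.write.precombine.field", "ts")      // placeholder
           .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
           .option("hoodie.cleaner.commits.retained", "24")               // cover the longest reader
           // or: .option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS")
           //     .option("hoodie.cleaner.fileversions.retained", "5")
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }
   ```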



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-21 Thread GitBox


bvaradar commented on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-696208094


   @satishkotha : Please ping me in the PR  when you have updates and I can 
give incremental comments if needed.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1290) Implement Debezium avro source for Delta Streamer

2020-09-21 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1290:
-
Status: Open  (was: New)

> Implement Debezium avro source for Delta Streamer
> -
>
> Key: HUDI-1290
> URL: https://issues.apache.org/jira/browse/HUDI-1290
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> We need to implement transformer and payloads for seamlessly pulling change 
> logs emitted by debezium in Kafka. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1290) Implement Debezium avro source for Delta Streamer

2020-09-21 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1290:


Assignee: Balaji Varadarajan

> Implement Debezium avro source for Delta Streamer
> -
>
> Key: HUDI-1290
> URL: https://issues.apache.org/jira/browse/HUDI-1290
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> We need to implement transformer and payloads for seamlessly pulling change 
> logs emitted by debezium in Kafka. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1290) Implement Debezium avro source for Delta Streamer

2020-09-21 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1290:


 Summary: Implement Debezium avro source for Delta Streamer
 Key: HUDI-1290
 URL: https://issues.apache.org/jira/browse/HUDI-1290
 Project: Apache Hudi
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


We need to implement transformer and payloads for seamlessly pulling change 
logs emitted by debezium in Kafka. 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar commented on pull request #2096: [HUDI-1284] preCombine all HoodieRecords and update all fields(which is not DefaultValue) according to orderingVal

2020-09-21 Thread GitBox


vinothchandar commented on pull request #2096:
URL: https://github.com/apache/hudi/pull/2096#issuecomment-696182257


   @Karl-WangSK Will do! 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-696181800


   @wangxianghu let me check that out and circle back by your morning time/EOD 
PST. We can go from there. Thanks! 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu edited a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-696120725


   @vinothchandar I have refactored `org.apache.hudi.table.SparkMarkerFiles` 
with the `parallelDo` function; it works fine locally 
(`org.apache.hudi.table.TestMarkerFiles` passed), and CI has now passed 
in my repository.
   
   Do you have any other concerns about `parallelDo`? Can I start to refactor 
this PR with the `parallelDo` function now?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-21 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-696120725


   @vinothchandar I have refactored `org.apache.hudi.table.SparkMarkerFiles` 
with the `parallelDo` function; it works fine locally 
(`org.apache.hudi.table.TestMarkerFiles` passed), and CI has now passed 
in my repository.
   
   Do you have any other concerns about `parallelDo`? Can I start to refactor 
this PR with the `parallelDo` function now?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Karl-WangSK commented on pull request #2096: [HUDI-1284] preCombine all HoodieRecords and update all fields(which is not DefaultValue) according to orderingVal

2020-09-21 Thread GitBox


Karl-WangSK commented on pull request #2096:
URL: https://github.com/apache/hudi/pull/2096#issuecomment-696089004


   Hi @vinothchandar, can you look at this PR when you are free?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 closed pull request #1984: [HUDI-1200] Fix NullPointerException, CustomKeyGenerator does not work

2020-09-21 Thread GitBox


liujinhui1994 closed pull request #1984:
URL: https://github.com/apache/hudi/pull/1984


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] RajasekarSribalan opened a new issue #2098: [SUPPORT]

2020-09-21 Thread GitBox


RajasekarSribalan opened a new issue #2098:
URL: https://github.com/apache/hudi/issues/2098


   Hi Team,
   
   We are getting a parquet file-not-found error while reading a Hudi table from 
Spark.
   
   What we do:
   
   We read a Hudi table from Spark (select * from table) and do an insert 
overwrite into another Hive table. Basically, we take a snapshot of the Hudi table 
and insert it into another table. For the last couple of days, we have not been 
able to read the entire table from Spark. We are getting the error below.
   
   Caused by: java.io.FileNotFoundException: File does not exist: 
hdfs://baikal-prod/user/tempuser/hudi/hudipath/tbale_name/343f7bec-e29d-4b1e-a429-463c8efb09fb-0_91-23390-2236222_20200921075346.parquet
   at 
org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
   at 
org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
   at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
   at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
   at 
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
   at 
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
   at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
   at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
   at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
   at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
   at 
org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:297)
   at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:246)
   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:203)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
   at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
   at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
   at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
   at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
   at org.apache.spark.scheduler.Task.run(Task.scala:108)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   
   
   Note: in parallel, we have a streaming job which populates the source Hudi 
table.
   
   How do we fix this kind of issue? Please assist. Is there a Hudi metadata 
problem? It was running perfectly for many days, but we are not sure why it now 
throws a file-not-found error. Did the cleaner not work properly? Please help.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org