[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-10-26 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r1006287106


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##
@@ -109,4 +109,22 @@ public static List<Pair<String, String>> getIOColumnNameAndTypes(Configuration co
         .collect(Collectors.toList());
   }
 
+  /**
+   * If the schema contains timestamp columns, this method is used for compatibility when there are no timestamp fields.
+   * We expect 3 cases to use the parquet-avro reader to read a timestamp column:
+   *  1. read columns contain timestamp type
+   *  2. no read columns, and existing original columns contain timestamp type
+   *  3. no read columns and no original columns, but the avro schema contains the type
+   */
+  public static boolean supportTimestamp(Configuration conf) {
+    List<String> reads = Arrays.asList(getReadColumnNames(conf));
+    if (reads.isEmpty()) {
+      return getIOColumnTypes(conf).contains("timestamp");
+    }
+    List<String> names = getIOColumns(conf);
+    List<String> types = getIOColumnTypes(conf);
+    return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> reads.contains(names.get(i)))

Review Comment:
   @xushiyan 
   I was trying to account for the worst-case scenario where the columns can't be found. I had encountered a similar problem with some of the test cases when running the Azure CI, so I've added this handling here.
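
   For context, the `return` expression above is cut off in this archive; a plausible completion (my reconstruction from the three cases in the javadoc, not the verbatim patch) maps the projected column indices to their types:
   ```java
   // Hypothetical completion of the truncated expression; the PR's exact code may differ.
   return types.isEmpty() || IntStream.range(0, names.size())
       .filter(i -> reads.contains(names.get(i)))  // keep only the projected columns
       .mapToObj(types::get)                       // look up each projected column's type
       .anyMatch(t -> t.contains("timestamp"));    // true if any projected column is a timestamp
   ```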



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-06-16 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r899667357


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java:
##
@@ -76,8 +76,10 @@ public class HiveSyncConfig extends HoodieSyncConfig {
   @Parameter(names = {"--help", "-h"}, help = true)
   public Boolean help = false;
 
-  @Parameter(names = {"--support-timestamp"}, description = "'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type."
-      + "Disabled by default for backward compatibility.")
+  @Parameter(names = {"--support-timestamp"},

Review Comment:
   @xushiyan 
   Hi~ I've created a new issue in JIRA: https://issues.apache.org/jira/browse/HUDI-4274
   Please check whether the JIRA issue looks OK. After this PR is merged, I plan to complete the test/fix in Presto/Trino before version 0.12.






[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-06-16 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r899666955


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetInputFormat.java:
##
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.avro;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.parquet.hadoop.ParquetInputFormat;
+import org.apache.parquet.hadoop.util.ContextUtil;
+
+import java.io.IOException;
+
+public class HoodieAvroParquetInputFormat extends ParquetInputFormat<ArrayWritable> {

Review Comment:
   @codope 
   Hi~ I've created a new issue in JIRA: https://issues.apache.org/jira/browse/HUDI-4274
   After this PR is merged, I will complete the test/fix in Presto/Trino before version 0.12.
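
   The quoted diff stops at the class declaration. A minimal sketch of what the override plausibly looks like (my assumption based on the imports shown and the `HoodieAvroParquetReader` discussed later in this thread, not the verbatim 40-line file):
   ```java
   // Hypothetical sketch: delegate record reading to the Avro-based reader so
   // Hive 3 timestamp/date values are materialized correctly.
   @Override
   public RecordReader<Void, ArrayWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
     Configuration conf = ContextUtil.getConfiguration(context);
     return new HoodieAvroParquetReader(split, conf);
   }
   ```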



[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-06-13 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r895736114


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetInputFormat.java:
##
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.avro;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.parquet.hadoop.ParquetInputFormat;
+import org.apache.parquet.hadoop.util.ContextUtil;
+
+import java.io.IOException;
+
+public class HoodieAvroParquetInputFormat extends ParquetInputFormat<ArrayWritable> {

Review Comment:
   @codope 
   The question you raised is a very good one; I may have been thoughtless here.
   This PR mainly focuses on Hive. It also tests and verifies that Spark reads Hive tables. I don't know much about Presto/Trino. I hope @XuQianJin-Stars can help verify this and give feedback in follow-up issues/PRs covering more aspects of Presto/Trino. Thanks ~






[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-06-13 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r895728574


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.avro;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hudi.hadoop.HoodieColumnProjectionUtils;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
+import org.apache.parquet.avro.AvroReadSupport;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetInputSplit;
+import org.apache.parquet.hadoop.ParquetRecordReader;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import static org.apache.parquet.hadoop.ParquetInputFormat.getFilter;
+
+public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
+
+  private final ParquetRecordReader<GenericRecord> parquetRecordReader;
+
+  public HoodieAvroParquetReader(InputSplit inputSplit, Configuration conf) throws IOException {
+    AvroReadSupport<GenericRecord> avroReadSupport = new AvroReadSupport<>();
+    // if exists read columns, we need to filter columns.
+    List<String> readColNames = Arrays.asList(HoodieColumnProjectionUtils.getReadColumnNames(conf));
+    if (!readColNames.isEmpty()) {
+      // get base schema
+      ParquetMetadata fileFooter =
+          ParquetFileReader.readFooter(conf, ((ParquetInputSplit) inputSplit).getPath(), ParquetMetadataConverter.NO_FILTER);
+      MessageType messageType = fileFooter.getFileMetaData().getSchema();
+      Schema baseSchema = new AvroSchemaConverter(conf).convert(messageType);
+      // filter schema for reading
+      final Schema filterSchema = Schema.createRecord(baseSchema.getName(),
+          baseSchema.getDoc(), baseSchema.getNamespace(), baseSchema.isError(),
+          baseSchema.getFields().stream()
+              .filter(f -> readColNames.contains(f.name()))
+              .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
+              .collect(Collectors.toList()));
+      avroReadSupport.setAvroReadSchema(conf, filterSchema);
+      avroReadSupport.setRequestedProjection(conf, filterSchema);
+    }
+    parquetRecordReader = new ParquetRecordReader<>(avroReadSupport, getFilter(conf));
+  }
+
+  @Override
+  public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
+    parquetRecordReader.initialize(split, context);
+  }
+
+  @Override
+  public boolean nextKeyValue() throws IOException, InterruptedException {
+    return parquetRecordReader.nextKeyValue();
+  }
+
+  @Override
+  public Void getCurrentKey() throws IOException, InterruptedException {
+    return parquetRecordReader.getCurrentKey();
+  }
+
+  @Override
+  public ArrayWritable getCurrentValue() throws IOException, InterruptedException {
+    GenericRecord record = parquetRecordReader.getCurrentValue();
+    return (ArrayWritable) HoodieRealtimeRecordReaderUtils.avroToArrayWritable(record, record.getSchema(), true);

Review Comment:
   @codope 
   If we use this class, it means that we need to use the timestamp type, so I 
directly set it to true here.
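
   A hedged variant (not part of the PR) that would derive the flag instead of hard-coding it, assuming the `Configuration` passed to the constructor were kept in a field named `conf`:
   ```java
   // Hypothetical: reuse the supportTimestamp(conf) helper reviewed earlier
   // in this thread instead of passing a literal true.
   GenericRecord record = parquetRecordReader.getCurrentValue();
   boolean supportTimestamp = HoodieColumnProjectionUtils.supportTimestamp(conf);
   return (ArrayWritable) HoodieRealtimeRecordReaderUtils.avroToArrayWritable(
       record, record.getSchema(), supportTimestamp);
   ```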






[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-06-13 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r895724346


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.avro;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hudi.hadoop.HoodieColumnProjectionUtils;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
+import org.apache.parquet.avro.AvroReadSupport;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetInputSplit;
+import org.apache.parquet.hadoop.ParquetRecordReader;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import static org.apache.parquet.hadoop.ParquetInputFormat.getFilter;
+
+public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
+
+  private final ParquetRecordReader<GenericRecord> parquetRecordReader;
+
+  public HoodieAvroParquetReader(InputSplit inputSplit, Configuration conf) throws IOException {
+    AvroReadSupport<GenericRecord> avroReadSupport = new AvroReadSupport<>();
+    // if exists read columns, we need to filter columns.
+    List<String> readColNames = Arrays.asList(HoodieColumnProjectionUtils.getReadColumnNames(conf));
+    if (!readColNames.isEmpty()) {
+      // get base schema
+      ParquetMetadata fileFooter =

Review Comment:
   @codope 
   Hi~ Your consideration is consistent with my original idea.
   This condition only occurs when there are query columns. Based on the `parquet-avro` methods, if there is a read schema, we need to manually specify the schema of the relevant fields to output data correctly. Otherwise, schema reading goes wrong; I found this problem when I checked the CI test failures, and that is why I did this.
   
   https://github.com/apache/parquet-mr/blob/a89df8f9932b6ef6633d06069e50c9b7970bebd1/parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java#L86-L139
   ```java
     @Override
     public ReadContext init(Configuration configuration,
         Map<String, String> keyValueMetaData,
         MessageType fileSchema) {
       MessageType projection = fileSchema;
       Map<String, String> metadata = new LinkedHashMap<String, String>();
   
       String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION);
       if (requestedProjectionString != null) {
         Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString);
         projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection);
       }
   
       String avroReadSchema = configuration.get(AVRO_READ_SCHEMA);
       if (avroReadSchema != null) {
         metadata.put(AVRO_READ_SCHEMA_METADATA_KEY, avroReadSchema);
       }
   
       if (configuration.getBoolean(AVRO_COMPATIBILITY, AVRO_DEFAULT_COMPATIBILITY)) {
         metadata.put(AVRO_COMPATIBILITY, "true");
       }
   
       return new ReadContext(projection, metadata);
     }
   
     @Override
     public RecordMaterializer<T> prepareForRead(
         Configuration configuration, Map<String, String> keyValueMetaData,
         MessageType fileSchema, ReadContext readContext) {
       Map<String, String> metadata = readContext.getReadSupportMetadata();
       MessageType parquetSchema = readContext.getRequestedSchema();
       Schema avroSchema;
   
       if (metadata.get(AVRO_READ_SCHEMA_METADATA_KEY) != null) {
         // use the Avro read schema provided by the user
         avroSchema = new Schema.Parser().parse(metadata.get(AVRO_READ_SCHEMA_METADATA_KEY));
       } else if
   ```

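   In short, the reader has to set both of the knobs this `AvroReadSupport` code consults; a minimal sketch (the `conf` and `filterSchema` names come from `HoodieAvroParquetReader` above, and the conf keys are parquet-avro's):
   ```java
   // Both settings must agree so that the column projection (which parquet
   // columns are read) and the materialization (which Avro schema records are
   // built with) stay in sync, per init()/prepareForRead() quoted above.
   AvroReadSupport.setRequestedProjection(conf, filterSchema); // sets "parquet.avro.projection"
   AvroReadSupport.setAvroReadSchema(conf, filterSchema);      // sets "parquet.avro.read.schema"
   ```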
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-05-27 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r883598776


##
hudi-common/src/main/java/org/apache/hudi/common/util/Option.java:
##
@@ -76,6 +76,10 @@ public static <T> Option<T> of(T value) {
     return new Option<>(value);
   }
 
+  public static <T> Option<T> ofNullable(Supplier<T> value) {
+    return null == value ? empty() : ofNullable(value.get());
+  }

Review Comment:
   @xushiyan It's my mistake. Suppliers are supported in Java 9+, so I wanted to add it, but I didn't think it through when defining the method.
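
   For illustration, a hypothetical usage of the proposed overload (the supplier variable here is made up):
   ```java
   import java.util.function.Supplier;

   // Hypothetical usage: the computation is deferred until ofNullable calls get().
   Supplier<String> lazyValue = () -> System.getProperty("user.name");
   Option<String> opt = Option.ofNullable(lazyValue);
   // A null supplier yields empty(); a supplier that produces null also yields
   // empty(), via the existing ofNullable(T) overload.
   ```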






[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-05-23 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r879378219


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java:
##
@@ -146,4 +154,74 @@ public static List<String> getIncrementalTableNames(JobContext job) {
   public static boolean isIncrementalUseDatabase(Configuration conf) {
     return conf.getBoolean(HOODIE_INCREMENTAL_USE_DATABASE, false);
   }
+
+  public static boolean SUPPORT_TIMESTAMP_WRITEABLE_V2;
+  private static Class<?> timestampClass = null;
+  private static Method setTimeInMillis = null;
+  private static Constructor<?> timestampWriteableV2constructor = null;
+
+  public static boolean SUPPORT_DATE_WRITEABLE_V2;
+  private static Constructor<?> dateWriteableV2constructor = null;
+
+  static {
+    // timestamp
+    try {
+      timestampClass = Class.forName("org.apache.hadoop.hive.common.type.Timestamp");
+      setTimeInMillis = timestampClass.getDeclaredMethod("setTimeInMillis", long.class);
+      Class<?> twV2Class = Class.forName("org.apache.hadoop.hive.serde2.io.TimestampWritableV2");
+      timestampWriteableV2constructor = twV2Class.getConstructor(timestampClass);

Review Comment:
   @xiarixiaoyao 
   My mistake, I will modify it right now
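
   For a sense of how these reflective handles would be used, a hypothetical helper (not the PR's code; it assumes Hive 3's `Timestamp` exposes a no-arg constructor):
   ```java
   // Hypothetical sketch: build a Hive 3 TimestampWritableV2 from epoch millis
   // using the reflective handles initialized in the static block above.
   public static Object toTimestampWritableV2(long epochMillis) throws ReflectiveOperationException {
     // assumption: org.apache.hadoop.hive.common.type.Timestamp has a no-arg constructor
     Object hiveTimestamp = timestampClass.getDeclaredConstructor().newInstance();
     setTimeInMillis.invoke(hiveTimestamp, epochMillis); // hiveTimestamp.setTimeInMillis(epochMillis)
     return timestampWriteableV2constructor.newInstance(hiveTimestamp);
   }
   ```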






[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-05-20 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r878161873


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HudiAvroParquetReader.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.avro;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hudi.hadoop.HoodieColumnProjectionUtils;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
+import org.apache.parquet.avro.AvroReadSupport;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetInputSplit;
+import org.apache.parquet.hadoop.ParquetRecordReader;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import static org.apache.parquet.hadoop.ParquetInputFormat.getFilter;
+
+public class HudiAvroParquetReader extends RecordReader<Void, ArrayWritable> {
+
+  private final ParquetRecordReader<GenericRecord> parquetRecordReader;
+
+  public HudiAvroParquetReader(InputSplit inputSplit, Configuration conf) throws IOException {
+    AvroReadSupport<GenericRecord> avroReadSupport = new AvroReadSupport<>();
+    // if exists read columns, we need to filter columns.
+    List<String> readColNames = Arrays.asList(HoodieColumnProjectionUtils.getReadColumnNames(conf));
+    if (!readColNames.isEmpty()) {
+      // get base schema
+      ParquetMetadata fileFooter =
+          ParquetFileReader.readFooter(conf, ((ParquetInputSplit) inputSplit).getPath(), ParquetMetadataConverter.NO_FILTER);
+      MessageType messageType = fileFooter.getFileMetaData().getSchema();
+      Schema baseSchema = new AvroSchemaConverter(conf).convert(messageType);
+      // filter schema for reading
+      final Schema filterSchema = Schema.createRecord(baseSchema.getName(),
+          baseSchema.getDoc(), baseSchema.getNamespace(), baseSchema.isError(),
+          baseSchema.getFields().stream()
+              .filter(f -> readColNames.contains(f.name()))
+              .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
+              .collect(Collectors.toList()));
+      avroReadSupport.setAvroReadSchema(conf, filterSchema);
+      avroReadSupport.setRequestedProjection(conf, filterSchema);
+    }
+    parquetRecordReader = new ParquetRecordReader<>(avroReadSupport, getFilter(conf));
+  }
+
+  @Override
+  public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
+    parquetRecordReader.initialize(split, context);
+  }
+
+  @Override
+  public boolean nextKeyValue() throws IOException, InterruptedException {
+    return parquetRecordReader.nextKeyValue();
+  }
+
+  @Override
+  public Void getCurrentKey() throws IOException, InterruptedException {

Review Comment:
   @xiarixiaoyao 
   
   I mainly referred to `org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper` in the `hive-exec` package.
   
   https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java#L47
   ```java
     private org.apache.hadoop.mapreduce.RecordReader<Void, ArrayWritable> realReader;
   ```
   
   It should be noted that the parquet reader actually only uses the value to process row data.
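   To illustrate the point, a hypothetical consumption loop (not code from the PR or from Hive):
   ```java
   // The key side of these readers carries no data: getCurrentKey() is typed
   // Void and always returns null; each row arrives via getCurrentValue().
   while (reader.nextKeyValue()) {
     ArrayWritable row = reader.getCurrentValue(); // the actual row payload
     // reader.getCurrentKey() would just return null here
     // ... process row ...
   }
   ```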





[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-05-20 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r878137673


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##
@@ -109,4 +109,25 @@ public static List<Pair<String, String>> getIOColumnNameAndTypes(Configuration co
         .collect(Collectors.toList());
   }
 
+  /**
+   * If the schema contains timestamp columns, this method is used for compatibility when there are no timestamp fields.
+   * We expect 3 cases to use the parquet-avro reader to read a timestamp column:
+   *  1. read columns contain timestamp type
+   *  2. no read columns, and existing original columns contain timestamp type
+   *  3. no read columns and no original columns, but the avro schema contains the type
+   */
+  public static boolean supportTimestamp(Configuration conf) {
+    List<String> reads = Arrays.asList(getReadColumnNames(conf));
+    if (reads.isEmpty()) {
+      return getIOColumnTypes(conf).contains("timestamp");
+    }
+    List<String> names = getIOColumns(conf);
+    if (names.isEmpty()) {
+      return true;

Review Comment:
   @xiarixiaoyao 
   My consideration is to avoid some Hive-related test cases failing because they do not explicitly declare these configurations. I have encountered similar problems on the Azure pipeline, so I added it. If it is empty, it defaults to true.






[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-05-15 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r873107531


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java:
##
@@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig {
 
   public static final ConfigProperty<String> HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty
       .key("hoodie.datasource.hive_sync.support_timestamp")
-      .defaultValue("false")
+      .defaultValue("true")
       .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. "
           + "Disabled by default for backward compatibility.");

Review Comment:
   > > @XuQianJin-Stars Can this show whether we can clearly explain the PR?
   > > 'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. Previous versions keep being disabled by default.
   > 
   > In the `deprecatedAfter` method write version 0.12.0 and change withDocumentation's content?
   
   @XuQianJin-Stars 
   Is this right? 
   ```java
     public static final ConfigProperty<String> HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty
         .key("hoodie.datasource.hive_sync.support_timestamp")
         .defaultValue("true")
         .deprecatedAfter("0.12.0")
         .withDocumentation("'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. "
             + "From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. "
             + "Previous versions keep being disabled by default.");
   ```
   If there's no problem, I'll change all the other descriptions.





[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-05-14 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r872982532


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java:
##
@@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig {
 
   public static final ConfigProperty<String> HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty
       .key("hoodie.datasource.hive_sync.support_timestamp")
-      .defaultValue("false")
+      .defaultValue("true")
       .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. "
           + "Disabled by default for backward compatibility.");

Review Comment:
   @XuQianJin-Stars 
   Can this show whether we can clearly explain the PR?
   
   'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. Previous versions keep being disabled by default.






[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-05-14 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r872981445


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java:
##
@@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig {
 
   public static final ConfigProperty<String> HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty
       .key("hoodie.datasource.hive_sync.support_timestamp")
-      .defaultValue("false")
+      .defaultValue("true")
       .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. "
           + "Disabled by default for backward compatibility.");

Review Comment:
   @lucasmo Sorry, I misunderstood what you meant. I will change this comment later ~






[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

2022-05-11 Thread GitBox


cdmikechen commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r870837005


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java:
##
@@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig {
 
   public static final ConfigProperty<String> HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty
       .key("hoodie.datasource.hive_sync.support_timestamp")
-      .defaultValue("false")
+      .defaultValue("true")
       .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. "
           + "Disabled by default for backward compatibility.");

Review Comment:
   @lucasmo If Hudi can support Hive reading the correct timestamp type, this configuration will be removed later. The plan here is to adapt step by step, so my thinking is to change it to true first.


