[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r1006287106 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java: ## @@ -109,4 +109,22 @@ public static List> getIOColumnNameAndTypes(Configuration co .collect(Collectors.toList()); } + /** + * if schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields + * We expect 3 cases to use parquet-avro reader to read timestamp column: + * 1. read columns contain timestamp type + * 2. no read columns and exists original columns contain timestamp type + * 3. no read columns and no original columns, but avro schema contains type + */ + public static boolean supportTimestamp(Configuration conf) { +List reads = Arrays.asList(getReadColumnNames(conf)); +if (reads.isEmpty()) { + return getIOColumnTypes(conf).contains("timestamp"); +} +List names = getIOColumns(conf); +List types = getIOColumnTypes(conf); +return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> reads.contains(names.get(i))) Review Comment: @xushiyan I was trying to think of a worst case scenario if I couldn't find the columns. I had encountered a similar problem with some of the test cases when running the azure test cases, so I've added this treatment here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r899667357 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java: ## @@ -76,8 +76,10 @@ public class HiveSyncConfig extends HoodieSyncConfig { @Parameter(names = {"--help", "-h"}, help = true) public Boolean help = false; - @Parameter(names = {"--support-timestamp"}, description = "'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type." - + "Disabled by default for backward compatibility.") + @Parameter(names = {"--support-timestamp"}, Review Comment: @xushiyan Hi~ I've created a new issue in jira https://issues.apache.org/jira/browse/HUDI-4274 Please see if the jira issue is OK. After this PR merged, I will plan to complete the test/fix in Presto/Trino before version 0.12 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r899666955 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetInputFormat.java: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.hadoop.avro; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.io.ArrayWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; +import org.apache.parquet.hadoop.ParquetInputFormat; +import org.apache.parquet.hadoop.util.ContextUtil; + +import java.io.IOException; + +public class HoodieAvroParquetInputFormat extends ParquetInputFormat { Review Comment: @codope Hi~ I've created a new issue in jira https://issues.apache.org/jira/browse/HUDI-4274 After this PR merged, I will complete the test/fix in Presto/Trino before version 0.12 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r899667357 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java: ## @@ -76,8 +76,10 @@ public class HiveSyncConfig extends HoodieSyncConfig { @Parameter(names = {"--help", "-h"}, help = true) public Boolean help = false; - @Parameter(names = {"--support-timestamp"}, description = "'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type." - + "Disabled by default for backward compatibility.") + @Parameter(names = {"--support-timestamp"}, Review Comment: @xushiyan Hi~ I've created a new issue in jira https://issues.apache.org/jira/browse/HUDI-4274 After this PR merged, I will complete the test/fix in Presto/Trino before version 0.12 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r899666955 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetInputFormat.java: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.hadoop.avro; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.io.ArrayWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; +import org.apache.parquet.hadoop.ParquetInputFormat; +import org.apache.parquet.hadoop.util.ContextUtil; + +import java.io.IOException; + +public class HoodieAvroParquetInputFormat extends ParquetInputFormat { Review Comment: @codope Hi~ I've created a new issue in jira https://issues.apache.org/jira/browse/HUDI-4274 After this PR is resolved, I will complete the test/fix in Presto/Trino before version 0.12 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r895736114 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetInputFormat.java: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.hadoop.avro; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.io.ArrayWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; +import org.apache.parquet.hadoop.ParquetInputFormat; +import org.apache.parquet.hadoop.util.ContextUtil; + +import java.io.IOException; + +public class HoodieAvroParquetInputFormat extends ParquetInputFormat { Review Comment: @codope The question you have considered is very good. I might have been thoughtless. This PR mainly focus on hive. It also tests and verifies that spark reads hive tables. I don't know much about presto/trino. I hope @XuQianJin-Stars can help verify this and give feedback on the follow-up issues/prs of more aspects of presto/trino. Thanks ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r895728574 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.hadoop.avro; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.io.ArrayWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; +import org.apache.hudi.hadoop.HoodieColumnProjectionUtils; +import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils; +import org.apache.parquet.avro.AvroReadSupport; +import org.apache.parquet.avro.AvroSchemaConverter; +import org.apache.parquet.format.converter.ParquetMetadataConverter; +import org.apache.parquet.hadoop.ParquetFileReader; +import org.apache.parquet.hadoop.ParquetInputSplit; +import org.apache.parquet.hadoop.ParquetRecordReader; +import org.apache.parquet.hadoop.metadata.ParquetMetadata; +import org.apache.parquet.schema.MessageType; + +import java.io.IOException; +import java.util.Arrays; +import java.util.List; +import java.util.stream.Collectors; + +import static org.apache.parquet.hadoop.ParquetInputFormat.getFilter; + +public class HoodieAvroParquetReader extends RecordReader { + + private final ParquetRecordReader parquetRecordReader; + + public HoodieAvroParquetReader(InputSplit inputSplit, Configuration conf) throws IOException { +AvroReadSupport avroReadSupport = new AvroReadSupport<>(); +// if exists read columns, we need to filter columns. +List readColNames = Arrays.asList(HoodieColumnProjectionUtils.getReadColumnNames(conf)); +if (!readColNames.isEmpty()) { + // get base schema + ParquetMetadata fileFooter = + ParquetFileReader.readFooter(conf, ((ParquetInputSplit) inputSplit).getPath(), ParquetMetadataConverter.NO_FILTER); + MessageType messageType = fileFooter.getFileMetaData().getSchema(); + Schema baseSchema = new AvroSchemaConverter(conf).convert(messageType); + // filter schema for reading + final Schema filterSchema = Schema.createRecord(baseSchema.getName(), + baseSchema.getDoc(), baseSchema.getNamespace(), baseSchema.isError(), + baseSchema.getFields().stream() + .filter(f -> readColNames.contains(f.name())) + .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal())) + .collect(Collectors.toList())); + avroReadSupport.setAvroReadSchema(conf, filterSchema); + avroReadSupport.setRequestedProjection(conf, filterSchema); +} +parquetRecordReader = new ParquetRecordReader<>(avroReadSupport, getFilter(conf)); + } + + @Override + public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException { +parquetRecordReader.initialize(split, context); + } + + @Override + public boolean nextKeyValue() throws IOException, InterruptedException { +return parquetRecordReader.nextKeyValue(); + } + + @Override + public Void getCurrentKey() throws IOException, InterruptedException { +return parquetRecordReader.getCurrentKey(); + } + + @Override + public ArrayWritable getCurrentValue() throws IOException, InterruptedException { +GenericRecord record = parquetRecordReader.getCurrentValue(); +return (ArrayWritable) HoodieRealtimeRecordReaderUtils.avroToArrayWritable(record, record.getSchema(), true); Review Comment: @codope If we use this class, it means that we need to use the timestamp type, so I directly set it to true here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r895724346 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.hadoop.avro; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.io.ArrayWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; +import org.apache.hudi.hadoop.HoodieColumnProjectionUtils; +import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils; +import org.apache.parquet.avro.AvroReadSupport; +import org.apache.parquet.avro.AvroSchemaConverter; +import org.apache.parquet.format.converter.ParquetMetadataConverter; +import org.apache.parquet.hadoop.ParquetFileReader; +import org.apache.parquet.hadoop.ParquetInputSplit; +import org.apache.parquet.hadoop.ParquetRecordReader; +import org.apache.parquet.hadoop.metadata.ParquetMetadata; +import org.apache.parquet.schema.MessageType; + +import java.io.IOException; +import java.util.Arrays; +import java.util.List; +import java.util.stream.Collectors; + +import static org.apache.parquet.hadoop.ParquetInputFormat.getFilter; + +public class HoodieAvroParquetReader extends RecordReader { + + private final ParquetRecordReader parquetRecordReader; + + public HoodieAvroParquetReader(InputSplit inputSplit, Configuration conf) throws IOException { +AvroReadSupport avroReadSupport = new AvroReadSupport<>(); +// if exists read columns, we need to filter columns. +List readColNames = Arrays.asList(HoodieColumnProjectionUtils.getReadColumnNames(conf)); +if (!readColNames.isEmpty()) { + // get base schema + ParquetMetadata fileFooter = Review Comment: @codope Hi~. Your consideration is consistent with my original idea. This condition only occurs when there are query columns. Based on the `parquet-avro` methods, if there is a read schema, we need to manually specify the schema of the relevant fields to correctly output data. Otherwise, schema reading will wrong, I found this problem when I checked the test exception of Ci, and then I did this. https://github.com/apache/parquet-mr/blob/a89df8f9932b6ef6633d06069e50c9b7970bebd1/parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java#L86-L139 ```java @Override public ReadContext init(Configuration configuration, Map keyValueMetaData, MessageType fileSchema) { MessageType projection = fileSchema; Map metadata = new LinkedHashMap(); String requestedProjectionString = configuration.get(AVRO_REQUESTED_PROJECTION); if (requestedProjectionString != null) { Schema avroRequestedProjection = new Schema.Parser().parse(requestedProjectionString); projection = new AvroSchemaConverter(configuration).convert(avroRequestedProjection); } String avroReadSchema = configuration.get(AVRO_READ_SCHEMA); if (avroReadSchema != null) { metadata.put(AVRO_READ_SCHEMA_METADATA_KEY, avroReadSchema); } if (configuration.getBoolean(AVRO_COMPATIBILITY, AVRO_DEFAULT_COMPATIBILITY)) { metadata.put(AVRO_COMPATIBILITY, "true"); } return new ReadContext(projection, metadata); } @Override public RecordMaterializer prepareForRead( Configuration configuration, Map keyValueMetaData, MessageType fileSchema, ReadContext readContext) { Map metadata = readContext.getReadSupportMetadata(); MessageType parquetSchema = readContext.getRequestedSchema(); Schema avroSchema; if (metadata.get(AVRO_READ_SCHEMA_METADATA_KEY) != null) { // use the Avro read schema provided by the user avroSchema = new Schema.Parser().parse(metadata.get(AVRO_READ_SCHEMA_METADATA_KEY)); } else if
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r883598776 ## hudi-common/src/main/java/org/apache/hudi/common/util/Option.java: ## @@ -76,6 +76,10 @@ public static Option of(T value) { return new Option<>(value); } + public static Option ofNullable(Supplier value) { +return null == value ? empty() : ofNullable(value.get()); + } Review Comment: @xushiyan It's my mistake. Suppliers are supported in java9+, so I want to add it, but I lack consideration when defining methods. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r879378219 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java: ## @@ -146,4 +154,74 @@ public static List getIncrementalTableNames(JobContext job) { public static boolean isIncrementalUseDatabase(Configuration conf) { return conf.getBoolean(HOODIE_INCREMENTAL_USE_DATABASE, false); } + + public static boolean SUPPORT_TIMESTAMP_WRITEABLE_V2; + private static Class timestampClass = null; + private static Method setTimeInMillis = null; + private static Constructor timestampWriteableV2constructor = null; + + public static boolean SUPPORT_DATE_WRITEABLE_V2; + private static Constructor dateWriteableV2constructor = null; + + static { +// timestamp +try { + timestampClass = Class.forName("org.apache.hadoop.hive.common.type.Timestamp"); + setTimeInMillis = timestampClass.getDeclaredMethod("setTimeInMillis", long.class); + Class twV2Class = Class.forName("org.apache.hadoop.hive.serde2.io.TimestampWritableV2"); + timestampWriteableV2constructor = twV2Class.getConstructor(timestampClass); Review Comment: @xiarixiaoyao My mistake, I will modify it right now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r878161873 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HudiAvroParquetReader.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.hadoop.avro; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.io.ArrayWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; +import org.apache.hudi.hadoop.HoodieColumnProjectionUtils; +import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils; +import org.apache.parquet.avro.AvroReadSupport; +import org.apache.parquet.avro.AvroSchemaConverter; +import org.apache.parquet.format.converter.ParquetMetadataConverter; +import org.apache.parquet.hadoop.ParquetFileReader; +import org.apache.parquet.hadoop.ParquetInputSplit; +import org.apache.parquet.hadoop.ParquetRecordReader; +import org.apache.parquet.hadoop.metadata.ParquetMetadata; +import org.apache.parquet.schema.MessageType; + +import java.io.IOException; +import java.util.Arrays; +import java.util.List; +import java.util.stream.Collectors; + +import static org.apache.parquet.hadoop.ParquetInputFormat.getFilter; + +public class HudiAvroParquetReader extends RecordReader { + + private final ParquetRecordReader parquetRecordReader; + + public HudiAvroParquetReader(InputSplit inputSplit, Configuration conf) throws IOException { +AvroReadSupport avroReadSupport = new AvroReadSupport<>(); +// if exists read columns, we need to filter columns. +List readColNames = Arrays.asList(HoodieColumnProjectionUtils.getReadColumnNames(conf)); +if (!readColNames.isEmpty()) { + // get base schema + ParquetMetadata fileFooter = + ParquetFileReader.readFooter(conf, ((ParquetInputSplit) inputSplit).getPath(), ParquetMetadataConverter.NO_FILTER); + MessageType messageType = fileFooter.getFileMetaData().getSchema(); + Schema baseSchema = new AvroSchemaConverter(conf).convert(messageType); + // filter schema for reading + final Schema filterSchema = Schema.createRecord(baseSchema.getName(), + baseSchema.getDoc(), baseSchema.getNamespace(), baseSchema.isError(), + baseSchema.getFields().stream() + .filter(f -> readColNames.contains(f.name())) + .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal())) + .collect(Collectors.toList())); + avroReadSupport.setAvroReadSchema(conf, filterSchema); + avroReadSupport.setRequestedProjection(conf, filterSchema); +} +parquetRecordReader = new ParquetRecordReader<>(avroReadSupport, getFilter(conf)); + } + + @Override + public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException { +parquetRecordReader.initialize(split, context); + } + + @Override + public boolean nextKeyValue() throws IOException, InterruptedException { +return parquetRecordReader.nextKeyValue(); + } + + @Override + public Void getCurrentKey() throws IOException, InterruptedException { Review Comment: @xiarixiaoyao I mainly refer to `org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper` in `hive-exec` package. https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java#L47 ```java private org.apache.hadoop.mapreduce.RecordReader realReader; ``` It should be declared that paruqet reader actually only uses value to process row data. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r878137673 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java: ## @@ -109,4 +109,25 @@ public static List> getIOColumnNameAndTypes(Configuration co .collect(Collectors.toList()); } + /** + * if schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields + * We expect 3 cases to use parquet-avro reader to read timestamp column: + * 1. read columns contain timestamp type + * 2. no read columns and exists original columns contain timestamp type + * 3. no read columns and no original columns, but avro schema contains type + */ + public static boolean supportTimestamp(Configuration conf) { +List reads = Arrays.asList(getReadColumnNames(conf)); +if (reads.isEmpty()) { + return getIOColumnTypes(conf).contains("timestamp"); +} +List names = getIOColumns(conf); +if (names.isEmpty()) { + return true; Review Comment: @xiarixiaoyao My consideration is to avoid that some hive related test cases do not explicitly declare these configurations, resulting in test failure. I have encountered some similar problems on azure pipeline, so I add it. If it is empty, it defaults to true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r873107531 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java: ## @@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig { public static final ConfigProperty HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty .key("hoodie.datasource.hive_sync.support_timestamp") - .defaultValue("false") + .defaultValue("true") .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. " + "Disabled by default for backward compatibility."); Review Comment: > > @XuQianJin-Stars Can this shows whether we can clearly explain the PR? > > 'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. Previous versions keep being disabled by default. > > In `deprecatedAfter` method write version 0.12.0 and change withDocumentation‘s content? @XuQianJin-Stars Is this right? ```java public static final ConfigProperty HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty .key("hoodie.datasource.hive_sync.support_timestamp") .defaultValue("true") .deprecatedAfter("0.12.0") .withDocumentation("'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. " + "From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. " + "Previous versions keep being disabled by default."); ``` If there's no problem, I'll change all the other descriptions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r873107531 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java: ## @@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig { public static final ConfigProperty HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty .key("hoodie.datasource.hive_sync.support_timestamp") - .defaultValue("false") + .defaultValue("true") .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. " + "Disabled by default for backward compatibility."); Review Comment: > > @XuQianJin-Stars Can this shows whether we can clearly explain the PR? > > 'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. Previous versions keep being disabled by default. > > In `deprecatedAfter` method write version 0.12.0 and change withDocumentation‘s content? Is this right? ```java public static final ConfigProperty HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty .key("hoodie.datasource.hive_sync.support_timestamp") .defaultValue("true") .deprecatedAfter("0.12.0") .withDocumentation("'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. " + "From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. " + "Previous versions keep being disabled by default."); ``` If there's no problem, I'll change all the other descriptions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r872982532 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java: ## @@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig { public static final ConfigProperty HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty .key("hoodie.datasource.hive_sync.support_timestamp") - .defaultValue("false") + .defaultValue("true") .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. " + "Disabled by default for backward compatibility."); Review Comment: @XuQianJin-Stars Can this shows whether we can clearly explain the PR? 'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. Previous versions keep being disabled by default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r872981445 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java: ## @@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig { public static final ConfigProperty HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty .key("hoodie.datasource.hive_sync.support_timestamp") - .defaultValue("false") + .defaultValue("true") .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. " + "Disabled by default for backward compatibility."); Review Comment: @lucasmo Sorry I misunderstood what you mean. I will change this comment later ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r872981445 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java: ## @@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig { public static final ConfigProperty HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty .key("hoodie.datasource.hive_sync.support_timestamp") - .defaultValue("false") + .defaultValue("true") .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. " + "Disabled by default for backward compatibility."); Review Comment: @lucasmo Sorry I misunderstood what you mean. I've revised this later ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
cdmikechen commented on code in PR #3391: URL: https://github.com/apache/hudi/pull/3391#discussion_r870837005 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java: ## @@ -160,7 +160,7 @@ public class HiveSyncConfig extends HoodieSyncConfig { public static final ConfigProperty HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty .key("hoodie.datasource.hive_sync.support_timestamp") - .defaultValue("false") + .defaultValue("true") .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. " + "Disabled by default for backward compatibility."); Review Comment: @lucasmo If hudi can support hive to read the correct timestamp type, this configuration will be removed later. The plan here is to adapt step by step, so my consideration is to change it to true first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org