[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader
linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1339593212

## pom.xml: ##
@@ -2173,6 +2173,25 @@ true true + + + +maven-compiler-plugin + +

Review Comment:
Parquet library versions before 1.12.2 do not support index metadata such as the column index and bloom filters; even RowGroup does not exist in these versions. The metadata loader therefore does not work with versions below 1.12.2, so I excluded these lookup classes.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader
linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1339593212

## pom.xml: ##
@@ -2173,6 +2173,25 @@ true true + + + +maven-compiler-plugin + +

Review Comment:
Parquet library versions before 1.12.2 do not support index metadata such as the column index and bloom filters. Keyed lookup does not work with versions below 1.12.2, so we need to exclude these lookup classes.
[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader
linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1332040320

## pom.xml: ##
@@ -2508,6 +2584,25 @@ 1.5.6 1.11.1 +

Review Comment:
The Flink profiles from 1.13 through 1.17 all depend on Parquet 1.10.1, which does not support index metadata such as the column index. We have to exclude these classes from those profiles.
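The comments above settle the version gap at build time by excluding the lookup classes from the affected profiles. For comparison only, the same question ("does this Parquet version ship index metadata?") can be asked at runtime by probing for a class. The sketch below is not what the PR does: `ParquetFeatureProbe` and its methods are hypothetical names, and the only detail taken from the thread is the `ColumnIndex` class name that appears in the imports of the quoted diff.

```java
/**
 * Hypothetical helper, not part of the PR: checks whether a class is on the
 * classpath without initializing it.
 */
final class ParquetFeatureProbe {
  // Class name copied from the imports in the quoted HoodieParquetKeyedLookupReader diff.
  static final String COLUMN_INDEX_CLASS =
      "org.apache.parquet.internal.column.columnindex.ColumnIndex";

  private ParquetFeatureProbe() {
  }

  /** Returns true if the named class can be loaded by this class's loader. */
  static boolean isClassPresent(String className) {
    try {
      // initialize=false: only resolve the class, do not run its static initializers.
      Class.forName(className, false, ParquetFeatureProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Assumption: the presence of ColumnIndex is used as a proxy for index-metadata support. */
  static boolean supportsColumnIndex() {
    return isClassPresent(COLUMN_INDEX_CLASS);
  }
}
```

A runtime probe like this only helps callers decide whether to take the keyed-lookup path; classes that reference the missing Parquet types still cannot be compiled against the older library, which is why the thread excludes them in the build instead.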
[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader
linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1323804690

## hudi-common/src/main/java/org/apache/hudi/metadata/parquet/ParquetMetadataFileReader.java: ##
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.metadata.parquet;

Review Comment:
Yeah, it makes sense as well if we move them there. We put it here because we thought it reads the metadata table files.
[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader
linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1323802324

## hudi-common/src/main/java/org/apache/hudi/metadata/parquet/ByteBufferBackedInputStream.java: ##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.metadata.parquet;
+
+import org.apache.hadoop.fs.PositionedReadable;
+import org.apache.hadoop.fs.Seekable;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+
+/**
+ * Instance of {@link InputStream} backed by {@link ByteBuffer}, implementing following
+ * functionality (on top of what's required by {@link InputStream})
+ *
+ *
+ * Seeking: enables random access by allowing to seek to an arbitrary position w/in the stream
+ * (Thread-safe) Copying: enables to copy from the underlying buffer not modifying the state of the stream
+ *
+ *
+ * NOTE: Generally methods of this class are NOT thread-safe, unless specified otherwise

Review Comment:
What is UT?
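The Javadoc quoted in this comment names two capabilities on top of a plain `InputStream`: seeking to an arbitrary position, and copying from the underlying buffer without changing the stream's state. A minimal sketch of that idea follows. It is an illustration written for this thread, not the PR's `ByteBufferBackedInputStream`: the class name `SeekableByteBufferInputStream` and the methods `seek`, `getPos` and `copyFrom` are assumed names, and the Hadoop `Seekable` and `PositionedReadable` interfaces imported in the diff are deliberately left out so the example has no dependencies.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

/** Hypothetical illustration of a seekable, copy-friendly stream over a ByteBuffer. */
final class SeekableByteBufferInputStream extends InputStream {
  // The buffer's position doubles as the stream cursor.
  private final ByteBuffer buffer;

  SeekableByteBufferInputStream(byte[] bytes) {
    this.buffer = ByteBuffer.wrap(bytes);
  }

  @Override
  public int read() {
    // Mask to return an unsigned byte value, as InputStream requires.
    return buffer.hasRemaining() ? (buffer.get() & 0xFF) : -1;
  }

  @Override
  public int read(byte[] dst, int off, int len) {
    if (len == 0) {
      return 0;
    }
    if (!buffer.hasRemaining()) {
      return -1;
    }
    int n = Math.min(len, buffer.remaining());
    buffer.get(dst, off, n);
    return n;
  }

  @Override
  public int available() {
    return buffer.remaining();
  }

  /** Current cursor position, in bytes from the start of the buffer. */
  long getPos() {
    return buffer.position();
  }

  /** Seeking: moves the cursor to an arbitrary position within the buffer. */
  void seek(long pos) {
    checkPosition(pos);
    buffer.position((int) pos);
  }

  /**
   * Copying: reads up to len bytes starting at pos into dst without moving the
   * stream cursor. It works on a duplicate view, so the stream's own position
   * is never written.
   */
  int copyFrom(long pos, byte[] dst, int off, int len) {
    checkPosition(pos);
    ByteBuffer view = buffer.duplicate();
    view.position((int) pos);
    int n = Math.min(len, view.remaining());
    view.get(dst, off, n);
    return n;
  }

  private void checkPosition(long pos) {
    if (pos < 0 || pos > buffer.limit()) {
      throw new IllegalArgumentException("Position out of bounds: " + pos);
    }
  }
}
```

Routing the copy through `duplicate()` is the design choice that matters here: the duplicate shares the bytes but has its own position, so a positional read leaves sequential readers undisturbed. Whether that is enough for full thread safety depends on how the buffer is shared, which this sketch does not address.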
[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader
linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1323795649

## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieParquetKeyedLookupReader.java: ##
@@ -0,0 +1,318 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.metadata.parquet.ParquetFileMetadataLoader;
+import org.apache.hudi.metadata.parquet.RowGroup;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.parquet.HadoopReadOptions;
+import org.apache.parquet.bytes.ByteBufferInputStream;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.page.DataPageV1;
+import org.apache.parquet.column.values.ValuesReader;
+import org.apache.parquet.column.values.bloomfilter.BloomFilter;
+import org.apache.parquet.compression.CompressionCodecFactory;
+import org.apache.parquet.format.DataPageHeader;
+import org.apache.parquet.format.PageHeader;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.hadoop.util.CompressionConverter;
+import org.apache.parquet.hadoop.util.HadoopCodecs;
+import org.apache.parquet.internal.column.columnindex.ColumnIndex;
+import org.apache.parquet.internal.column.columnindex.OffsetIndex;
+import org.apache.parquet.io.InputFile;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveType;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Queue;
+import java.util.SortedSet;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.function.BiConsumer;
+
+import static org.apache.parquet.column.ValuesType.DEFINITION_LEVEL;
+import static org.apache.parquet.column.ValuesType.REPETITION_LEVEL;
+import static org.apache.parquet.column.ValuesType.VALUES;
+
+/**
+ * Implements an efficient lookup for a key in a Parquet file, by using page level statistics, and bloom filters.
+ * Parquet file is expected to have two columns :
+ * 1. `key` binary column, which is the column to be used for lookup
+ * 2. `value` binary column, which is the column to be returned as a result of lookup
+ *
+ * as well as being sorted by `key` column.
+ *
+ * Known limitations:
+ * 1) Does not do bloom filter based skipping of row groups.
+ */
+public class HoodieParquetKeyedLookupReader {
+  private static Logger LOG = LoggerFactory.getLogger(HoodieParquetKeyedLookupReader.class);
+  private static String KEY = "key";
+  private static String VALUE = "value";
+  private final InputFile parquetFile;
+  private final Configuration conf;
+  private final ParquetFileMetadataLoader metadataLoader;
+  private final CompressionCodecFactory codecFactory;
+  private final ParquetMetadataConverter converter;
+
+  public HoodieParquetKeyedLookupReader(Configuration conf, InputFile parquetFile) throws Exception {
+    this.conf = conf;
+    this.parquetFile = parquetFile;
+    this.metadataLoader = new ParquetFileMetadataLoader(
+        parquetFile, ParquetFileMetadataLoader.Options.builder().enableLoadBloomFilters().build());
+    this.codecFactory = HadoopCodecs.newFactory(0);
+    this.converter = new ParquetMetadataConverter();
+
+    metadataLoader.load();
+  }
+
+  public Map> lookup(SortedSet keys) throws Exception {
+    Map> keyToValue = new HashMap<>();
+    try (CompressionConverter.TransParquetFileReader reader = new CompressionConverter.TransParquetFileReader(
+        parquetFile, HadoopReadOptions.builder(conf).build())) {
+      Map matchingRecords = getMatchingRecords(reader, new LinkedList<>(keys));
+      for (String key: keys) {
+        if (matchingRecords.containsKey(key)) {
+          keyToValue.put(key,