[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader

2023-09-28 Thread via GitHub


linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1339593212


##
pom.xml:
##
@@ -2173,6 +2173,25 @@
 true
 true
   
+  
+
+  
+maven-compiler-plugin
+
+  

Review Comment:
   Parquet library versions before 1.12.2 do not support column indexes, e.g., 
bloom filters; even RowGroup does not exist in these versions. Therefore, the 
metadata loader does not work with versions < 1.12.2, so I excluded these 
lookup classes. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader

2023-09-28 Thread via GitHub


linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1339593212


##
pom.xml:
##
@@ -2173,6 +2173,25 @@
 true
 true
   
+  
+
+  
+maven-compiler-plugin
+
+  

Review Comment:
   Parquet library versions before 1.12.2 do not support column indexes, e.g., 
bloom filters. Keyed lookup does not work with versions < 1.12.2, so we 
need to exclude these lookup classes. 






[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader

2023-09-20 Thread via GitHub


linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1332040320


##
pom.xml:
##
@@ -2508,6 +2584,25 @@
 1.5.6
 1.11.1
   
+  

Review Comment:
   Flink 1.13 through 1.17 all depend on Parquet 1.10.1, which does not 
support index metadata such as the column index. We have to exclude these 
classes from these profiles.
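
To make the version cutoff in the pom.xml comments of this thread concrete, here is a minimal Java sketch. It is not code from the PR, which handles this at build time by excluding classes per Maven profile. The class `ParquetVersionGate` and the method `supportsIndexMetadata` are hypothetical names, the 1.12.2 threshold is taken from the comments as stated, and the parser assumes plain dotted numeric versions with no qualifier such as `-SNAPSHOT`.

```java
/** Hypothetical helper for illustration only; not part of Hudi or Parquet. */
class ParquetVersionGate {
  // Cutoff as claimed in the review comments (assumption, not verified here).
  static final int[] MIN_VERSION = {1, 12, 2};

  /** Returns true when the dotted numeric version is at or above the cutoff. */
  static boolean supportsIndexMetadata(String version) {
    String[] parts = version.split("\\.");
    for (int i = 0; i < MIN_VERSION.length; i++) {
      // Missing components are treated as 0, so "1.12" compares as 1.12.0.
      int part = i < parts.length ? Integer.parseInt(parts[i]) : 0;
      if (part != MIN_VERSION[i]) {
        return part > MIN_VERSION[i];
      }
    }
    return true; // All compared components equal the cutoff.
  }

  public static void main(String[] args) {
    // 1.10.1 is the Parquet version the Flink profiles are said to depend on.
    System.out.println(supportsIndexMetadata("1.10.1"));
    System.out.println(supportsIndexMetadata("1.12.2"));
  }
}
```

A runtime check like this only illustrates the comparison. Because the affected classes reference Parquet types that do not exist in older versions, they would still fail to compile there, which is presumably why the PR excludes them from compilation instead.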






[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader

2023-09-12 Thread via GitHub


linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1323804690


##
hudi-common/src/main/java/org/apache/hudi/metadata/parquet/ParquetMetadataFileReader.java:
##
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.metadata.parquet;

Review Comment:
   Yeah, it also makes sense to move them there. We put them here because 
we thought they read the metadata table files.






[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader

2023-09-12 Thread via GitHub


linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1323802324


##
hudi-common/src/main/java/org/apache/hudi/metadata/parquet/ByteBufferBackedInputStream.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.metadata.parquet;
+
+import org.apache.hadoop.fs.PositionedReadable;
+import org.apache.hadoop.fs.Seekable;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+
+/**
+ * Instance of {@link InputStream} backed by {@link ByteBuffer}, implementing following
+ * functionality (on top of what's required by {@link InputStream})
+ *
+ * 
+ *   Seeking: enables random access by allowing to seek to an arbitrary position w/in the stream
+ *   (Thread-safe) Copying: enables to copy from the underlying buffer not modifying the state of the stream
+ * 
+ *
+ * NOTE: Generally methods of this class are NOT thread-safe, unless specified otherwise

Review Comment:
   What is UT?
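
The Javadoc quoted above describes two capabilities on top of a plain InputStream: seeking, and copying that leaves the stream state untouched. A minimal sketch of that idea follows. It is not the PR's `ByteBufferBackedInputStream`; the class name `SeekableByteBufferStream` and the methods `seek`, `getPos`, and `copyFrom` are hypothetical, and the sketch omits the Hadoop `Seekable` and `PositionedReadable` interfaces that the PR imports.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

/** Hypothetical illustration of a seekable, ByteBuffer-backed stream. */
class SeekableByteBufferStream extends InputStream {
  // The buffer's position doubles as the stream cursor.
  private final ByteBuffer buffer;

  SeekableByteBufferStream(byte[] bytes) {
    this.buffer = ByteBuffer.wrap(bytes);
  }

  @Override
  public int read() {
    // Mask to 0..255 as the InputStream contract requires; -1 signals end of stream.
    return buffer.hasRemaining() ? (buffer.get() & 0xFF) : -1;
  }

  @Override
  public int read(byte[] dst, int off, int len) {
    if (len == 0) {
      return 0;
    }
    if (!buffer.hasRemaining()) {
      return -1;
    }
    int n = Math.min(len, buffer.remaining());
    buffer.get(dst, off, n);
    return n;
  }

  /** Current cursor position within the stream. */
  long getPos() {
    return buffer.position();
  }

  /** Moves the cursor to an absolute position (random access). */
  void seek(long pos) {
    if (pos < 0 || pos > buffer.limit()) {
      throw new IllegalArgumentException("Position out of range: " + pos);
    }
    buffer.position((int) pos);
  }

  /** Copies bytes starting at pos without moving the stream cursor. */
  int copyFrom(long pos, byte[] dst, int off, int len) {
    if (pos < 0 || pos > buffer.limit()) {
      throw new IllegalArgumentException("Position out of range: " + pos);
    }
    // A duplicate shares the bytes but has its own position and limit.
    ByteBuffer view = buffer.duplicate();
    view.position((int) pos);
    int n = Math.min(len, view.remaining());
    view.get(dst, off, n);
    return n;
  }

  public static void main(String[] args) {
    SeekableByteBufferStream in = new SeekableByteBufferStream(new byte[] {10, 20, 30, 40, 50});
    in.seek(3);
    System.out.println(in.read());
    System.out.println(in.copyFrom(0, new byte[2], 0, 2));
    System.out.println(in.getPos());
  }
}
```

Reading through a duplicate view is what keeps the copy from disturbing the cursor. Whether that is enough for the thread-safety the Javadoc mentions depends on how the real class guards concurrent `seek` and `read` calls, which this sketch does not address.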







[GitHub] [hudi] linliu-code commented on a diff in pull request #9564: [HUDI-6712] Add Parquet file metadata loader

2023-09-12 Thread via GitHub


linliu-code commented on code in PR #9564:
URL: https://github.com/apache/hudi/pull/9564#discussion_r1323795649


##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieParquetKeyedLookupReader.java:
##
@@ -0,0 +1,318 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.metadata.parquet.ParquetFileMetadataLoader;
+import org.apache.hudi.metadata.parquet.RowGroup;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.parquet.HadoopReadOptions;
+import org.apache.parquet.bytes.ByteBufferInputStream;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.page.DataPageV1;
+import org.apache.parquet.column.values.ValuesReader;
+import org.apache.parquet.column.values.bloomfilter.BloomFilter;
+import org.apache.parquet.compression.CompressionCodecFactory;
+import org.apache.parquet.format.DataPageHeader;
+import org.apache.parquet.format.PageHeader;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.hadoop.util.CompressionConverter;
+import org.apache.parquet.hadoop.util.HadoopCodecs;
+import org.apache.parquet.internal.column.columnindex.ColumnIndex;
+import org.apache.parquet.internal.column.columnindex.OffsetIndex;
+import org.apache.parquet.io.InputFile;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveType;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Queue;
+import java.util.SortedSet;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.function.BiConsumer;
+
+import static org.apache.parquet.column.ValuesType.DEFINITION_LEVEL;
+import static org.apache.parquet.column.ValuesType.REPETITION_LEVEL;
+import static org.apache.parquet.column.ValuesType.VALUES;
+
+/**
+ * Implements an efficient lookup for a key in a Parquet file, by using page level statistics, and bloom filters.
+ * Parquet file is expected to have two columns :
+ * 1. `key` binary column, which is the column to be used for lookup
+ * 2. `value` binary column, which is the column to be returned as a result of lookup
+ *
+ * as well as being sorted by `key` column.
+ *
+ * Known limitations:
+ * 1) Does not do bloom filter based skipping of row groups.
+ */
+public class HoodieParquetKeyedLookupReader {
+  private static Logger LOG = LoggerFactory.getLogger(HoodieParquetKeyedLookupReader.class);
+  private static String KEY = "key";
+  private static String VALUE = "value";
+  private final InputFile parquetFile;
+  private final Configuration conf;
+  private final ParquetFileMetadataLoader metadataLoader;
+  private final CompressionCodecFactory codecFactory;
+  private final ParquetMetadataConverter converter;
+
+  public HoodieParquetKeyedLookupReader(Configuration conf, InputFile parquetFile) throws Exception {
+    this.conf = conf;
+    this.parquetFile = parquetFile;
+    this.metadataLoader = new ParquetFileMetadataLoader(
+        parquetFile, ParquetFileMetadataLoader.Options.builder().enableLoadBloomFilters().build());
+    this.codecFactory = HadoopCodecs.newFactory(0);
+    this.converter = new ParquetMetadataConverter();
+
+    metadataLoader.load();
+  }
+
+  public Map> lookup(SortedSet keys) throws Exception {
+    Map> keyToValue = new HashMap<>();
+    try (CompressionConverter.TransParquetFileReader reader = new CompressionConverter.TransParquetFileReader(
+        parquetFile, HadoopReadOptions.builder(conf).build())) {
+      Map matchingRecords = getMatchingRecords(reader, new LinkedList<>(keys));
+      for (String key: keys) {
+        if (matchingRecords.containsKey(key)) {
+          keyToValue.put(key,