[jira] [Resolved] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-758.
    Resolution: Fixed

> [Format] HALF precision FLOAT Logical type
> ------------------------------------------
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Julien Le Dem
> Assignee: Anja Boskovic
> Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Assigned] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky reassigned PARQUET-758:
    Assignee: Anja Boskovic
[jira] [Commented] (PARQUET-2340) appendRowGroup will lose pageIndex
[ https://issues.apache.org/jira/browse/PARQUET-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757396#comment-17757396 ]

Gabor Szadovszky commented on PARQUET-2340:
-------------------------------------------

[~NathanKan], I don't think these methods are used anymore. {{parquet-cli}} has another concept for merging files, and that supports column indexes AFAIK. [~wgtmac], could you confirm this? Maybe we can close this jira?

> appendRowGroup will lose pageIndex
> ----------------------------------
>
> Key: PARQUET-2340
> URL: https://issues.apache.org/jira/browse/PARQUET-2340
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-mr
> Reporter: GANHONGNAN
> Priority: Major
>
> Currently,
> org.apache.parquet.hadoop.ParquetFileWriter#appendFile(org.apache.parquet.io.InputFile)
> uses the appendRowGroup method to concatenate Parquet row groups. However, the
> appendRowGroup method *loses* the column index.
> {code:java}
> // code placeholder
> public void appendRowGroup(SeekableInputStream from, BlockMetaData rowGroup,
>     boolean dropColumns) throws IOException {
>
>   // TODO: column/offset indexes are not copied
>   // (it would require seeking to the end of the file for each row group)
>   currentColumnIndexes.add(null);
>   currentOffsetIndexes.add(null);
> }
> {code}
>
> https://github.com/apache/parquet-mr/blob/f8465a274b42e0a96996c76f3be0b50cf85ecf15/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1033C19-L1033C19
>
> Looking forward to functionality that supports append with page index.
[jira] [Resolved] (PARQUET-2318) Implement a tool to list page headers
[ https://issues.apache.org/jira/browse/PARQUET-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-2318.
    Resolution: Fixed

> Implement a tool to list page headers
> -------------------------------------
>
> Key: PARQUET-2318
> URL: https://issues.apache.org/jira/browse/PARQUET-2318
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cli
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
>
> We need a tool that lists the page headers in a Parquet file for debugging
> purposes.
[jira] [Created] (PARQUET-2318) Implement a tool to list page headers
Gabor Szadovszky created PARQUET-2318:
    Summary: Implement a tool to list page headers
    Key: PARQUET-2318
    URL: https://issues.apache.org/jira/browse/PARQUET-2318
    Project: Parquet
    Issue Type: New Feature
    Components: parquet-cli
    Reporter: Gabor Szadovszky
    Assignee: Gabor Szadovszky

We need a tool that lists the page headers in a Parquet file for debugging purposes.
[jira] [Commented] (PARQUET-2317) parquet-format and parquet-format-structures defines Util with inconsistent methods provided
[ https://issues.apache.org/jira/browse/PARQUET-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736990#comment-17736990 ]

Gabor Szadovszky commented on PARQUET-2317:
-------------------------------------------

[~wgtmac], let me summarize the history of this. parquet-format contains all the specification docs and parquet.thrift itself, which is a kind of source code and spec at the same time. It is good to have all of these separated from the implementations. Meanwhile, since the thrift file is there, it was natural to have the Thrift code generation and Util there as well. But it was not a good choice, since we only had the Java code there. For some new features we had to extend Util, which is clearly related to parquet-mr. So, we decided to deprecate all of the Java-related stuff in parquet-format and moved it to parquet-format-structures under parquet-mr. It would therefore be good to remove not only Util but all the other Java classes, including the Thrift-generated ones, from the jar. The catch is that we still need some mechanism that validates the thrift file so we won't add invalid changes. Also, the distribution should be changed, because providing a jar file without Java classes would not make sense. I think we should release a tarball instead that contains all the specs and the thrift file as well. Of course, we would need to update parquet-mr (and maybe other affected implementations) to download that tarball instead of the jar file.

> parquet-format and parquet-format-structures defines Util with inconsistent
> methods provided
> ---------------------------------------------------------------------------
>
> Key: PARQUET-2317
> URL: https://issues.apache.org/jira/browse/PARQUET-2317
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Affects Versions: 1.12.0, 1.13.0
> Reporter: Joey Pereira
> Priority: Major
>
> I have been running into a bug due to {{parquet-format}} and
> {{parquet-format-structures}} both defining the
> {{org.apache.parquet.format.Util}} class but doing so inconsistently.
> Examples of this are several methods which include a {{BlockCipher}}
> parameter that are defined in {{parquet-format-structures}} but not in
> {{parquet-format}}. While invoking code that happens to use these, such
> as {{org.apache.parquet.hadoop.ParquetFileReader.readFooter}}, the code
> will fail if {{parquet-format}} happens to be loaded first on the
> classpath.
> Here is an example stack trace for a Scala Spark application.
> {code:java}
> Caused by: java.lang.NoSuchMethodError: 'org.apache.parquet.format.FileMetaData org.apache.parquet.format.Util.readFileMetaData(java.io.InputStream, org.apache.parquet.format.BlockCipher$Decryptor, byte[])'
>   at org.apache.parquet.format.converter.ParquetMetadataConverter$3.visit(ParquetMetadataConverter.java:1441) ~[parquet_hadoop.jar:1.13.1]
>   at org.apache.parquet.format.converter.ParquetMetadataConverter$3.visit(ParquetMetadataConverter.java:1438) ~[parquet_hadoop.jar:1.13.1]
>   at org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:1173) ~[parquet_hadoop.jar:1.13.1]
>   at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1438) ~[parquet_hadoop.jar:1.13.1]
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:591) ~[parquet_hadoop.jar:1.13.1]
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:536) ~[parquet_hadoop.jar:1.13.1]
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:530) ~[parquet_hadoop.jar:1.13.1]
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:478) ~[parquet_hadoop.jar:1.13.1]
>   ...
> (my application code invoking the above)
> {code}
> Because of issues external to Parquet that I have yet to figure out (a
> complex Spark and dependency setup), my classpaths are not deterministically
> ordered and I am unable to pin {{parquet-format-structures}} ahead, hence
> why I'm chiming in about this.
> Even if that weren't the case, this is a fairly prickly edge to run into, as
> both modules define overlapping classes. {{Util}} is not the only class that
> appears to be defined by both, just the one I have been focusing on due to this
> bug.
> It appears these methods were introduced in at least 1.12:
> https://github.com/apache/parquet-mr/commit/65b95fb72be8f5a8a193a6f7bc4560fdcd742fc7#diff-852341c99dcae06c8fa2b764bcf3d9e6860e40442d0ab1cf5b935df80a9cacb7
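When debugging this kind of classpath-ordering conflict, it can help to print where a class was actually loaded from. The sketch below is illustrative only (the class and method names are made up for this example); for the real investigation you would pass `org.apache.parquet.format.Util` to `locationOf`, which is not on this example's classpath.

```java
public class WhichJar {
    // Resolve the .class resource of a class to see which jar or
    // directory the JVM actually loaded it from.
    static String locationOf(Class<?> cls) {
        java.net.URL url = cls.getResource(cls.getSimpleName() + ".class");
        return url == null ? "(unknown)" : url.toString();
    }

    public static void main(String[] args) {
        // In the Spark scenario above you would print
        // locationOf(org.apache.parquet.format.Util.class) instead.
        System.out.println(locationOf(WhichJar.class));
    }
}
```

Running this inside the failing application shows whether `Util` resolved to the parquet-format or the parquet-format-structures artifact.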
[jira] [Comment Edited] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages
[ https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730988#comment-17730988 ]

Gabor Szadovszky edited comment on PARQUET-2222 at 6/9/23 2:40 PM:
-------------------------------------------------------------------

[~mwish], -This is specifically about BOOLEAN values (data pages), not rl/dl. (In parquet-mr we write rl/dl and dictionary indices using RLE for both v1 and v2 settings.)- Sorry, misread your comment. So parquet-cpp does not write BOOLEAN data pages using RLE in any case?

was (Author: gszadovszky): [~mwish], This is specifically about BOOLEAN values (data pages), not rl/dl. (In parquet-mr we write rl/dl and dictionary indices using RLE for both v1 and v2 settings.)

> [Format] RLE encoding spec incorrect for v2 data pages
> ------------------------------------------------------
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Antoine Pitrou
> Assignee: Gang Wu
> Priority: Critical
> Fix For: format-2.10.0
>
> The spec
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
> has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data
> pages.
[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages
[ https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730988#comment-17730988 ]

Gabor Szadovszky commented on PARQUET-2222:
-------------------------------------------

[~mwish], This is specifically about BOOLEAN values (data pages), not rl/dl. (In parquet-mr we write rl/dl and dictionary indices using RLE for both v1 and v2 settings.)
[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages
[ https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730904#comment-17730904 ]

Gabor Szadovszky commented on PARQUET-2222:
-------------------------------------------

[~apitrou], [~wgtmac], It seems my review was not deep enough, sorry for that. So, parquet-mr does not use RLE encoding for boolean values in case of V1, only bit packing:
* [V1|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L53] -> ... -> [Bit packing|https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPackingValuesWriter.java] (encoding written to the page header: PLAIN)
* [V2|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L57] -> ... -> [RLE|https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridValuesWriter.java] (encoding written to the page header: RLE)

[~apitrou], could you please confirm that the same holds for parquet-cpp? So the table we added in this PR about prepending the length is misleading. Also, the link in the PLAIN encoding for boolean is dead and misleading; it should point to BIT_PACKED. In the definition of BIT_PACKED it is also wrongly stated that it is valid only for RL/DL. I think the deprecation is valid, since the "BIT_PACKED" encoding should not be written anywhere, but the actual encoding is used under PLAIN for boolean. Would you guys like to work on this? We probably want to add this to the current format release.
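The v1/v2 difference discussed in this thread (the 4-byte length prefix existing only in v1 data pages) can be sketched as follows. This is an illustrative writer for the prefix only, not parquet-mr's actual code; the class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class RleLengthPrefix {
    // v1 data page: the spec's <length> is prepended to the RLE/bit-packed
    // <encoded-data> as a 4-byte little-endian unsigned int32.
    static byte[] wrapForV1(byte[] encodedData) {
        ByteBuffer out = ByteBuffer.allocate(4 + encodedData.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        out.putInt(encodedData.length);
        out.put(encodedData);
        return out.array();
    }

    // v2 data page: the encoded bytes are stored as-is, because the page
    // header already carries the section sizes, so no prefix is written.
    static byte[] wrapForV2(byte[] encodedData) {
        return encodedData;
    }
}
```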
[jira] [Assigned] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages
[ https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky reassigned PARQUET-2222:
    Assignee: Gang Wu
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17729630#comment-17729630 ]

Gabor Szadovszky commented on PARQUET-758:
------------------------------------------

Thanks for your reply, [~anjakefala]! I mentioned {{bfloat16}} only because of the ease of converting it back and forth to the java/c++ {{float}}, which we will probably need to implement for {{IEEE Float16}} as well. But I agree, we should not block the format release because of additional discussions about this additional topic.
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17729214#comment-17729214 ]

Gabor Szadovszky commented on PARQUET-758:
------------------------------------------

Hey everyone who is interested in the half-float type: when I reviewed the format change it was obvious to me to use the "2-byte IEEE little-endian format". Now I've come across another approach to encoding 2-byte FP numbers: [bfloat16|https://en.wikipedia.org/wiki/Bfloat16_floating-point_format]. Since neither Java nor C++ supports 2-byte FP numbers natively, we probably need to convert the encoded numbers to {{float}}. For {{bfloat16}} it would be more performant to do so. It might be worth adding {{bfloat16}} to the format as well and adding implementations for it in the same round. WDYT?
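The performance argument above comes from the bit layout: a bfloat16 is simply the upper 16 bits of an IEEE float32, so widening it is a single shift, while IEEE float16 needs its exponent rebased. A minimal sketch of the bfloat16 case (illustrative only, not a Parquet API; the class name is made up):

```java
public class Bfloat16 {
    // Widen a bfloat16 (stored in the low 16 bits of an int) to float:
    // the 16 bits become the upper half of a float32 bit pattern.
    static float toFloat(int bf16) {
        return Float.intBitsToFloat((bf16 & 0xFFFF) << 16);
    }

    // Narrow a float to bfloat16 by truncation. Real implementations
    // usually round-to-nearest-even instead of truncating.
    static int fromFloat(float f) {
        return (Float.floatToRawIntBits(f) >>> 16) & 0xFFFF;
    }
}
```

An IEEE float16, by contrast, has a 5-bit exponent with a different bias (15 vs 127), so its conversion to float cannot be a plain shift.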
[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5
[ https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713635#comment-17713635 ]

Gabor Szadovszky commented on PARQUET-2276:
-------------------------------------------

I think it is fine to drop support for older systems from time to time. It is unfortunate, though, that it was not properly advertised in PARQUET-2158 that we did not simply upgrade the hadoop version in our build but made it incompatible with hadoop2. Meanwhile, I think it is fine to re-add support for hadoop2 if it is practically feasible and won't break the hadoop3 support.

> ParquetReader reads do not work with Hadoop version 2.8.5
> ---------------------------------------------------------
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.13.0
> Reporter: Atul Mohan
> Priority: Major
>
> {{ParquetReader.read()}} fails with the following exception on parquet-mr
> version 1.13.0 when using hadoop version 2.8.5:
> {code:java}
> java.lang.NoSuchMethodError: 'boolean org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)'
>   at org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>   at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49)
>   at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>   at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>   at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657)
>   at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>
> From an initial investigation, it looks like HadoopStreams has started using
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
> but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].
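One way to stay compatible with Hadoop versions that predate the method is to probe for it via reflection instead of linking against it directly. This is an illustrative sketch of that general technique, not the actual parquet-mr fix; the class and method names are made up.

```java
import java.lang.reflect.Method;

public class CapabilityProbe {
    // Returns stream.hasCapability(cap) when the method exists on the
    // runtime class (newer Hadoop), and false when it is missing
    // (e.g. FSDataInputStream on Hadoop 2.8.x), instead of letting the
    // JVM throw NoSuchMethodError at the call site.
    static boolean hasCapability(Object stream, String cap) {
        try {
            Method m = stream.getClass().getMethod("hasCapability", String.class);
            return (Boolean) m.invoke(stream, cap);
        } catch (ReflectiveOperationException e) {
            // Method missing or inaccessible: conservatively report
            // that the capability is not supported.
            return false;
        }
    }
}
```

The trade-off is a small reflection cost on the first call, which is usually acceptable for a once-per-stream check.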
[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter
[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701575#comment-17701575 ]

Gabor Szadovszky commented on PARQUET-2256:
-------------------------------------------

[~mwish], would you mind doing some investigation before this update? Let's get the binary data of the mentioned 2M bloom filter and compress it with some codecs to see the gain. If the ratio is good, it might be worth adding this feature. It is also worth mentioning that compressing the bloom filter might hurt filtering performance.

> Adding Compression for BloomFilter
> ----------------------------------
>
> Key: PARQUET-2256
> URL: https://issues.apache.org/jira/browse/PARQUET-2256
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Affects Versions: format-2.9.0
> Reporter: Xuwei Fu
> Assignee: Xuwei Fu
> Priority: Major
>
> In current Parquet implementations, if the BloomFilter doesn't set the ndv, most
> implementations will guess 1M as the ndv and use it for the fpp. So, if the fpp is
> 0.01, the BloomFilter size may grow to 2M for each column, which is really
> huge. Should we support compression for the BloomFilter, like:
>
> ```
> /** The compression used in the Bloom filter. **/
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> + 2: CompressionCodec COMPRESSION;
> }
> ```
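The suggested experiment can be sketched as below. Note that a well-sized bloom filter's bits are close to random and barely compress, while an oversized, sparsely populated filter (the 2M-from-a-guessed-ndv case above) is mostly zeros and compresses very well. Illustrative only: Deflate stands in here for whichever Parquet codec would actually be tested, and the class name is made up.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class BloomCompressionProbe {
    // Deflate-compress a buffer and return the compressed size in bytes.
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        // Oversized filter: almost all bits are still zero.
        byte[] sparse = new byte[2 * 1024 * 1024];
        // Well-filled filter: bits are close to random.
        byte[] dense = new byte[2 * 1024 * 1024];
        new java.util.Random(42).nextBytes(dense);
        System.out.println("sparse: " + compressedSize(sparse) + " bytes");
        System.out.println("dense:  " + compressedSize(dense) + " bytes");
    }
}
```

The gap between the two numbers is the whole question of this jira: compression only pays off for filters that were over-allocated in the first place.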
[jira] [Assigned] (PARQUET-2256) Adding Compression for BloomFilter
[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky reassigned PARQUET-2256:
    Assignee: Xuwei Fu
[jira] [Commented] (PARQUET-2258) Storing toString fields in FilterPredicate instances can lead to memory pressure
[ https://issues.apache.org/jira/browse/PARQUET-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701568#comment-17701568 ]

Gabor Szadovszky commented on PARQUET-2258:
-------------------------------------------

Thanks for fixing this, [~abstractdog]! As far as I understood, this is not a serious issue, so I don't think we need to include it in a patch release. If you agree, please update the version number to {{1.13.0}}. (I usually don't bother selecting version numbers for changes targeted at {{master}}. We'll set them in a bulk update based on the changelog.)

> Storing toString fields in FilterPredicate instances can lead to memory
> pressure
> -----------------------------------------------------------------------
>
> Key: PARQUET-2258
> URL: https://issues.apache.org/jira/browse/PARQUET-2258
> Project: Parquet
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Fix For: 1.12.3
>
> Attachments: Parquet_Predicate_toString_memory.png,
> image-2023-03-14-13-27-54-008.png
>
> It happens with Hive (HiveServer2): a certain number of predicate instances
> can make HiveServer2 OOM. According to the heap dump and background
> information, the predicates must have been simplified a bit, but still,
> storing toString in the objects looks very weird.
[jira] [Commented] (PARQUET-1690) Integer Overflow of BinaryStatistics#isSmallerThan()
[ https://issues.apache.org/jira/browse/PARQUET-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701561#comment-17701561 ]

Gabor Szadovszky commented on PARQUET-1690:
-------------------------------------------

[~humanoid], I don't know/remember the background of this issue and the closed PRs. I think it would be best to start over with a new PR. [~sha...@uber.com], do you remember why the last PR was closed and not reviewed/submitted?

> Integer Overflow of BinaryStatistics#isSmallerThan()
> ----------------------------------------------------
>
> Key: PARQUET-1690
> URL: https://issues.apache.org/jira/browse/PARQUET-1690
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
> Labels: pull-request-available
>
> "(min.length() + max.length()) < size" does not handle integer overflow:
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L103
[jira] [Commented] (PARQUET-2255) BloomFilter and floating point is ambiguous
[ https://issues.apache.org/jira/browse/PARQUET-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699732#comment-17699732 ]

Gabor Szadovszky commented on PARQUET-2255:
-------------------------------------------

But we don't build the dictionary for filtering, we build it for encoding. We should not add anything other than what we have in the pages, so any such handling should be added on the read path. Maybe we do not need to handle +0.0 and -0.0 differently from the other values. (We needed to handle them separately for min/max values because the comparison is not trivial and there were actual issues.) If someone deals with FP numbers, they should know about the difference between +0.0 and -0.0. Because the FP spec allows multiple NaN bit patterns (even though Java uses one canonical bit pattern for it), we need to avoid using the Bloom filter in this case. The dictionary is a different thing, because we deserialize it to Java Double/Float values in a Set, so we will have one NaN value that is the very same one we are searching for. (It is more for the other implementations to deal with NaN if the language has several NaN values.)

> BloomFilter and floating point is ambiguous
> -------------------------------------------
>
> Key: PARQUET-2255
> URL: https://issues.apache.org/jira/browse/PARQUET-2255
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Xuwei Fu
> Priority: Major
> Fix For: format-2.9.0
>
> Currently, our Parquet can use a BloomFilter for any physical type. However,
> when the BloomFilter is applied on float:
> # What do +0 and -0 mean? Are they equal?
> # Should qNaN/sNaN be written in the BloomFilter? Are they equal?
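The pitfalls discussed above are visible directly in Java's bit-level view of doubles: a bloom filter keyed on raw bits would treat these values as distinct keys. A short illustrative sketch (plain JDK, not Parquet code):

```java
public class FpBits {
    public static void main(String[] args) {
        // +0.0 and -0.0 compare equal, but their bit patterns differ,
        // so a filter hashing raw bits would miss one of them.
        System.out.println(0.0 == -0.0);                       // true
        System.out.println(Double.doubleToRawLongBits(0.0) ==
                           Double.doubleToRawLongBits(-0.0)); // false

        // Many bit patterns are NaN; doubleToLongBits canonicalizes
        // them to a single pattern, doubleToRawLongBits does not.
        long oddNaN = 0x7FF0000000000001L; // a non-canonical NaN pattern
        System.out.println(Double.isNaN(Double.longBitsToDouble(oddNaN)));      // true
        System.out.println(Double.doubleToLongBits(Double.longBitsToDouble(oddNaN))
                           == Double.doubleToLongBits(Double.NaN));             // true
    }
}
```

This is why the comment above notes that Java effectively sees one NaN, while other languages and writers may emit several NaN bit patterns into the pages.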
[jira] [Commented] (PARQUET-2255) BloomFilter and floating point is ambiguous
[ https://issues.apache.org/jira/browse/PARQUET-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699712#comment-17699712 ]

Gabor Szadovszky commented on PARQUET-2255:
-------------------------------------------

Bloom filters are for searching for exact values. Exact equality checks on floating point numbers are usually a code smell; checking whether the difference is below an epsilon value is usually suggested over exact equality. I am wondering if there is a real use case for searching for an exact floating point number. Maybe disabling bloom filters completely for FP numbers is the simplest choice and probably won't bother anyone. If we still want to handle FP bloom filters, I agree with [~wgtmac]'s proposal. (It is a similar approach to what we implemented for min/max values.) Keep in mind that we need to handle the case when someone wants to filter on a NaN.
[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size
[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697510#comment-17697510 ]

Gabor Szadovszky commented on PARQUET-2254:
-------------------------------------------

1) I think for creating bloom filters we have the statistics to decide how much space the bloom filter shall occupy (we have the actual data). What we don't know is whether the bloom filter in itself will be useful or not. (Will there be filtering on the related column, and will it use Eq/NotEq/IsIn-like predicates?) This shall be decided by the client via the already introduced properties. We do not write bloom filters by default anyway.
2) Of course it is hard to be smart for PPD, since we don't know the actual data (we are just before reading it). But there is an actual order of checking the row group filters: statistics, dictionary, bloom filter. Checking the statistics first is obviously correct. What I am not sure about is whether we want to check the dictionary first and then the bloom filter, or the other way around. Because of that question I am also unsure whether it is a good practice to not write bloom filters when the whole column chunk is dictionary encoded.

> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>
> Now the usage is to specify the size and then build the BloomFilter. In general
> scenarios it is actually not known how many distinct values there are.
> If the BloomFilter can be automatically sized according to the data, the file
> size can be reduced and the reading efficiency can also be improved.
> I have an idea: the user can specify a maximum BloomFilter size, then we
> build multiple BloomFilters at the same time, and we can use the largest
> BloomFilter as a counting tool (if there is no hit when inserting a value,
> the counter is incremented; of course this may be imprecise, but it is enough).
> Then at the end of the write, choose a BloomFilter of a more appropriate size
> when the file is finally written.
> I want to implement this feature and hope to get your opinions, thank you
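For reference, the classic bloom filter sizing formula explains the "2M per column" figure mentioned elsewhere in this digest: with ndv guessed at 1M and fpp = 0.01, the optimal filter needs about 1.2 MB of bits, which a power-of-two block-split implementation rounds up to 2 MB. A sketch of the arithmetic (standard bloom filter math, not parquet-mr code; the class name is made up):

```java
public class BloomSize {
    // Optimal number of bits for n distinct values at false-positive
    // probability p: m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long ndv, double fpp) {
        return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    // Round the byte count up to the next power of two, as block-split
    // bloom filter implementations typically do.
    static long nextPowerOfTwoBytes(long bits) {
        long bytes = (bits + 7) / 8;
        if (bytes <= 1) return 1;
        return Long.highestOneBit(bytes - 1) << 1;
    }

    public static void main(String[] args) {
        long bits = optimalBits(1_000_000, 0.01);
        System.out.println(bits + " bits -> " + nextPowerOfTwoBytes(bits) + " bytes");
    }
}
```

A more precise ndv estimate feeds directly into `optimalBits`, which is exactly the saving the proposal above is after.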
[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size
[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697301#comment-17697301 ]

Gabor Szadovszky commented on PARQUET-2254:
-------------------------------------------

I think this is a good idea. Meanwhile, it would increase the memory footprint of the writer. However, if you plan to keep the current logic that the user decides which columns bloom filters are generated for, it should be acceptable. However, I think we need to take one step back and investigate/synchronize the efforts around row group filtering. Or maybe it is only me for whom the following questions are not obvious? :)
* Is it always true that reading the dictionary for filtering is cheaper than reading the bloom filter? Bloom filters should usually be smaller than dictionaries and faster to scan for a value.
* Based on the previous one: if we decide that it might be worth reading the bloom filter before the dictionary, it also questions the logic of not writing bloom filters when the whole column chunk is dictionary encoded.
* Meanwhile, if the whole column chunk is dictionary encoded but the dictionary is still small (the redundancy is high), then it might not be worth writing a bloom filter, since checking the dictionary might be cheaper.
What do you think?
[jira] [Assigned] (PARQUET-2254) Build a BloomFilter with a more precise size
[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2254: - Assignee: Mars > Build a BloomFilter with a more precise size > > > Key: PARQUET-2254 > URL: https://issues.apache.org/jira/browse/PARQUET-2254 > Project: Parquet > Issue Type: Improvement >Reporter: Mars >Assignee: Mars >Priority: Major > > Now the usage is to specify the size, and then build BloomFilter. In general > scenarios, it is actually not sure how much the distinct value is. > If BloomFilter can be automatically generated according to the data, the file > size can be reduced and the reading efficiency can also be improved. > I have an idea that the user can specify a maximum BloomFilter filter size, > then we build multiple BloomFilter at the same time, we can use the largest > BloomFilter as a counting tool( If there is no hit when inserting a value, > the counter will be +1, of course this may be imprecise but enough) > Then at the end of the write, choose a BloomFilter of a more appropriate size > when the file is finally written. > I want to implement this feature and hope to get your opinions, thank you -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2246) Add short circuit logic to column index filter
[ https://issues.apache.org/jira/browse/PARQUET-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2246. --- Resolution: Fixed > Add short circuit logic to column index filter > -- > > Key: PARQUET-2246 > URL: https://issues.apache.org/jira/browse/PARQUET-2246 > Project: Parquet > Issue Type: Improvement >Reporter: Yujiang Zhong >Assignee: Yujiang Zhong >Priority: Minor > > ColumnIndexFilter can be optimized by adding short-circuit logic to `AND` and > `OR` operations. It's not necessary to evaluating the right node in some > cases: > * If the left result row ranges of `AND` is empty > * If the left result row ranges of `OR` is full range of the row-group -- This message was sent by Atlassian Jira (v8.20.10#820010)
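The two short-circuit cases described in PARQUET-2246 can be sketched with lazily evaluated right-hand sides. The names and the plain row count standing in for `RowRanges` are illustrative, not the real `ColumnIndexFilter` API, and the min/max combiners are placeholders for real range intersection/union.

```java
import java.util.function.Supplier;

/** Sketch of short-circuit AND/OR over row ranges (reduced to row counts):
 *  an empty left side short-circuits AND, a full-row-group left side
 *  short-circuits OR — the right subtree is never evaluated in either case. */
final class ShortCircuitEval {
  static long and(long leftRowCount, Supplier<Long> right) {
    if (leftRowCount == 0) return 0;            // empty: skip the right subtree
    return Math.min(leftRowCount, right.get()); // placeholder for intersection
  }

  static long or(long leftRowCount, long totalRows, Supplier<Long> right) {
    if (leftRowCount == totalRows) return totalRows; // full range: skip the right subtree
    return Math.max(leftRowCount, right.get());      // placeholder for union
  }
}
```

Passing the right side as a `Supplier` is what makes the evaluation lazy; without it, both subtrees would be computed before the operator runs.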
[jira] [Assigned] (PARQUET-2246) Add short circuit logic to column index filter
[ https://issues.apache.org/jira/browse/PARQUET-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2246: - Assignee: Yujiang Zhong > Add short circuit logic to column index filter > -- > > Key: PARQUET-2246 > URL: https://issues.apache.org/jira/browse/PARQUET-2246 > Project: Parquet > Issue Type: Improvement >Reporter: Yujiang Zhong >Assignee: Yujiang Zhong >Priority: Minor > > ColumnIndexFilter can be optimized by adding short-circuit logic to `AND` and > `OR` operations. It's not necessary to evaluating the right node in some > cases: > * If the left result row ranges of `AND` is empty > * If the left result row ranges of `OR` is full range of the row-group -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2243) Support zstd-jni in DirectCodecFactory
[ https://issues.apache.org/jira/browse/PARQUET-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2243. --- Resolution: Fixed > Support zstd-jni in DirectCodecFactory > -- > > Key: PARQUET-2243 > URL: https://issues.apache.org/jira/browse/PARQUET-2243 > Project: Parquet > Issue Type: Bug >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > During switching to zstd-jni (from the Hadoop native zstd codec) we missed to > add proper implementations for {{DirectCodecFactory}}. Currently, NPE occurs > in case of the {{DirectCodecFactory}} is used while reading/writing. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow
[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2247: - Assignee: dzcxzl (was: Gabor Szadovszky) > Fail-fast if CapacityByteArrayOutputStream write overflow > - > > Key: PARQUET-2247 > URL: https://issues.apache.org/jira/browse/PARQUET-2247 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > > The bytesUsed of CapacityByteArrayOutputStream may overflow when writing some > large byte data, resulting in parquet file write corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow
[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2247. --- Resolution: Fixed > Fail-fast if CapacityByteArrayOutputStream write overflow > - > > Key: PARQUET-2247 > URL: https://issues.apache.org/jira/browse/PARQUET-2247 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > > The bytesUsed of CapacityByteArrayOutputStream may overflow when writing some > large byte data, resulting in parquet file write corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow
[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2247: - Assignee: Gabor Szadovszky > Fail-fast if CapacityByteArrayOutputStream write overflow > - > > Key: PARQUET-2247 > URL: https://issues.apache.org/jira/browse/PARQUET-2247 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: dzcxzl >Assignee: Gabor Szadovszky >Priority: Critical > > The bytesUsed of CapacityByteArrayOutputStream may overflow when writing some > large byte data, resulting in parquet file write corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010)
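The fail-fast described in PARQUET-2247 amounts to replacing the unchecked accumulation of `bytesUsed` with an overflow-checked add, so corruption is reported at write time instead of producing a broken file. A minimal sketch with a hypothetical class, not the actual `CapacityByteArrayOutputStream` code:

```java
/** Sketch of the fail-fast: track the byte counter with Math.addExact so an
 *  overflow raises immediately instead of silently wrapping around. */
final class CheckedByteCounter {
  private long bytesUsed = 0;

  void addBytes(long n) {
    if (n < 0) {
      throw new IllegalArgumentException("negative write length: " + n);
    }
    try {
      bytesUsed = Math.addExact(bytesUsed, n); // throws ArithmeticException on overflow
    } catch (ArithmeticException e) {
      throw new IllegalStateException("byte counter overflow", e);
    }
  }

  long bytesUsed() {
    return bytesUsed;
  }
}
```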
[jira] [Resolved] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls
[ https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2241. --- Resolution: Fixed > ByteStreamSplitDecoder broken in presence of nulls > -- > > Key: PARQUET-2241 > URL: https://issues.apache.org/jira/browse/PARQUET-2241 > Project: Parquet > Issue Type: Bug > Components: parquet-format, parquet-mr >Affects Versions: format-2.8.0 >Reporter: Xuwei Fu >Assignee: Gang Wu >Priority: Major > > > This problem is shown in this issue: > [https://github.com/apache/arrow/issues/15173] > Let me talk about it briefly: > * The encoder doesn't write "num_values" on the page payload for BYTE_STREAM_SPLIT, > but uses "num_values" as the stride in BYTE_STREAM_SPLIT > * When decoding, for DATA_PAGE_V2, it can know the num_values and num_nulls in > the page; however, in DATA_PAGE_V1, without statistics, we should read > def-levels and rep-levels to get the real num-of-values. And without the > num-of-values, we aren't able to decode BYTE_STREAM_SPLIT correctly > > The bug-reproducing code is in the issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2228. --- Resolution: Fixed > ParquetRewriter supports more than one input file > - > > Key: PARQUET-2228 > URL: https://issues.apache.org/jira/browse/PARQUET-2228 > Project: Parquet > Issue Type: Sub-task > Components: parquet-mr >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > > ParquetRewriter currently supports only one input file. The scope of this > task is to support multiple input files and the rewriter merges them into a > single one w/o some rewrite options specified. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2244: - Assignee: Yujiang Zhong > Dictionary filter may skip row-groups incorrectly when evaluating notIn > --- > > Key: PARQUET-2244 > URL: https://issues.apache.org/jira/browse/PARQUET-2244 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.2 >Reporter: Yujiang Zhong >Assignee: Yujiang Zhong >Priority: Major > > Dictionary filter may skip row-groups incorrectly when evaluating `notIn` on > optional columns with null values. Here is an example: > Say there is a optional column `c1` with all pages dict encoded, `c1` has and > only has two distinct values: ['foo', null], and the predicate is `c1 not > in ('foo', 'bar')`. > Now dictionary filter may skip this row-group that is actually should not be > skipped, because there are nulls in the column. > > This is a bug similar to #1510. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2244. --- Resolution: Fixed > Dictionary filter may skip row-groups incorrectly when evaluating notIn > --- > > Key: PARQUET-2244 > URL: https://issues.apache.org/jira/browse/PARQUET-2244 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.2 >Reporter: Yujiang Zhong >Assignee: Yujiang Zhong >Priority: Major > > Dictionary filter may skip row-groups incorrectly when evaluating `notIn` on > optional columns with null values. Here is an example: > Say there is a optional column `c1` with all pages dict encoded, `c1` has and > only has two distinct values: ['foo', null], and the predicate is `c1 not > in ('foo', 'bar')`. > Now dictionary filter may skip this row-group that is actually should not be > skipped, because there are nulls in the column. > > This is a bug similar to #1510. -- This message was sent by Atlassian Jira (v8.20.10#820010)
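The corrected drop rule for `notIn` from PARQUET-2244 can be stated compactly: a dictionary-encoded row group may be dropped only when every dictionary value is excluded *and* the column cannot contain nulls, because the dictionary never lists null and null rows still satisfy the predicate under parquet-mr's filter semantics. Hypothetical helper, not the real `DictionaryFilter`:

```java
import java.util.Set;

/** Sketch of the fixed canDrop rule for `c NOT IN (S)` over a fully
 *  dictionary-encoded column chunk. */
final class NotInDrop {
  static boolean canDrop(Set<String> dictionaryValues, Set<String> notInValues,
                         boolean mayHaveNulls) {
    if (mayHaveNulls) {
      return false; // null rows are not in the dictionary and must keep the row group
    }
    return notInValues.containsAll(dictionaryValues); // every stored value is excluded
  }
}
```

In the ticket's example, dictionary `{'foo'}` is contained in `('foo', 'bar')`, but `mayHaveNulls` is true, so the row group is kept.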
[jira] [Created] (PARQUET-2243) Support zstd-jni in DirectCodecFactory
Gabor Szadovszky created PARQUET-2243: - Summary: Support zstd-jni in DirectCodecFactory Key: PARQUET-2243 URL: https://issues.apache.org/jira/browse/PARQUET-2243 Project: Parquet Issue Type: Bug Reporter: Gabor Szadovszky Assignee: Gabor Szadovszky While switching to zstd-jni (from the Hadoop native zstd codec) we missed adding proper implementations for {{DirectCodecFactory}}. Currently, an NPE occurs when the {{DirectCodecFactory}} is used while reading/writing. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls
[ https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688363#comment-17688363 ] Gabor Szadovszky commented on PARQUET-2241: --- [~wgtmac], related to your question about production: I haven't seen any usage of BYTE_STREAM_SPLIT in prod. Production environments I have been working on were stuck on Parquet v1 encodings. > ByteStreamSplitDecoder broken in presence of nulls > -- > > Key: PARQUET-2241 > URL: https://issues.apache.org/jira/browse/PARQUET-2241 > Project: Parquet > Issue Type: Bug > Components: parquet-format, parquet-mr >Affects Versions: format-2.8.0 >Reporter: Xuwei Fu >Assignee: Gang Wu >Priority: Major > > > This problem is shown in this issue: > [https://github.com/apache/arrow/issues/15173] > Let me talk about it briefly: > * The encoder doesn't write "num_values" on the page payload for BYTE_STREAM_SPLIT, > but uses "num_values" as the stride in BYTE_STREAM_SPLIT > * When decoding, for DATA_PAGE_V2, it can know the num_values and num_nulls in > the page; however, in DATA_PAGE_V1, without statistics, we should read > def-levels and rep-levels to get the real num-of-values. And without the > num-of-values, we aren't able to decode BYTE_STREAM_SPLIT correctly > > The bug-reproducing code is in the issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls
[ https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688363#comment-17688363 ] Gabor Szadovszky edited comment on PARQUET-2241 at 2/14/23 8:37 AM: [~wgtmac], realted to your question about production. I haven't seen any usage of BYTE_STREAM_SPLIT in prod. Production envrionments I have been working on were stuck to Parquet v1 encodings. was (Author: gszadovszky): [~wgtmac], realted to your question about production. I haven't seen any usage of BYTE_STREAM_SPLIT in prod. Production envrionments I have been working on was stuck to Parquet v1 encodings. > ByteStreamSplitDecoder broken in presence of nulls > -- > > Key: PARQUET-2241 > URL: https://issues.apache.org/jira/browse/PARQUET-2241 > Project: Parquet > Issue Type: Bug > Components: parquet-format, parquet-mr >Affects Versions: format-2.8.0 >Reporter: Xuwei Fu >Assignee: Gang Wu >Priority: Major > > > This problem is shown in this issue: > [https://github.com/apache/arrow/issues/15173|https://github.com/apache/arrow/issues/15173Let] > Let me talk about it briefly: > * Encoder doesn't write "num_values" on Page payload for BYTE_STREAM_SPLIT, > but using "num_values" as stride in BYTE_STREAM_SPLIT > * When decoding, for DATA_PAGE_V2, it can now the num_values and num_nulls in > the page, however, in DATA_PAGE_V1, without statistics, we should read > def-levels and rep-levels to get the real num-of-values. And without the > num-of-values, we aren't able to decode BYTE_STREAM_SPLIT correctly > > The bug-reproducing code is in the issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
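Why the decoder needs the exact value count: BYTE_STREAM_SPLIT scatters the k-th byte of each value into stream k, and the stride between streams is the number of encoded (non-null) values — so a DATA_PAGE_V1 reader that only knows the level count (nulls included) cannot locate the streams. A minimal FLOAT round trip illustrating the layout (sketch, not the parquet-mr `ByteStreamSplitDecoder`):

```java
/** Minimal BYTE_STREAM_SPLIT round trip for FLOAT: stream k holds the k-th
 *  (little-endian) byte of every value, and the stride between streams is
 *  exactly the count of encoded values. */
final class ByteStreamSplit {
  static byte[] encode(float[] values) {
    int n = values.length;
    byte[] out = new byte[n * 4];
    for (int i = 0; i < n; i++) {
      int bits = Float.floatToIntBits(values[i]);
      for (int k = 0; k < 4; k++) {
        out[k * n + i] = (byte) (bits >>> (8 * k)); // stream k, stride n
      }
    }
    return out;
  }

  static float[] decode(byte[] in, int numValues) {
    float[] out = new float[numValues];
    for (int i = 0; i < numValues; i++) {
      int bits = 0;
      for (int k = 0; k < 4; k++) {
        bits |= (in[k * numValues + i] & 0xFF) << (8 * k);
      }
      out[i] = Float.intBitsToFloat(bits);
    }
    return out;
  }
}
```

If `decode` were called with the level count instead of the non-null value count, every stream offset `k * numValues` past the first would be wrong, which is exactly the corruption the ticket describes.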
[jira] [Resolved] (PARQUET-2226) Support merge Bloom Filter
[ https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2226. --- Resolution: Fixed > Support merge Bloom Filter > -- > > Key: PARQUET-2226 > URL: https://issues.apache.org/jira/browse/PARQUET-2226 > Project: Parquet > Issue Type: Improvement >Reporter: Mars >Assignee: miracle >Priority: Major > > We need to collect Parquet's bloom filter of multiple files, and then > synthesize a more comprehensive bloom filter for common use. > Guava supports similar api operations > https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (PARQUET-2226) Support merge Bloom Filter
[ https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2226: - Assignee: miracle > Support merge Bloom Filter > -- > > Key: PARQUET-2226 > URL: https://issues.apache.org/jira/browse/PARQUET-2226 > Project: Parquet > Issue Type: Improvement >Reporter: Mars >Assignee: miracle >Priority: Major > > We need to collect Parquet's bloom filter of multiple files, and then > synthesize a more comprehensive bloom filter for common use. > Guava supports similar api operations > https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (PARQUET-2226) Support merge Bloom Filter
[ https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2226: - Assignee: (was: miracle) > Support merge Bloom Filter > -- > > Key: PARQUET-2226 > URL: https://issues.apache.org/jira/browse/PARQUET-2226 > Project: Parquet > Issue Type: Improvement >Reporter: Mars >Priority: Major > > We need to collect Parquet's bloom filter of multiple files, and then > synthesize a more comprehensive bloom filter for common use. > Guava supports similar api operations > https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (PARQUET-2226) Support merge Bloom Filter
[ https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2226: - Assignee: miracle > Support merge Bloom Filter > -- > > Key: PARQUET-2226 > URL: https://issues.apache.org/jira/browse/PARQUET-2226 > Project: Parquet > Issue Type: Improvement >Reporter: Mars >Assignee: miracle >Priority: Major > > We need to collect Parquet's bloom filter of multiple files, and then > synthesize a more comprehensive bloom filter for common use. > Guava supports similar api operations > https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252 -- This message was sent by Atlassian Jira (v8.20.10#820010)
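Merging Bloom filters as requested in PARQUET-2226 reduces to OR-ing the bit arrays, which is only sound when both filters were built with the same bit size and hash strategy — the same compatibility precondition Guava's `BloomFilter.putAll` enforces. Illustrative sketch, not the parquet-mr API:

```java
import java.util.BitSet;

/** Sketch of merging two Bloom filters of the same configuration by OR-ing
 *  their bits; the result answers mightContain for the union of both inputs. */
final class BloomMerge {
  static BitSet merge(BitSet a, int aBits, BitSet b, int bBits) {
    if (aBits != bBits) {
      throw new IllegalArgumentException(
          "incompatible filter sizes: " + aBits + " vs " + bBits);
    }
    BitSet merged = (BitSet) a.clone();
    merged.or(b); // union of set bits => union of inserted values
    return merged;
  }
}
```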
[jira] [Commented] (PARQUET-1980) Build and test Apache Parquet on ARM64 CPU architecture
[ https://issues.apache.org/jira/browse/PARQUET-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656754#comment-17656754 ] Gabor Szadovszky commented on PARQUET-1980: --- Perfect. Thank you, [~mgrigorov]! > Build and test Apache Parquet on ARM64 CPU architecture > --- > > Key: PARQUET-1980 > URL: https://issues.apache.org/jira/browse/PARQUET-1980 > Project: Parquet > Issue Type: Test > Components: parquet-format >Reporter: Martin Tzvetanov Grigorov >Assignee: Martin Tzvetanov Grigorov >Priority: Minor > Labels: pull-request-available > Fix For: 1.12.0 > > > More and more deployments are being done on ARM64 machines. > It would be good to make sure Parquet MR project builds fine on it. > The project moved from TravisCI to GitHub Actions recently (PARQUET-1969) but > .travis.yml could be re-intorduced for ARM64 until GitHub Actions provide > aarch64 nodes! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (PARQUET-1980) Build and test Apache Parquet on ARM64 CPU architecture
[ https://issues.apache.org/jira/browse/PARQUET-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reopened PARQUET-1980: --- [~mgrigorov], PMC just got a note from Apache IT that they are about to "move away from Travis at the beginning of 2023". I don't know whether GitHub Actions are now suitable for ARM64 or whether there are any other solutions for this. If you have time, could you please take a look? > Build and test Apache Parquet on ARM64 CPU architecture > --- > > Key: PARQUET-1980 > URL: https://issues.apache.org/jira/browse/PARQUET-1980 > Project: Parquet > Issue Type: Test > Components: parquet-format >Reporter: Martin Tzvetanov Grigorov >Assignee: Martin Tzvetanov Grigorov >Priority: Minor > Labels: pull-request-available > Fix For: 1.12.0 > > > More and more deployments are being done on ARM64 machines. > It would be good to make sure the Parquet MR project builds fine on it. > The project moved from TravisCI to GitHub Actions recently (PARQUET-1969) but > .travis.yml could be re-introduced for ARM64 until GitHub Actions provide > aarch64 nodes! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2220) Parquet Filter predicate storing nested string causing OOM's
[ https://issues.apache.org/jira/browse/PARQUET-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653313#comment-17653313 ] Gabor Szadovszky commented on PARQUET-2220: --- [~abhiSumo304], I agree eagerly storing the toString value is not a good idea. I don't think it has proper use case either. toString should be used for debugging purposes anyway so eagerly storing the value does not really make sense. Unfortunately, I don't work on the Parquet code base actively anymore. Feel free to put up a PR to fix this and I'll try to review it in time. > Parquet Filter predicate storing nested string causing OOM's > > > Key: PARQUET-2220 > URL: https://issues.apache.org/jira/browse/PARQUET-2220 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Abhishek Jain >Priority: Critical > > Each Instance of ColumnFilterPredicate stores the filter values in toString > variable eagerly. Which is not useful > {code:java} > static abstract class ColumnFilterPredicate> > implements FilterPredicate, Serializable { > private final Column column; > private final T value; > private final String toString; > protected ColumnFilterPredicate(Column column, T value) { > this.column = Objects.requireNonNull(column, "column cannot be null"); > // Eq and NotEq allow value to be null, Lt, Gt, LtEq, GtEq however do not, > so they guard against > // null in their own constructors. > this.value = value; > String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH); > this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + > value + ")"; > }{code} > > > If your filter predicate is too long/nested this can take a lot of memory > while creating Filter. > We have seen in our productions this can go upto 4gbs of space while opening > multiple parquet readers > Same thing is replicated in BinaryLogicalFilterPredicate. 
Where toString is > eagerly calculated and stored as a string, and a lot of duplication happens > while building And/Or filters. > I did not find a use case for storing it so eagerly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
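The fix suggested in the comment — compute the string form lazily in `toString()` instead of eagerly in the constructor — might look like this. The class shape is hypothetical; it only mirrors the quoted `ColumnFilterPredicate` snippet, not the actual parquet-mr code:

```java
import java.util.Locale;
import java.util.Objects;

/** Sketch of the lazy variant: no toString field is built or retained, so
 *  deeply nested filter trees no longer pin large concatenated strings. */
final class LazyColumnPredicate {
  private final String columnPath;
  private final Object value;

  LazyColumnPredicate(String columnPath, Object value) {
    this.columnPath = Objects.requireNonNull(columnPath, "column cannot be null");
    this.value = value; // Eq/NotEq allow null, as in the original snippet
  }

  @Override
  public String toString() {
    // Built on demand, for debugging only — never cached.
    String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
    return name + "(" + columnPath + ", " + value + ")";
  }
}
```

Since predicates are immutable, callers that really do print repeatedly could memoize the result themselves, but the hot path of constructing filters pays nothing.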
[jira] [Assigned] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2159: - Assignee: Fang-Xie > Parquet bit-packing de/encode optimization > -- > > Key: PARQUET-2159 > URL: https://issues.apache.org/jira/browse/PARQUET-2159 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Fang-Xie >Assignee: Fang-Xie >Priority: Major > Fix For: 1.13.0 > > Attachments: image-2022-06-15-22-56-08-396.png, > image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, > image-2022-06-15-22-58-40-704.png > > > Current Spark use Parquet-mr as parquet reader/writer library, but the > built-in bit-packing en/decode is not efficient enough. > Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector > in Open JDK18 brings prominent performance improvement. > Due to Vector API is added to OpenJDK since 16, So this optimization request > JDK16 or higher. > *Below are our test results* > Functional test is based on open-source parquet-mr Bit-pack decoding > function: *_public final void unpack8Values(final byte[] in, final int inPos, > final int[] out, final int outPos)_* __ > compared with our implementation with vector API *_public final void > unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final > int outPos)_* > We tested 10 pairs (open source parquet bit unpacking vs ours optimized > vectorized SIMD implementation) decode function with bit > width=\{1,2,3,4,5,6,7,8,9,10}, below are test results: > !image-2022-06-15-22-56-08-396.png|width=437,height=223! > We integrated our bit-packing decode implementation into parquet-mr, tested > the parquet batch reader ability from Spark VectorizedParquetRecordReader > which get parquet column data by the batch way. 
We construct parquet files > with different row and column counts; the column data type is Int32 and the > maximum int value is 127, which satisfies bit-pack encoding with bit width=7; > the count of rows ranges from 10k to 100 million and the count of columns > from 1 to 4. > !image-2022-06-15-22-57-15-964.png|width=453,height=229! > !image-2022-06-15-22-58-01-442.png|width=439,height=217! > !image-2022-06-15-22-58-40-704.png|width=415,height=208! -- This message was sent by Atlassian Jira (v8.20.10#820010)
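For reference, the scalar baseline being vectorized has this shape: each `unpack8Values` call pulls eight little-endian bit-packed values out of the input array. This is an illustrative bit-at-a-time reimplementation (with a matching packer for testing), not the generated parquet-mr `ByteBitPacking` code; the proposed optimization replaces these inner loops with `jdk.incubator.vector` SIMD operations on JDK 16+.

```java
/** Scalar reference of the bit-unpacking hot loop: 8 values of bitWidth bits
 *  each, packed little-endian (bit 0 of byte 0 first), as in Parquet's
 *  RLE/bit-packing hybrid. */
final class BitUnpack {
  static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
    long bitOffset = 0;
    for (int v = 0; v < 8; v++) {
      int value = 0;
      for (int b = 0; b < bitWidth; b++, bitOffset++) {
        int byteIndex = inPos + (int) (bitOffset >>> 3);
        int bit = (in[byteIndex] >>> (int) (bitOffset & 7)) & 1;
        value |= bit << b;
      }
      out[outPos + v] = value;
    }
  }

  /** Inverse of unpack8Values, for round-trip testing: 8 * bitWidth bits = bitWidth bytes. */
  static byte[] pack8Values(int[] values, int bitWidth) {
    byte[] out = new byte[bitWidth];
    long bitOffset = 0;
    for (int v = 0; v < 8; v++) {
      for (int b = 0; b < bitWidth; b++, bitOffset++) {
        if (((values[v] >>> b) & 1) != 0) {
          out[(int) (bitOffset >>> 3)] |= (byte) (1 << (int) (bitOffset & 7));
        }
      }
    }
    return out;
  }
}
```

The per-bit loop makes the data dependency obvious: every output value touches `bitWidth` bits serially, which is exactly the work a vector lane can do in parallel across all eight outputs.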
[jira] [Commented] (PARQUET-2020) Remove deprecated modules
[ https://issues.apache.org/jira/browse/PARQUET-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17616825#comment-17616825 ] Gabor Szadovszky commented on PARQUET-2020: --- [~Unsta], the module {{parquet-cli}} is meant to substitute the functionality of {{parquet-tools}}. {{parquet-cli}} might have the functionality you need. However, neither of them was designed to have its classes used publicly. (There are no guarantees the changes will be backward compatible.) I don't think that {{parquet-format-structures}} would be a good fit to place such functionality either. This module is for reading/writing the footer and also not designed to be used by our clients. The question is if you need this json representation for production use or for debugging purposes. In case of the latter one we might want to create a new module inside parquet-mr for tools to be used from the java API. We might factor out some existing implementation from {{parquet-cli}} and maybe having back something from {{parquet-tools}} if required. If however you need the reading to json (and maybe writing from it) for production use I would suggest having a new binding for json just like we have {{parquet-avro}}, {{parquet-protobuf}}, {{parquet-thrift}} etc. Unfortunately, I won't have time to guide you with any of these choices. I would suggest bringing up this topic on [mailto:dev@parquet.apache.org] to have broader audience. > Remove deprecated modules > - > > Key: PARQUET-2020 > URL: https://issues.apache.org/jira/browse/PARQUET-2020 > Project: Parquet > Issue Type: Improvement > Components: parquet-cascading >Affects Versions: 1.12.0 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.13.0 > > > Removes: > * parquet-tools-deprecated > * parquet-scrooge-deprecated > * parquet-cascading-common23-deprecated > * parquet-cascading-deprecated > * parquet-cascading3-deprecated -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types
[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614907#comment-17614907 ] Gabor Szadovszky commented on PARQUET-1222: --- [~emkornfield], There are a couple of docs in the parquet-format repo. The related ones are [about logical types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] and the main one that contains the description of the [primitive types|https://github.com/apache/parquet-format/blob/master/README.md#types]. Unfortunately, the latter one does not contain anything about sorting order. So, I think we need to do the following: * Define the sorting order for the primitive types or reference the logical types description for it. (In most cases it would be referencing since the ordering depends on the related logical types e.g. signed/unsigned sorting of integral types) * After defining the sorting order of the primitive floating point numbers based on what we've discussed above, reference it from the new half-precision FP logical type. (Another unfortunate thing is that we have some specification-like docs at the [parquet site|https://parquet.apache.org] as well. I think we should propagate the parquet-format docs to there automatically or simply link them from the site. But it is clearly a different topic.) > Specify a well-defined sorting order for float and double types > --- > > Key: PARQUET-1222 > URL: https://issues.apache.org/jira/browse/PARQUET-1222 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Zoltan Ivanfi >Priority: Critical > > Currently parquet-format specifies the sort order for floating point numbers > as follows: > {code:java} >* FLOAT - signed comparison of the represented value >* DOUBLE - signed comparison of the represented value > {code} > The problem is that the comparison of floating point numbers is only a > partial ordering with strange behaviour in specific corner cases. 
For > example, according to IEEE 754, -0 is neither less nor more than \+0 and > comparing NaN to anything always returns false. This ordering is not suitable > for statistics. Additionally, the Java implementation already uses a > different (total) ordering that handles these cases correctly but differently > than the C\+\+ implementations, which leads to interoperability problems. > TypeDefinedOrder for doubles and floats should be deprecated and a new > TotalFloatingPointOrder should be introduced. The default for writing doubles > and floats would be the new TotalFloatingPointOrder. This ordering should be > effective and easy to implement in all programming languages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types
[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611398#comment-17611398 ] Gabor Szadovszky commented on PARQUET-1222: --- [~emkornfield], I think we do not need to handle NaN values with a boolean to fix this issue. NaN is kind of similar to null values, so we may even count them instead of having a boolean, but this question is not tightly related to this topic. What do you think about elevating the current suggestion in the thrift file to specification level for writing/reading FP min/max values? {quote}Because the sorting order is not specified properly for floating point values (relations vs. total ordering) the following compatibility rules should be applied when reading statistics: * If the min is a NaN, it should be ignored. * If the max is a NaN, it should be ignored. * If the min is +0, the row group may contain -0 values as well. * If the max is -0, the row group may contain +0 values as well. * When looking for NaN values, min and max should be ignored.{quote} For writing we shall skip NaN values and use -0 for min and +0 for max any time a 0 is to be taken into account. With this solution we cannot do anything clever when searching for a NaN, but that can be fixed separately. And we also need to double-check whether we really ignore the min/max stats when searching for a NaN. I think it is a good idea to discuss such topics on the mailing list. However, we should also time-box the discussion and go forward with a proposed solution if there is no interest on the mailing list. (Personally, I do not follow the dev list anymore.) 
> Specify a well-defined sorting order for float and double types > --- > > Key: PARQUET-1222 > URL: https://issues.apache.org/jira/browse/PARQUET-1222 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Zoltan Ivanfi >Priority: Critical > > Currently parquet-format specifies the sort order for floating point numbers > as follows: > {code:java} >* FLOAT - signed comparison of the represented value >* DOUBLE - signed comparison of the represented value > {code} > The problem is that the comparison of floating point numbers is only a > partial ordering with strange behaviour in specific corner cases. For > example, according to IEEE 754, -0 is neither less nor more than \+0 and > comparing NaN to anything always returns false. This ordering is not suitable > for statistics. Additionally, the Java implementation already uses a > different (total) ordering that handles these cases correctly but differently > than the C\+\+ implementations, which leads to interoperability problems. > TypeDefinedOrder for doubles and floats should be deprecated and a new > TotalFloatingPointOrder should be introduced. The default for writing doubles > and floats would be the new TotalFloatingPointOrder. This ordering should be > effective and easy to implement in all programming languages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
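The quoted read-side compatibility rules can be sketched as a small helper: a NaN bound invalidates the statistics, and zero bounds are widened so that -0.0/+0.0 rows are never skipped. This is a hypothetical helper illustrating the rules, not a parquet-mr API:

```java
/** Sketch of the FLOAT/DOUBLE min/max compatibility rules for readers. */
final class FloatStatsRules {
  /** Returns the usable [min, max] bounds, or null when the stats must be ignored. */
  static double[] usableBounds(double min, double max) {
    if (Double.isNaN(min) || Double.isNaN(max)) {
      return null; // a NaN bound means the statistics cannot be trusted
    }
    // A +0.0 min may hide -0.0 rows, and a -0.0 max may hide +0.0 rows — widen both.
    if (min == 0.0) min = -0.0;
    if (max == 0.0) max = +0.0;
    return new double[] {min, max};
  }
}
```

Note that `min == 0.0` is true for both signed zeros, so the widening is unconditional on the sign — which matches the write-side rule of always using -0 for min and +0 for max.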
[jira] [Created] (PARQUET-2182) Handle unknown logical types
Gabor Szadovszky created PARQUET-2182: - Summary: Handle unknown logical types Key: PARQUET-2182 URL: https://issues.apache.org/jira/browse/PARQUET-2182 Project: Parquet Issue Type: Bug Reporter: Gabor Szadovszky New logical types introduced in parquet-format shall be properly handled in parquet-mr releases that are not aware of this new type. In this case we shall read the data as if only the primitive type would be defined (without a logical type) with one exception: We shall not use min/max based statistics (including column indexes) since we don't know the proper ordering of that type. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (PARQUET-2094) Handle negative values in page headers
[ https://issues.apache.org/jira/browse/PARQUET-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky updated PARQUET-2094: -- External issue ID: CVE-2021-41561 External issue URL: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41561 > Handle negative values in page headers > -- > > Key: PARQUET-2094 > URL: https://issues.apache.org/jira/browse/PARQUET-2094 > Project: Parquet > Issue Type: Bug >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Fix For: 1.11.2, 1.12.2 > > > There are integer values in the page headers that should be always positive > (e.g. length). I am not sure if we properly handle the cases if they are not > positive. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (PARQUET-2106) BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
[ https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky updated PARQUET-2106: -- Issue Type: Improvement (was: Task) > BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path > --- > > Key: PARQUET-2106 > URL: https://issues.apache.org/jira/browse/PARQUET-2106 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.2 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Major > Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, > profile_48449_alloc_1638494450_sort_by.html > > > *Background* > While writing out large Parquet tables using Spark, we've noticed that > BinaryComparator is the source of substantial churn of extremely short-lived > `HeapByteBuffer` objects – It's taking up to *16%* of total amount of > allocations in our benchmarks, putting substantial pressure on a Garbage > Collector: > !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521! > [^profile_48449_alloc_1638494450_sort_by.html] > > *Proposal* > We're proposing to adjust lexicographical comparison (at least) to avoid > doing any allocations, since this code lies on the hot-path of every Parquet > write, therefore causing substantial churn amplification. > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
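The allocation-free comparison proposed in PARQUET-2106 can be sketched as a plain index-based loop over the backing arrays, so no intermediate {{ByteBuffer}} is created on the comparison path. This is a hypothetical helper for illustration, not the actual parquet-mr change:

```java
public class Lexicographic {
    // Unsigned lexicographic comparison on array slices with no object
    // allocation, unlike wrapping each slice in a ByteBuffer per comparison.
    public static int compare(byte[] a, int aOff, int aLen,
                              byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            // Mask to compare bytes as unsigned values (0..255).
            int x = a[aOff + i] & 0xFF;
            int y = b[bOff + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return aLen - bLen; // a shorter prefix sorts first
    }
}
```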
[jira] [Assigned] (PARQUET-2106) BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
[ https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2106: - Assignee: Alexey Kudinkin > BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path > --- > > Key: PARQUET-2106 > URL: https://issues.apache.org/jira/browse/PARQUET-2106 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Affects Versions: 1.12.2 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Major > Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, > profile_48449_alloc_1638494450_sort_by.html > > > *Background* > While writing out large Parquet tables using Spark, we've noticed that > BinaryComparator is the source of substantial churn of extremely short-lived > `HeapByteBuffer` objects – It's taking up to *16%* of total amount of > allocations in our benchmarks, putting substantial pressure on a Garbage > Collector: > !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521! > [^profile_48449_alloc_1638494450_sort_by.html] > > *Proposal* > We're proposing to adjust lexicographical comparison (at least) to avoid > doing any allocations, since this code lies on the hot-path of every Parquet > write, therefore causing substantial churn amplification. > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (PARQUET-2107) Travis failures
[ https://issues.apache.org/jira/browse/PARQUET-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2107. --- Resolution: Fixed > Travis failures > --- > > Key: PARQUET-2107 > URL: https://issues.apache.org/jira/browse/PARQUET-2107 > Project: Parquet > Issue Type: Bug >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > There have been Travis failures in our PRs for a while. See e.g. > https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598285 or > https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598286 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (PARQUET-2107) Travis failures
Gabor Szadovszky created PARQUET-2107: - Summary: Travis failures Key: PARQUET-2107 URL: https://issues.apache.org/jira/browse/PARQUET-2107 Project: Parquet Issue Type: Bug Reporter: Gabor Szadovszky Assignee: Gabor Szadovszky There have been Travis failures in our PRs for a while. See e.g. https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598285 or https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598286 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-2104) parquet-cli broken in master
[ https://issues.apache.org/jira/browse/PARQUET-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448483#comment-17448483 ] Gabor Szadovszky commented on PARQUET-2104: --- [~gamaken], I am not sure about a workaround. I've tried this on master as well as on the tags of the releases 1.12.2 and 1.11.2. All works the same way. :( One idea is to use parquet-tools instead of parquet-cli. It has similar functionality. However, parquet-tools has been deprecated in 1.12.0 and removed in the current master. You may want to try it with an older tag (e.g. apache-parquet-1.11.2). > parquet-cli broken in master > > > Key: PARQUET-2104 > URL: https://issues.apache.org/jira/browse/PARQUET-2104 > Project: Parquet > Issue Type: Bug > Components: parquet-cli >Affects Versions: 1.12.2 > Environment: ubuntu 18.04 and ubuntu 20.04 >Reporter: Balaji K >Priority: Major > > Creating a Jira per this thread: > [https://lists.apache.org/thread/k233838g010lvbp81s99floqjmm7nnvs] > # clone parquet-mr and build the repo locally > # run parquet-cli without Hadoop (according to this ReadMe > <[https://github.com/apache/parquet-mr/tree/master/parquet-cli#running-without-hadoop]> > ) > # try a command that deserializes data such as cat or head > # observe NoSuchMethodError being thrown > *Error stack:* ~/repos/parquet-mr/parquet-cli$ parquet cat > ../../testdata/dictionaryEncodingSample.parquet WARNING: An illegal > reflective access operation has occurred .. 
> Exception in thread "main" java.lang.NoSuchMethodError: > 'org.apache.avro.Schema > org.apache.parquet.avro.AvroSchemaConverter.convert(org.apache.parquet.schema.MessageType)' > at org.apache.parquet.cli.util.Schemas.fromParquet(Schemas.java:89) at > org.apache.parquet.cli.BaseCommand.getAvroSchema(BaseCommand.java:405) at > org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66) at > org.apache.parquet.cli.Main.run(Main.java:157) at > org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at > org.apache.parquet.cli.Main.main(Main.java:187) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON
[ https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447455#comment-17447455 ] Gabor Szadovszky commented on PARQUET-2103: --- I think we need to update [ParquetMetadata.toJSON|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java#L67-L71]. Jackson should be configurable to read the private fields instead of looking for getter methods. I am not sure if it is a good idea or if it will work in every environment. Another option would be to refactor EncryptedColumnChunkMetaData so that a getter does not call "decrypt", but it might not be worth the effort. The easiest way would be to simply detect whether the metadata contains encrypted data and not log anything. I don't know how important it might be to log the metadata when debugging. > crypto exception in print toPrettyJSON > -- > > Key: PARQUET-2103 > URL: https://issues.apache.org/jira/browse/PARQUET-2103 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0, 1.12.1, 1.12.2 >Reporter: Gidon Gershinsky >Priority: Major > > In debug mode, this code > {{if (LOG.isDebugEnabled()) {}} > {{ LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}} > {{}}} > called in > {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}} > > _*for unencrypted files*_ > triggers an exception: > > {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor }} > {{ at > org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) > ~[parquet-hadoop-1.12.0jar:1.12.0]}} > {{ at > org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353) > ~[parquet-hadoop-1.12.0jar:1.12.0]}} > {{ at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[?:?]}} > {{ at > jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:?]}} > {{ at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:?]}} > {{ at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) >
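A configuration sketch for the first option from the comment above: telling Jackson to serialize from private fields so that getters (and hence the decryption they trigger) are never invoked. This targets the plain Jackson API and is untested against the shaded parquet-jackson classes:

```java
import com.fasterxml.jackson.annotation.JsonAutoDetect.Visibility;
import com.fasterxml.jackson.annotation.PropertyAccessor;
import com.fasterxml.jackson.databind.ObjectMapper;

public class FieldOnlyJson {
    // Build a mapper that ignores getters entirely and reads private fields,
    // so a getter such as getEncodingStats() is never called during
    // serialization and cannot raise ParquetCryptoRuntimeException.
    public static ObjectMapper fieldOnlyMapper() {
        ObjectMapper mapper = new ObjectMapper();
        mapper.setVisibility(PropertyAccessor.ALL, Visibility.NONE);
        mapper.setVisibility(PropertyAccessor.FIELD, Visibility.ANY);
        return mapper;
    }
}
```

Whether field-based serialization produces the same JSON shape in every environment is exactly the open question raised in the comment.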
[jira] [Resolved] (PARQUET-2101) Fix wrong descriptions about the default block size
[ https://issues.apache.org/jira/browse/PARQUET-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2101. --- Resolution: Fixed > Fix wrong descriptions about the default block size > --- > > Key: PARQUET-2101 > URL: https://issues.apache.org/jira/browse/PARQUET-2101 > Project: Parquet > Issue Type: Bug > Components: parquet-avro, parquet-mr, parquet-protobuf >Reporter: Kengo Seki >Assignee: Kengo Seki >Priority: Trivial > > https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L90 > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L240 > https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoParquetWriter.java#L80 > These javadocs say the default block size is 50 MB but it's actually 128MB. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2094) Handle negative values in page headers
[ https://issues.apache.org/jira/browse/PARQUET-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky updated PARQUET-2094: -- Fix Version/s: 1.12.2 1.11.2 > Handle negative values in page headers > -- > > Key: PARQUET-2094 > URL: https://issues.apache.org/jira/browse/PARQUET-2094 > Project: Parquet > Issue Type: Bug >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Fix For: 1.11.2, 1.12.2 > > > There are integer values in the page headers that should be always positive > (e.g. length). I am not sure if we properly handle the cases if they are not > positive. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2094) Handle negative values in page headers
[ https://issues.apache.org/jira/browse/PARQUET-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2094. --- Resolution: Fixed > Handle negative values in page headers > -- > > Key: PARQUET-2094 > URL: https://issues.apache.org/jira/browse/PARQUET-2094 > Project: Parquet > Issue Type: Bug >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > There are integer values in the page headers that should be always positive > (e.g. length). I am not sure if we properly handle the cases if they are not > positive. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-1968. --- Resolution: Fixed > FilterApi support In predicate > -- > > Key: PARQUET-1968 > URL: https://issues.apache.org/jira/browse/PARQUET-1968 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Yuming Wang >Assignee: Huaxin Gao >Priority: Major > > FilterApi should support native In predicate. > Spark: > https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605 > Impala: > https://issues.apache.org/jira/browse/IMPALA-3654 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-1968: - Assignee: Huaxin Gao > FilterApi support In predicate > -- > > Key: PARQUET-1968 > URL: https://issues.apache.org/jira/browse/PARQUET-1968 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Yuming Wang >Assignee: Huaxin Gao >Priority: Major > > FilterApi should support native In predicate. > Spark: > https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605 > Impala: > https://issues.apache.org/jira/browse/IMPALA-3654 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2096) Upgrade Thrift to 0.15.0
[ https://issues.apache.org/jira/browse/PARQUET-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2096. --- Resolution: Fixed > Upgrade Thrift to 0.15.0 > > > Key: PARQUET-2096 > URL: https://issues.apache.org/jira/browse/PARQUET-2096 > Project: Parquet > Issue Type: Improvement >Reporter: Vinoo Ganesh >Assignee: Vinoo Ganesh >Priority: Minor > > Thrift 0.15.0 is currently the default in brew: > [https://github.com/Homebrew/homebrew-core/blob/82d03f657371e1541a9a5e5de57c5e1aa00acd45/Formula/thrift.rb#L4.|https://github.com/Homebrew/homebrew-core/blob/master/Formula/thrift.rb#L4.] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-2096) Upgrade Thrift to 0.15.0
[ https://issues.apache.org/jira/browse/PARQUET-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2096: - Assignee: Vinoo Ganesh > Upgrade Thrift to 0.15.0 > > > Key: PARQUET-2096 > URL: https://issues.apache.org/jira/browse/PARQUET-2096 > Project: Parquet > Issue Type: Improvement >Reporter: Vinoo Ganesh >Assignee: Vinoo Ganesh >Priority: Minor > > Thrift 0.15.0 is currently the default in brew: > [https://github.com/Homebrew/homebrew-core/blob/82d03f657371e1541a9a5e5de57c5e1aa00acd45/Formula/thrift.rb#L4.|https://github.com/Homebrew/homebrew-core/blob/master/Formula/thrift.rb#L4.] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset
[ https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421270#comment-17421270 ] Gabor Szadovszky commented on PARQUET-2080: --- [~gershinsky], could you make the doc available for comments? > Deprecate RowGroup.file_offset > -- > > Key: PARQUET-2080 > URL: https://issues.apache.org/jira/browse/PARQUET-2080 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Gabor Szadovszky >Assignee: Gidon Gershinsky >Priority: Major > > Due to PARQUET-2078 RowGroup.file_offset is not reliable. > This field is also wrongly calculated in the C++ oss parquet implementation > PARQUET-2089 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2094) Handle negative values in page headers
Gabor Szadovszky created PARQUET-2094: - Summary: Handle negative values in page headers Key: PARQUET-2094 URL: https://issues.apache.org/jira/browse/PARQUET-2094 Project: Parquet Issue Type: Bug Reporter: Gabor Szadovszky Assignee: Gabor Szadovszky There are integer values in the page headers that should be always positive (e.g. length). I am not sure if we properly handle the cases if they are not positive. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-118) Provide option to use on-heap buffers for Snappy compression/decompression
[ https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418202#comment-17418202 ] Gabor Szadovszky commented on PARQUET-118: -- [~MasterDDT], Unfortunately I can only say something similar to what Julien said in the first comment. I'm happy to review any PRs on this topic. :) > Provide option to use on-heap buffers for Snappy compression/decompression > -- > > Key: PARQUET-118 > URL: https://issues.apache.org/jira/browse/PARQUET-118 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.6.0 >Reporter: Patrick Wendell >Priority: Major > > The current code uses direct off-heap buffers for decompression. If many > decompressors are instantiated across multiple threads, and/or the objects > being decompressed are large, this can lead to a huge amount of off-heap > allocation by the JVM. This can be exacerbated if, overall, there is no heap > contention, since no GC will be performed to reclaim the space used by these > buffers. > It would be nice if there was a flag we could use to simply allocate on-heap > buffers here: > https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28 > We ran into an issue today where these buffers totaled a very large amount of > storage and caused our Java processes (running within containers) to be > terminated by the kernel OOM-killer. -- This message was sent by Atlassian Jira (v8.3.4#803005)
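The requested flag boils down to a one-line choice in the {{java.nio}} API. A minimal sketch, assuming a hypothetical {{onHeap}} option:

```java
import java.nio.ByteBuffer;

public class BufferAlloc {
    // Direct buffers live in native memory outside the GC heap, so heavy use
    // without heap pressure can grow native memory until the OS steps in
    // (e.g. the container OOM-killer). On-heap buffers are ordinary Java
    // objects backed by a byte[] and reclaimed by normal GC.
    public static ByteBuffer allocate(int size, boolean onHeap) {
        return onHeap
                ? ByteBuffer.allocate(size)        // on-heap, backed by byte[]
                : ByteBuffer.allocateDirect(size); // off-heap native memory
    }
}
```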
[jira] [Commented] (PARQUET-2091) Fix release build error introduced by PARQUET-2043
[ https://issues.apache.org/jira/browse/PARQUET-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417540#comment-17417540 ] Gabor Szadovszky commented on PARQUET-2091: --- Strange to me because the release command should not do anything more (related to dependencies) than a {{mvn verify}}. Isn't it possible that this issue occurred only on the 1.12.x branch and the master doesn't have this issue? > Fix release build error introduced by PARQUET-2043 > -- > > Key: PARQUET-2091 > URL: https://issues.apache.org/jira/browse/PARQUET-2091 > Project: Parquet > Issue Type: Task >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > After PARQUET-2043 when building for a release like 1.12.1, there is build > error complaining 'used undeclared dependency'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2088) Different created_by field values for application and library
[ https://issues.apache.org/jira/browse/PARQUET-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415378#comment-17415378 ] Gabor Szadovszky commented on PARQUET-2088: --- parquet-mr automatically fills the {{created_by}} field by using FULL_VERSION. The components using it (Hive/Spark) do not have to populate anything. So if parquet-mr writes a file, the proper full version string of parquet-mr will be written to the field every time. You are right that there is no separate field for the version of the "higher level" application. (I remember some discussions about this topic but could not find them in the jiras :( ) The issue here is which application version we should store. For example, there is customer code that uses a tool written for Spark that writes the parquet file. We can make mistakes at any level that may cause invalid values (from a certain point of view). So how should we handle this and how can we formalize it? Also, how can we enforce that client code fills these fields? Anyway, if you have a proposal feel free to write to the dev list. > Different created_by field values for application and library > - > > Key: PARQUET-2088 > URL: https://issues.apache.org/jira/browse/PARQUET-2088 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: format-2.9.0 >Reporter: Joshua Howard >Priority: Minor > > There seems to be a discrepancy in the Parquet format created_by field > regarding how it should be filled out. The parquet-mr library uses this value > to enable/disable features based on the parquet-mr version > [here|https://github.com/apache/parquet-mr/blob/5f403501e9de05b6aa48f028191b4e78bb97fb12/parquet-column/src/main/java/org/apache/parquet/CorruptDeltaByteArrays.java#L64-L68]. > Meanwhile, users are encouraged to make use of the application version > [here|https://www.javadoc.io/doc/org.apache.parquet/parquet-format/latest/org/apache/parquet/format/FileMetaData.html]. 
> It seems like there are multiple fields needed for an application and > library version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2091) Fix release build error introduced by PARQUET-2043
[ https://issues.apache.org/jira/browse/PARQUET-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414886#comment-17414886 ] Gabor Szadovszky commented on PARQUET-2091: --- [~sha...@uber.com], do you have issues with building on master? Just checked and it is working fine on my environment. (Also seems to be working at the PR checks.) > Fix release build error introduced by PARQUET-2043 > -- > > Key: PARQUET-2091 > URL: https://issues.apache.org/jira/browse/PARQUET-2091 > Project: Parquet > Issue Type: Task >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > After PARQUET-2043 when building for a release like 1.12.1, there is build > error complaining 'used undeclared dependency'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2084) Upgrade Thrift to 0.14.2
[ https://issues.apache.org/jira/browse/PARQUET-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2084. --- Resolution: Fixed > Upgrade Thrift to 0.14.2 > > > Key: PARQUET-2084 > URL: https://issues.apache.org/jira/browse/PARQUET-2084 > Project: Parquet > Issue Type: Improvement >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2083) Expose getFieldPath from ColumnIO
[ https://issues.apache.org/jira/browse/PARQUET-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2083. --- Resolution: Fixed > Expose getFieldPath from ColumnIO > - > > Key: PARQUET-2083 > URL: https://issues.apache.org/jira/browse/PARQUET-2083 > Project: Parquet > Issue Type: Improvement >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Minor > > Similar to PARUQET-2050, this exposes {{getFieldPath}} from {{ColumnIO}} so > downstream apps such as Spark can use it to assemble nested records. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2088) Different created_by field values for application and library
[ https://issues.apache.org/jira/browse/PARQUET-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414829#comment-17414829 ] Gabor Szadovszky commented on PARQUET-2088: --- Ah, I see. So that code part is not about a feature but about a bug fix. It is a pain of file format implementations that you not only have to fix issues in the code but also have to deal with invalid files written by that faulty code (if it was released). In this case we had to implement a workaround for the invalid files written by parquet-mr releases before 1.8.0. I am not sure how the Impala reader/writer works. I work on parquet-mr, and Impala is not a tightly coupled part of the Parquet community. It is more an example that the created_by field has to be filled by the application that actually implements the writing of the parquet files. So e.g. Hive, Spark etc. won't ever be listed here as they use parquet-mr to write/read the files. Impala has its own writer/reader implementation. > Different created_by field values for application and library > - > > Key: PARQUET-2088 > URL: https://issues.apache.org/jira/browse/PARQUET-2088 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: format-2.9.0 >Reporter: Joshua Howard >Priority: Minor > > There seems to be a discrepancy in the Parquet format created_by field > regarding how it should be filled out. The parquet-mr library uses this value > to enable/disable features based on the parquet-mr version > [here|https://github.com/apache/parquet-mr/blob/5f403501e9de05b6aa48f028191b4e78bb97fb12/parquet-column/src/main/java/org/apache/parquet/CorruptDeltaByteArrays.java#L64-L68]. > Meanwhile, users are encouraged to make use of the application version > [here|https://www.javadoc.io/doc/org.apache.parquet/parquet-format/latest/org/apache/parquet/format/FileMetaData.html]. > It seems like there are multiple fields needed for an application and > library version. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2085) Formatting is broken for description of BIT_PACKED
[ https://issues.apache.org/jira/browse/PARQUET-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414823#comment-17414823 ] Gabor Szadovszky commented on PARQUET-2085: --- [~alexott], I got it now. You are talking about the [Parquet site|http://parquet.apache.org/documentation/latest/]. I was confused because the PR is in the parquet-format repo. The official site has a separate repository: https://github.com/apache/parquet-site. It is a bit tricky to update (you need to install old ruby libs and generate the HTML pages manually) but if you would like to give it a try, feel free to create a new PR on the site repo. > Formatting is broken for description of BIT_PACKED > -- > > Key: PARQUET-2085 > URL: https://issues.apache.org/jira/browse/PARQUET-2085 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Alex Ott >Priority: Minor > > The Nested Encoding section of documentation doesn't escape the {{_}} > character, so it looks as follows: > Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is > now used as it supersedes BIT_PACKED. > instead of > Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is > now used as it supersedes BIT_PACKED. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2078. --- Resolution: Fixed Since the PR is merged I am resolving this. > Failed to read parquet file after writing with the same parquet version > --- > > Key: PARQUET-2078 > URL: https://issues.apache.org/jira/browse/PARQUET-2078 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Nemon Lou >Assignee: Nemon Lou >Priority: Critical > Fix For: 1.13.0, 1.12.1 > > Attachments: > PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, > tpcds_customer_footer.json > > > Writing parquet file with version 1.12.0 in Apache Hive, then read that > file, returns the following error: > {noformat} > Caused by: java.lang.IllegalStateException: All of the offsets in the split > should be found in the file. expected: [4, 133961161] found: > [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED > [c_customer_sk] optional int64 c_customer_sk [PLAIN, RLE, BIT_PACKED], 4}, > ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id > (STRING) [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED > [c_current_cdemo_sk] optional int64 c_current_cdemo_sk [PLAIN, RLE, > BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] > optional int64 c_current_hdemo_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], > 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 > c_current_addr_sk [PLAIN, RLE, BIT_PACKED], 57421932}, > ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 > c_first_shipto_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, > ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 > c_first_sales_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, > ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
74461508}, > ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, > ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, > ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary > c_preferred_cust_flag (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], > 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 > c_birth_day [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, > ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month > [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED > [c_birth_year] optional int32 c_birth_year [RLE, PLAIN_DICTIONARY, > BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] > optional binary c_birth_country (STRING) [RLE, PLAIN_DICTIONARY, > BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary > c_login (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, > ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address > (STRING) [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED > [c_last_review_date_sk] optional int64 c_last_review_date_sk [RLE, > PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}] > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172) > ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0] > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0] > at > org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:96) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) ~[?:1.8.0_292] > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > ~[?:1.8.0_292] > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > ~[?:1.8.0_292] > at
[jira] [Commented] (PARQUET-2088) Different created_by field values for application and library
[ https://issues.apache.org/jira/browse/PARQUET-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414092#comment-17414092 ] Gabor Szadovszky commented on PARQUET-2088: --- Could you please list the exact features you think parquet-mr enables or disables based on {{created_by}}? This field is filled in by the actual writer implementation (e.g. Impala, parquet-mr, parquet-cpp etc.). The example already shows how to use it: {{impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)}} > Different created_by field values for application and library > - > > Key: PARQUET-2088 > URL: https://issues.apache.org/jira/browse/PARQUET-2088 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: format-2.9.0 >Reporter: Joshua Howard >Priority: Minor > > There seems to be a discrepancy in the Parquet format created_by field > regarding how it should be filled out. The parquet-mr library uses this value > to enable/disable features based on the parquet-mr version [here|#L64-L68]. > Meanwhile, users are encouraged to make use of the application version > [here|https://www.javadoc.io/doc/org.apache.parquet/parquet-format/latest/org/apache/parquet/format/FileMetaData.html]. > It seems like there are multiple fields needed for an application and > library version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
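The convention cited above ({{<application> version <version> (build <build-hash>)}}) can be parsed with a simple pattern. The sketch below is illustrative only: the class name and regex are assumptions, not parquet-mr's actual {{created_by}} parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the created_by convention:
//   "<application> version <version> (build <build-hash>)"
// This is a sketch, not parquet-mr's real parser.
public class CreatedBySketch {
    private static final Pattern CREATED_BY =
        Pattern.compile("(.+?) version (.+?)(?: \\(build (.+)\\))?");

    /** Returns {application, version, build} (build may be null), or null for free-form strings. */
    public static String[] parse(String createdBy) {
        Matcher m = CREATED_BY.matcher(createdBy);
        if (!m.matches()) {
            return null; // unknown writer: readers should not gate features on it
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }
}
```

Note the null return for unrecognized strings: a reader that cannot parse {{created_by}} has to fall back to format-level defaults rather than writer-specific behavior.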
[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset
[ https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414053#comment-17414053 ] Gabor Szadovszky commented on PARQUET-2080: --- [~gershinsky], although the original topic of this jira is invalid, we still need to add proper comments to {{RowGroup.file_offset}} describing the situation of PARQUET-2078 and helping implementations handle the potentially wrong value. Would you like to handle this? > Deprecate RowGroup.file_offset > -- > > Key: PARQUET-2080 > URL: https://issues.apache.org/jira/browse/PARQUET-2080 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall deprecate > the field and add suggestions on how to calculate the value. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2080) Deprecate RowGroup.file_offset
Gabor Szadovszky created PARQUET-2080: - Summary: Deprecate RowGroup.file_offset Key: PARQUET-2080 URL: https://issues.apache.org/jira/browse/PARQUET-2080 Project: Parquet Issue Type: Bug Components: parquet-format Reporter: Gabor Szadovszky Assignee: Gabor Szadovszky Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall deprecate the field and add suggestions how to calculate the value. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2078: - Assignee: Nemon Lou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17406722#comment-17406722 ] Gabor Szadovszky commented on PARQUET-2078: --- [~nemon], you are right, so {{dictionaryPageOffset}} is not impacted. Great news. On second look, it is not required that the first column be dictionary encoded before the invalid row group. It is enough that there are dictionary encoded column chunks in the previous row groups and that the first column chunk of the invalid row group is not dictionary encoded. So, [~nemon], you are also right with your PR.
[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17406621#comment-17406621 ] Gabor Szadovszky commented on PARQUET-2078: --- [~nemon], I am not sure how that would be possible. RowGroup.file_offset is set by using the dictionary page offset of the first column chunk (if there is one): * [rowGroup.setFile_offset(block.getStartingPos())|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L580] * [BlockMetaData.getStartingPos()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/BlockMetaData.java#L102-L104] * [ColumnChunkMetaData.getStartingPos()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java#L184-L193] As per my understanding, the wrong offset for {{rowGroup ~n~}} (where we have {{k}} columns) requires the following: * {{columnChunk ~n-1, 1~}} (the first column chunk of {{rowGroup ~n-1~}}) is dictionary encoded, as is {{columnChunk ~n-1, k~}} * {{columnChunk ~n, 1~}} is not dictionary encoded In this case {{fileOffset ~n~ = dictionaryOffset ~n, 1~ = dictionaryOffset ~n-1, k~}}. To catch this issue we should check whether a column chunk is dictionary encoded before using its dictionary offset. Unfortunately, we have to do the same before using the file offset of a row group, or simply ignore that value and use the offset of the first column chunk together with the check.
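The check described above (only trust the dictionary page offset when the chunk is actually dictionary encoded, and derive the row-group start from the first column chunk rather than {{RowGroup.file_offset}}) can be sketched as follows. {{ColumnChunk}} here is a hypothetical, simplified stand-in for parquet-mr's {{ColumnChunkMetaData}}, not the real class:

```java
// Hypothetical, simplified metadata holder; parquet-mr's real class is
// ColumnChunkMetaData with getDictionaryPageOffset()/getFirstDataPageOffset().
class ColumnChunk {
    final long dictionaryPageOffset;  // may be stale/invalid (PARQUET-2078)
    final long firstDataPageOffset;
    final boolean dictionaryEncoded;  // derived from the chunk's encodings

    ColumnChunk(long dictOffset, long dataOffset, boolean dictEncoded) {
        this.dictionaryPageOffset = dictOffset;
        this.firstDataPageOffset = dataOffset;
        this.dictionaryEncoded = dictEncoded;
    }

    // Safe starting position: the dictionary page offset is only meaningful
    // (and only precedes the data pages) when the chunk is dictionary encoded.
    long startingPos() {
        if (dictionaryEncoded && dictionaryPageOffset > 0
                && dictionaryPageOffset < firstDataPageOffset) {
            return dictionaryPageOffset;
        }
        return firstDataPageOffset;
    }
}
```

A reader following this approach would take the first column chunk's {{startingPos()}} as the row-group start instead of trusting a possibly stale {{file_offset}}.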
[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405698#comment-17405698 ] Gabor Szadovszky commented on PARQUET-2078: --- Added the dev list thread link here to keep both sides in the loop.
[jira] [Comment Edited] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405677#comment-17405677 ] Gabor Szadovszky edited comment on PARQUET-2078 at 8/27/21, 8:50 AM: - [~nemon], thanks a lot for the detailed explanation and the patch! So what I have written before stands. Before 1.12.0 we did not write the dictionary offset to the column chunk metadata (see PARQUET-1850) even though the calculation was wrong since the beginning. Since we released 1.12.0 already it means we have to prepare for the invalid dictionary offset values. What we need to handle in a fix: * Fix the calculation issue (see the attached patch) * Add unit test for this issue to ensure it works properly and won't happen again * Investigate all code parts where the dictionary offset and file offset are used and prepare for invalid values [~nemon], would you like to work on this by opening a PR on github? was (Author: gszadovszky): [~nemon], thanks a lot for the detailed explanation and the patch! So what I have written before stands. Before 1.12.0 we did not write the dictionary offset to the column chunk metadata (see PARQUET-1850) even though the calculation was wrong since the beginning. Since we released 1.12.0 already it means we have to prepare for the invalid dictionary offset values. What we need to handle in a fix: * Fix the calculation issue (see the attached patch) * Add unit test for this issue to ensure it works properly and won't happen again * Investigate all code parts where the dictionary offset is used and prepare for invalid values [~nemon], would you like to work on this by opening a PR on github? 
[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405677#comment-17405677 ] Gabor Szadovszky commented on PARQUET-2078: --- [~nemon], thanks a lot for the detailed explanation and the patch! So what I have written before stands. Before 1.12.0 we did not write the dictionary offset to the column chunk metadata (see PARQUET-1850) even though the calculation was wrong since the beginning. Since we released 1.12.0 already it means we have to prepare for the invalid dictionary offset values. What we need to handle in a fix: * Fix the calculation issue (see the attached patch) * Add unit test for this issue to ensure it works properly and won't happen again * Investigate all code parts where the dictionary offset is used and prepare for invalid values [~nemon], would you like to work on this by opening a PR on github?
[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405227#comment-17405227 ] Gabor Szadovszky commented on PARQUET-2078: --- [~nemon], thanks a lot for the investigation. What is not clear to me is how we could have set the wrong value for {{RowGroup.file_offset}}. Based on the code in [ParquetMetadataConverter|https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L580] we use the starting position of the first column chunk of the actual row group. The starting position of the column chunk is the dictionary page offset or the first data page offset, whichever is smaller (because the dictionary page is always at the starting position of the column chunk). If the dictionary page offset or the first data page offset were wrong, we should have other issues as well. Can you read the file content without using InputSplits (e.g. parquet-tools, parquet-cli or java code that reads the whole file)? There is a new parquet-cli tool called footer that can list the raw footer of the file. It would be interesting to see its output on the related parquet file. Unfortunately, this feature is not released yet so it has to be built from master. If you are interested in doing so, please check the [readme|https://github.com/apache/parquet-mr/blob/master/parquet-cli/README.md] for details. If you are right and we have been writing invalid offsets to files since 1.12.0, it is a serious issue. We not only have to fix the writing path but the reading path as well, since files written by 1.12.0 already exist. 
> Failed to read parquet file after writing with the same parquet version > --- > > Key: PARQUET-2078 > URL: https://issues.apache.org/jira/browse/PARQUET-2078 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Nemon Lou >Priority: Critical > > Writing parquet file with version 1.12.0 in Apache Hive, then read that > file, returns the following error: > {noformat} > Caused by: java.lang.IllegalStateException: All of the offsets in the split > should be found in the file. expected: [4, 133961161] found: > [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED > [c_customer_sk] optional int64 c_customer_sk [PLAIN, RLE, BIT_PACKED], 4}, > ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id > (STRING) [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED > [c_current_cdemo_sk] optional int64 c_current_cdemo_sk [PLAIN, RLE, > BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] > optional int64 c_current_hdemo_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], > 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 > c_current_addr_sk [PLAIN, RLE, BIT_PACKED], 57421932}, > ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 > c_first_shipto_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, > ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 > c_first_sales_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, > ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, > ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, > ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, > ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary > c_preferred_cust_flag (STRING) [RLE, 
PLAIN_DICTIONARY, BIT_PACKED], > 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 > c_birth_day [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, > ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month > [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED > [c_birth_year] optional int32 c_birth_year [RLE, PLAIN_DICTIONARY, > BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] > optional binary c_birth_country (STRING) [RLE, PLAIN_DICTIONARY, > BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary > c_login (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, > ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address > (STRING) [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED > [c_last_review_date_sk] optional int64 c_last_review_date_sk [RLE, > PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}] > at >
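The offset derivation described in the comment above (the column chunk starts at the dictionary page offset or the first data page offset, whichever is smaller) can be sketched as follows. This is a hypothetical illustration, not the actual parquet-mr code; the class and method names are made up:

```java
// Sketch of how RowGroup.file_offset is derived from the first column chunk.
// The dictionary page, when present, sits at the very start of the chunk, so
// the chunk begins at the smaller of the two stored offsets.
public class OffsetSketch {
    static long chunkStartOffset(long firstDataPageOffset,
                                 long dictionaryPageOffset,
                                 boolean hasDictionaryPage) {
        return hasDictionaryPage
                ? Math.min(dictionaryPageOffset, firstDataPageOffset)
                : firstDataPageOffset;
    }

    public static void main(String[] args) {
        // A chunk with a dictionary page at offset 4 starts at 4, which is
        // the value expected for the first row group's file_offset.
        System.out.println(chunkStartOffset(12243647L, 4L, true)); // 4
    }
}
```

If either stored offset is wrong, this minimum is wrong too, which is how an invalid file_offset would end up in the footer.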
[jira] [Updated] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky updated PARQUET-2078: -- Fix Version/s: 1.12.1 1.13.0 > Failed to read parquet file after writing with the same parquet version > --- > > Key: PARQUET-2078 > URL: https://issues.apache.org/jira/browse/PARQUET-2078 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Nemon Lou >Priority: Critical > Fix For: 1.13.0, 1.12.1 > > > Writing parquet file with version 1.12.0 in Apache Hive, then read that > file, returns the following error: > {noformat} > Caused by: java.lang.IllegalStateException: All of the offsets in the split > should be found in the file. expected: [4, 133961161] found: > [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED > [c_customer_sk] optional int64 c_customer_sk [PLAIN, RLE, BIT_PACKED], 4}, > ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id > (STRING) [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED > [c_current_cdemo_sk] optional int64 c_current_cdemo_sk [PLAIN, RLE, > BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] > optional int64 c_current_hdemo_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], > 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 > c_current_addr_sk [PLAIN, RLE, BIT_PACKED], 57421932}, > ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 > c_first_shipto_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, > ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 > c_first_sales_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, > ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, > ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, > 
ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, > ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary > c_preferred_cust_flag (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], > 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 > c_birth_day [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, > ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month > [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED > [c_birth_year] optional int32 c_birth_year [RLE, PLAIN_DICTIONARY, > BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] > optional binary c_birth_country (STRING) [RLE, PLAIN_DICTIONARY, > BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary > c_login (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, > ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address > (STRING) [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED > [c_last_review_date_sk] optional int64 c_last_review_date_sk [RLE, > PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}] > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172) > ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0] > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0] > at > org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > 
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:96) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) ~[?:1.8.0_292] > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > ~[?:1.8.0_292] > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > ~[?:1.8.0_292] > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > ~[?:1.8.0_292] > at >
[jira] [Commented] (PARQUET-2071) Encryption translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403039#comment-17403039 ] Gabor Szadovszky commented on PARQUET-2071: --- [~sha...@uber.com], sure, I am fine with having the "universal tool" and the required refactors handled under a separate jira. > Encryption translation tool > > > Key: PARQUET-2071 > URL: https://issues.apache.org/jira/browse/PARQUET-2071 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When translating existing data to encryption state, we could develop a tool > like TransCompression to translate the data at page level to encryption state > without reading to record and rewrite. This will speed up the process a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2064) Make Range public accessible in RowRanges
[ https://issues.apache.org/jira/browse/PARQUET-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2064. --- Resolution: Fixed > Make Range public accessible in RowRanges > - > > Key: PARQUET-2064 > URL: https://issues.apache.org/jira/browse/PARQUET-2064 > Project: Parquet > Issue Type: New Feature >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When rolling out to Presto, I found we need to know the boundaries of each > Range in RowRanges. It is still doable with an Iterator, but Presto has a batch > reader; we cannot use an iterator for each row. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2073) Is there something wrong calculate usedMem in ColumnWriteStoreBase.java
[ https://issues.apache.org/jira/browse/PARQUET-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2073. --- Resolution: Fixed > Is there something wrong calculate usedMem in ColumnWriteStoreBase.java > --- > > Key: PARQUET-2073 > URL: https://issues.apache.org/jira/browse/PARQUET-2073 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: JiangYang >Assignee: JiangYang >Priority: Critical > Attachments: image-2021-08-05-14-37-51-299.png > > > !image-2021-08-05-14-37-51-299.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2059) Tests require too much memory
[ https://issues.apache.org/jira/browse/PARQUET-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2059. --- Resolution: Fixed > Tests require too much memory > - > > Key: PARQUET-2059 > URL: https://issues.apache.org/jira/browse/PARQUET-2059 > Project: Parquet > Issue Type: Test >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > For testing the solution of PARQUET-1633 we require ~3GB memory that is not > always available. To solve this issue we temporarily disabled the implemented > unit test. > We need to ensure somehow that [this > test|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLargeColumnChunk.java] > (and maybe some other similar ones) are executed regularly. Some options we > might have: > * Execute this test separately with a maven profile. I am not sure if the CI > allows allocating such large memory but with Xmx options we might give it a try > and create a separate check for this test only. > * Similar to the previous with the profile but not executing in the CI ever. > Instead, we add some comments to the release doc so this test will be > executed at least once per release. > * Configuring the CI profile to skip this test but have it in the normal > scenario meaning the devs will execute it locally. There are a couple of cons > though. There is no guarantee that devs execute all the tests including this > one. It also can cause issues if the dev doesn't have enough memory and doesn't > know that the test failure is not related to the current change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2043) Fail build for used but not declared direct dependencies
[ https://issues.apache.org/jira/browse/PARQUET-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2043. --- Resolution: Fixed > Fail build for used but not declared direct dependencies > > > Key: PARQUET-2043 > URL: https://issues.apache.org/jira/browse/PARQUET-2043 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > It is always a good practice to specify all the dependencies directly used > (classes are imported from) by our modules. We have a couple of issues where > classes are imported from transitive dependencies. It makes it hard to validate > the actual dependency tree and also may result in using wrong versions of > classes (see PARQUET-2038 for example). > It would be good to enforce referencing such dependencies directly in the > module poms. The [maven-dependency-plugin analyze-only > goal|http://maven.apache.org/plugins/maven-dependency-plugin/analyze-only-mojo.html] > can be used for this purpose. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2063) Remove Compile Warnings from MemoryManager
[ https://issues.apache.org/jira/browse/PARQUET-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2063. --- Resolution: Fixed > Remove Compile Warnings from MemoryManager > -- > > Key: PARQUET-2063 > URL: https://issues.apache.org/jira/browse/PARQUET-2063 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2074) Upgrade to JDK 9+
[ https://issues.apache.org/jira/browse/PARQUET-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396113#comment-17396113 ] Gabor Szadovszky commented on PARQUET-2074: --- [~belugabehr], it sounds good to me but also keep in mind that switching to JDK9 and using its new capabilities would make parquet-mr incompatible with certain environments. Also, this would require a community agreement. I would suggest bringing up this topic in the next parquet sync (Aug 24) and/or starting a formal vote on the dev list. > Upgrade to JDK 9+ > - > > Key: PARQUET-2074 > URL: https://issues.apache.org/jira/browse/PARQUET-2074 > Project: Parquet > Issue Type: Improvement >Reporter: David Mollitor >Priority: Major > > Moving to JDK 9 will provide a plethora of new compares/equals capabilities > on arrays that are all based on vectorization and implement > {{\@IntrinsicCandidate}} > https://docs.oracle.com/javase/9/docs/api/java/util/Arrays.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-2073) Is there something wrong calculate usedMem in ColumnWriteStoreBase.java
[ https://issues.apache.org/jira/browse/PARQUET-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2073: - Assignee: JiangYang > Is there something wrong calculate usedMem in ColumnWriteStoreBase.java > --- > > Key: PARQUET-2073 > URL: https://issues.apache.org/jira/browse/PARQUET-2073 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: JiangYang >Assignee: JiangYang >Priority: Critical > Attachments: image-2021-08-05-14-37-51-299.png > > > !image-2021-08-05-14-37-51-299.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2072) Do Not Determine Both Min/Max for Binary Stats
[ https://issues.apache.org/jira/browse/PARQUET-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2072. --- Resolution: Fixed > Do Not Determine Both Min/Max for Binary Stats > -- > > Key: PARQUET-2072 > URL: https://issues.apache.org/jira/browse/PARQUET-2072 > Project: Parquet > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > > I'm looking at some benchmarking code of Apache ORC vs. Apache Parquet and > see that Parquet is quite a bit slower for writes (reads TBD). Based on my > investigation, I have noticed a significant amount of time spent in > determining min/max for binary types. > One quick improvement is to bypass the "max" value determination if the value > has already been determined to be a new "min". > While I'm at it, remove calls to deprecated functions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
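The min/max shortcut resolved above can be illustrated with a small sketch. It uses String as a stand-in for Parquet's Binary type and illustrative names, not the real parquet-mr statistics API: since min <= max always holds once both are set, a value strictly below the current min can never be a new max, so the max comparison can be skipped on that path.

```java
// Minimal sketch of the PARQUET-2072 optimization for binary statistics.
public class BinaryStatsSketch {
    String min, max;
    boolean hasValues;

    void update(String value) {
        if (!hasValues) {              // first value initializes both bounds
            min = max = value;
            hasValues = true;
            return;
        }
        if (value.compareTo(min) < 0) {
            min = value;               // a new min cannot also exceed max,
            return;                    // so the max comparison is skipped
        }
        if (value.compareTo(max) > 0) {
            max = value;
        }
    }

    public static void main(String[] args) {
        BinaryStatsSketch stats = new BinaryStatsSketch();
        for (String v : new String[] {"m", "a", "z"}) {
            stats.update(v);
        }
        System.out.println(stats.min + ".." + stats.max); // a..z
    }
}
```

On average this saves one binary comparison per value that shrinks the minimum, which adds up when writing many binary values per page.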
[jira] [Commented] (PARQUET-2073) Is there something wrong calculate usedMem in ColumnWriteStoreBase.java
[ https://issues.apache.org/jira/browse/PARQUET-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394605#comment-17394605 ] Gabor Szadovszky commented on PARQUET-2073: --- [~JiangYang], you're right, {{rowsToFillPage}} will always be zero. It means (because of [line 256|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L256]) that we never use this estimate correctly, so the next row count check will always step by {{props.getMinRowCountForPageSizeCheck()}}. Funny that it has worked this way ever since we introduced this estimation logic. Strange that no one has ever noticed. About fixing this issue: we can get proper results without casting: {code:java} rows * remainingMem / usedMem {code} Meanwhile, this form is a bit misleading, so we need some comments explaining that we are estimating the number of rows that can still be written to the page based on the average size of the rows already written. The tricky part is how to test it. This will be a new behavior of the page writing and we have never tested it properly. (Otherwise, we would have caught this issue.) Whether this approach works fine highly depends on the characteristics of the values. (For example, small values at the beginning and large ones later can cause this logic to overrun the maximum size of the page. However, the same can happen if wrong values are used for {{min/maxRowCountForPageSizeCheck}}.) Sure, please create a PR. I am happy to review. > Is there something wrong calculate usedMem in ColumnWriteStoreBase.java > --- > > Key: PARQUET-2073 > URL: https://issues.apache.org/jira/browse/PARQUET-2073 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: JiangYang >Priority: Critical > Attachments: image-2021-08-05-14-37-51-299.png > > > !image-2021-08-05-14-37-51-299.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2073) Is there something wrong calculate usedMem in ColumnWriteStoreBase.java
[ https://issues.apache.org/jira/browse/PARQUET-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393807#comment-17393807 ] Gabor Szadovszky commented on PARQUET-2073: --- So, we are talking about [this line|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L243]. The original line was {code:java} (long) ((float) rows) / usedMem * remainingMem {code} Here both casts apply to {{rows}}, so it is completely fine to remove the {{(float)}} cast. Even the {{(long)}} cast can be removed since all three values are {{long}}. I can see only one case where the two forms can give different results: when the value in {{rows}} loses precision on the downcast to {{float}}. Could you please list the exact inputs for which you got different results? It is another thing that the very original code should have been {code:java} (long) ((double) rows / usedMem * remainingMem) {code} This way you would get more accurate numbers. > Is there something wrong calculate usedMem in ColumnWriteStoreBase.java > --- > > Key: PARQUET-2073 > URL: https://issues.apache.org/jira/browse/PARQUET-2073 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: JiangYang >Priority: Critical > Attachments: image-2021-08-05-14-37-51-299.png > > > !image-2021-08-05-14-37-51-299.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
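The truncation discussed in the PARQUET-2073 comments above can be demonstrated with concrete numbers. Only the two expressions come from the thread; the class and method names below are illustrative. In the original form the casts bind to {{rows}} alone, so {{rows / usedMem}} is still integer long division and truncates to zero whenever usedMem exceeds rows:

```java
// Demonstrates why rowsToFillPage in ColumnWriteStoreBase was always zero
// and how the reordered expression from the thread fixes it.
public class RowEstimateSketch {
    // Original expression: (long)((float) rows) converts rows back to long,
    // then rows / usedMem is long division, yielding 0 for usedMem > rows.
    static long buggyRowsToFillPage(long rows, long usedMem, long remainingMem) {
        return (long) ((float) rows) / usedMem * remainingMem;
    }

    // Proposed fix: multiply first, then divide, all in long arithmetic.
    // (Assumes rows * remainingMem does not overflow a long, which holds for
    // realistic row counts and page sizes.)
    static long fixedRowsToFillPage(long rows, long usedMem, long remainingMem) {
        return rows * remainingMem / usedMem;
    }

    public static void main(String[] args) {
        long rows = 100, usedMem = 1_000_000, remainingMem = 500_000;
        System.out.println(buggyRowsToFillPage(rows, usedMem, remainingMem)); // 0
        System.out.println(fixedRowsToFillPage(rows, usedMem, remainingMem)); // 50
    }
}
```

With 100 rows averaging 10 KB each and 500 KB of page budget left, the fixed form estimates 50 more rows; the original always yields 0, which is why the page-size check stepped by minRowCountForPageSizeCheck every time.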
[jira] [Commented] (PARQUET-2071) Encryption translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393788#comment-17393788 ] Gabor Szadovszky commented on PARQUET-2071: --- I think it is a great idea to skip unnecessary deserialization/serialization steps in such cases. Meanwhile, we already have some tools with a similar approach, like trans-compression or prune columns. What do you think of implementing a more universal tool where you can configure the projection schema and the configuration of the target file? Then the tool can decide which level of deserialization/serialization is required. For example, for trans-compression you need to decompress the pages while for encryption you don't. What do you think? > Encryption translation tool > > > Key: PARQUET-2071 > URL: https://issues.apache.org/jira/browse/PARQUET-2071 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When translating existing data to encryption state, we could develop a tool > like TransCompression to translate the data at page level to encryption state > without reading to record and rewrite. This will speed up the process a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-2070) Replace deprecated syntax in protobuf support
[ https://issues.apache.org/jira/browse/PARQUET-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2070: - Assignee: Svend Vanderveken > Replace deprecated syntax in protobuf support > - > > Key: PARQUET-2070 > URL: https://issues.apache.org/jira/browse/PARQUET-2070 > Project: Parquet > Issue Type: Improvement >Reporter: Svend Vanderveken >Assignee: Svend Vanderveken >Priority: Minor > > This is a trivial change, though at the moment ProtoWriteSupport.java is > producing a human-readable JSON output of the protobuf schema with the > following deprecated syntax: > > {code:java} > TextFormat.printToString(asProto){code} > > Also, the method where this code is present executes one reflection invocation > to get the protobuf descriptor, which is unnecessary since the context from > which it's called already has this descriptor. > => all minor and trivial stuff though well, housekeeping I guess :) > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2070) Replace deprecated syntax in protobuf support
[ https://issues.apache.org/jira/browse/PARQUET-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky resolved PARQUET-2070. --- Resolution: Fixed > Replace deprecated syntax in protobuf support > - > > Key: PARQUET-2070 > URL: https://issues.apache.org/jira/browse/PARQUET-2070 > Project: Parquet > Issue Type: Improvement >Reporter: Svend Vanderveken >Assignee: Svend Vanderveken >Priority: Minor > > This is a trivial change, though at the moment ProtoWriteSupport.java is > producing a human-readable JSON output of the protobuf schema with the > following deprecated syntax: > > {code:java} > TextFormat.printToString(asProto){code} > > Also, the method where this code is present executes one reflection invocation > to get the protobuf descriptor, which is unnecessary since the context from > which it's called already has this descriptor. > => all minor and trivial stuff though well, housekeeping I guess :) > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2065) parquet-cli not working in release 1.12.0
[ https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381910#comment-17381910 ] Gabor Szadovszky commented on PARQUET-2065: --- I've checked this with 1.11.0 and it is reproducible, so it is not a regression in 1.12.0. The problem is that multiple parquet-cli jars are generated in target. One is a slim jar (parquet-cli-1.12.0.jar) and another is a fat jar (parquet-cli-1.12.0-runtime.jar) that contains the avro dependency shaded. If all of these jars are put on the classpath (target/*) things can get mixed up. So, I would suggest using one specific jar file from the listed ones instead of putting all jars from target on the classpath. The other dependency jars are required. For example: {code} java -cp target/parquet-cli-1.12.0.jar:target/dependency/* org.apache.parquet.cli.Main head {code} > parquet-cli not working in release 1.12.0 > - > > Key: PARQUET-2065 > URL: https://issues.apache.org/jira/browse/PARQUET-2065 > Project: Parquet > Issue Type: Bug > Components: parquet-cli >Affects Versions: 1.12.0 >Reporter: Akshay Sundarraj >Priority: Major > > When I run parquet-cli I get java.lang.NoSuchMethodError > Steps to reproduce: > # Download parquet-mr 1.12.0 from > [https://github.com/apache/parquet-mr/archive/refs/tags/apache-parquet-1.12.0.tar.gz] > # Build and install using mvn clean install > # cd parquet-cli > # {{mvn dependency:copy-dependencies}} > # {{java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main head > }} > # Got below exception > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by > org.apache.hadoop.security.authentication.util.KerberosUtil > ([file:/home/amsundar/hgroot/parquet-mr-apache-parquet-1.12.0/parquet-cli/target/dependency/hadoop-auth-2.10.1.jar|file://home/amsundar/hgroot/parquet-mr-apache-parquet-1.12.0/parquet-cli/target/dependency/hadoop-auth-2.10.1.jar]) > to method 
sun.security.krb5.Config.getInstance() > WARNING: Please consider reporting this to the maintainers of > org.apache.hadoop.security.authentication.util.KerberosUtil > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.parquet.avro.AvroSchemaConverter.convert(Lorg/apache/parquet/schema/MessageType;)Lorg/apache/avro/Schema; > at org.apache.parquet.cli.util.Schemas.fromParquet(Schemas.java:89) > at org.apache.parquet.cli.BaseCommand.getAvroSchema(BaseCommand.java:405) > at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66) > at org.apache.parquet.cli.Main.run(Main.java:155) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.parquet.cli.Main.main(Main.java:185) > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2064) Make Range public accessible in RowRanges
[ https://issues.apache.org/jira/browse/PARQUET-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379219#comment-17379219 ] Gabor Szadovszky commented on PARQUET-2064: --- [~sha...@uber.com], sorry if I was misleading. I do agree to make the required classes/methods public if it makes our clients' lives easier. > Make Range public accessible in RowRanges > - > > Key: PARQUET-2064 > URL: https://issues.apache.org/jira/browse/PARQUET-2064 > Project: Parquet > Issue Type: New Feature >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When rolling out to Presto, I found we need to know the boundaries of each > Range in RowRanges. It is still doable with an Iterator, but Presto has a batch > reader; we cannot use an iterator for each row. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2059) Tests require too much memory
[ https://issues.apache.org/jira/browse/PARQUET-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky updated PARQUET-2059: -- Summary: Tests require too much memory (was: Tests require to much memory) > Tests require too much memory > - > > Key: PARQUET-2059 > URL: https://issues.apache.org/jira/browse/PARQUET-2059 > Project: Parquet > Issue Type: Test >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > For testing the solution of PARQUET-1633 we require ~3GB memory that is not > always available. To solve this issue we temporarily disabled the implemented > unit test. > We need to ensure somehow that [this > test|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLargeColumnChunk.java] > (and maybe some other similar ones) are executed regularly. Some options we > might have: > * Execute this test separately with a maven profile. I am not sure if the CI > allows allocating such large memory but with Xmx options we might give a try > and create a separate check for this test only. > * Similar to the previous with the profile but not executing in the CI ever. > Instead, we add some comments to the release doc so this test will be > executed at least once per release. > * Configuring the CI profile to skip this test but have it in the normal > scenario meaning the devs will execute it locally. There are a couple of cons > though. There is no guarantee that devs executes all the tests including this > one. It also can cause issues if the dev doesn't have enough memory and don't > know that the test failure is not related to the current change. -- This message was sent by Atlassian Jira (v8.3.4#803005)