[jira] [Created] (PARQUET-515) Add "Reset" to LevelEncoder and LevelDecoder
Deepak Majeti created PARQUET-515:
-------------------------------------

Summary: Add "Reset" to LevelEncoder and LevelDecoder
Key: PARQUET-515
URL: https://issues.apache.org/jira/browse/PARQUET-515
Project: Parquet
Issue Type: Improvement
Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti

The rle-encoder and rle-decoder classes have a "Reset" method as a quick way to initialize the objects. This method resets the encoder and decoder state to work on a new buffer without the need to create a new object at every DATA PAGE granularity.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
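A minimal sketch of such a Reset on the decoder side (the member names and signature here are assumptions for illustration, not the actual parquet-cpp API):

{code:cpp}
#include <cstdint>

// Reusable level decoder: Reset() re-points the object at the next data
// page's buffer, so no new object is needed per page.
class LevelDecoder {
 public:
  void Reset(const uint8_t* data, int num_bytes, int16_t max_level) {
    data_ = data;
    num_bytes_ = num_bytes;
    max_level_ = max_level;
    offset_ = 0;  // start decoding from the top of the new buffer
  }

 private:
  const uint8_t* data_ = nullptr;
  int num_bytes_ = 0;
  int16_t max_level_ = 0;
  int offset_ = 0;
};
{code}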
[jira] [Comment Edited] (PARQUET-505) Column reader: automatically handle large data pages
[ https://issues.apache.org/jira/browse/PARQUET-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138299#comment-15138299 ]

Deepak Majeti edited comment on PARQUET-505 at 2/9/16 3:30 PM:
---------------------------------------------------------------
[~wesmckinn] I followed the Impala code. I am adding a unit test as well.
--
regards,
Deepak Majeti

was (Author: mdeepak):
[~wesm] I followed the Impala code. I am adding a unit test as well.
--
regards,
Deepak Majeti

> Column reader: automatically handle large data pages
> ----------------------------------------------------
>
> Key: PARQUET-505
> URL: https://issues.apache.org/jira/browse/PARQUET-505
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Deepak Majeti
>
> Currently, we are only supporting data pages whose headers are 64K or less (see {{parquet/column/serialized-page.cc}}). Since page headers can essentially be arbitrarily large (in pathological cases) because of the page statistics, if deserializing the page header fails, we should attempt to read a progressively larger amount of file data in an effort to find the end of the page header.
> As part of this (and to make testing easier!), the maximum data page header size should be configurable. We can write test cases by defining appropriate Statistics structs to yield serialized page headers of whatever desired size.
> On malformed files, we may run past the end of the file; in such cases we should raise a reasonable exception.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
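A rough sketch of the retry loop the issue describes (the helper functions here are hypothetical stand-ins, not the real interfaces in serialized-page.cc):

{code:cpp}
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helpers for illustration:
//  - PeekBytes reads up to `len` bytes at the current file position into
//    `buf` and returns how many bytes were actually available.
//  - TryDeserializePageHeader returns true if a complete Thrift page
//    header was parsed from the buffer.
uint32_t PeekBytes(std::vector<uint8_t>* buf, uint32_t len);
bool TryDeserializePageHeader(const uint8_t* data, uint32_t len);

void ReadPageHeader(uint32_t max_page_header_size /* now configurable */) {
  std::vector<uint8_t> buf;
  uint32_t allowed = 16 * 1024;  // start small instead of a fixed 64K cap
  while (true) {
    uint32_t avail = PeekBytes(&buf, allowed);
    if (TryDeserializePageHeader(buf.data(), avail)) {
      return;  // found the end of the page header
    }
    if (avail < allowed || allowed >= max_page_header_size) {
      // Hit EOF or the configured ceiling: likely a malformed file.
      throw std::runtime_error("Failed to deserialize data page header");
    }
    allowed *= 2;  // progressively larger read, per the issue description
  }
}
{code}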
[jira] [Created] (PARQUET-517) Ensuring aligned memory on encode / decode hot paths for SSE optimizations
Wes McKinney created PARQUET-517:
-------------------------------------

Summary: Ensuring aligned memory on encode / decode hot paths for SSE optimizations
Key: PARQUET-517
URL: https://issues.apache.org/jira/browse/PARQUET-517
Project: Parquet
Issue Type: Improvement
Components: parquet-cpp
Reporter: Wes McKinney

We are using {{std::vector}} in many places for memory allocation; if we want to use SSE on this memory we may run into some problems. A couple of things we should do:
* Add an STL allocator for {{std::vector}} that ensures 16-byte aligned memory
* Check user-provided memory for alignment before utilizing an SSE-accelerated routine (e.g. SSE hash functions for dictionary encoding), and decide whether to copy and use SSE, or skip the copy and use non-SSE code.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
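A sketch of both bullet points, assuming a POSIX target ({{posix_memalign}}); this is illustrative, not necessarily the allocator the project would adopt:

{code:cpp}
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Minimal C++11 allocator that hands out 16-byte-aligned buffers, so the
// vector's data() is safe for aligned SSE loads/stores.
template <typename T>
struct AlignedAllocator16 {
  using value_type = T;

  AlignedAllocator16() = default;
  template <typename U>
  AlignedAllocator16(const AlignedAllocator16<U>&) {}

  T* allocate(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, 16, n * sizeof(T)) != 0) throw std::bad_alloc();
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) { std::free(p); }
};

// Stateless allocators always compare equal.
template <typename T, typename U>
bool operator==(const AlignedAllocator16<T>&, const AlignedAllocator16<U>&) {
  return true;
}
template <typename T, typename U>
bool operator!=(const AlignedAllocator16<T>&, const AlignedAllocator16<U>&) {
  return false;
}

// For the second bullet: check user-provided memory before choosing the
// SSE-accelerated code path.
inline bool IsAligned16(const void* p) {
  return reinterpret_cast<uintptr_t>(p) % 16 == 0;
}

using AlignedByteVector = std::vector<uint8_t, AlignedAllocator16<uint8_t>>;
{code}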
[jira] [Commented] (PARQUET-515) Add "Reset" to LevelEncoder and LevelDecoder
[ https://issues.apache.org/jira/browse/PARQUET-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139843#comment-15139843 ]

Wes McKinney commented on PARQUET-515:
--------------------------------------
Per discussion in https://github.com/apache/parquet-cpp/pull/30, in order for this to be properly tested we will need to construct test cases with multiple data pages.

> Add "Reset" to LevelEncoder and LevelDecoder
> --------------------------------------------
>
> Key: PARQUET-515
> URL: https://issues.apache.org/jira/browse/PARQUET-515
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Deepak Majeti
> Assignee: Deepak Majeti
>
> The rle-encoder and rle-decoder classes have a "Reset" method as a quick way to initialize the objects. This method resets the encoder and decoder state to work on a new buffer without the need to create a new object at every DATA PAGE granularity.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (PARQUET-518) Review usages of size_t and unsigned integers generally per Google style guide
Wes McKinney created PARQUET-518:
-------------------------------------

Summary: Review usages of size_t and unsigned integers generally per Google style guide
Key: PARQUET-518
URL: https://issues.apache.org/jira/browse/PARQUET-518
Project: Parquet
Issue Type: Improvement
Components: parquet-cpp
Reporter: Wes McKinney
Priority: Minor

The Google style guide recommends generally avoiding unsigned integers for the bugs they can silently introduce. https://google.github.io/styleguide/cppguide.html#Integer_Types

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
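The canonical example of the bug class the style guide warns about (illustrative; not code taken from parquet-cpp):

{code:cpp}
#include <cstddef>
#include <vector>

int SumBackwardsBroken(const std::vector<int>& v) {
  int sum = 0;
  // size_t is unsigned: when i reaches 0, --i wraps to SIZE_MAX, so
  // `i >= 0` is always true and the loop reads out of bounds forever.
  for (size_t i = v.size() - 1; i >= 0; --i) sum += v[i];
  return sum;
}

int SumBackwardsFixed(const std::vector<int>& v) {
  int sum = 0;
  // A signed index makes the termination condition mean what it says.
  for (int i = static_cast<int>(v.size()) - 1; i >= 0; --i) sum += v[i];
  return sum;
}
{code}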
[jira] [Commented] (PARQUET-511) Integer overflow on counting values in column
[ https://issues.apache.org/jira/browse/PARQUET-511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140121#comment-15140121 ]

Ryan Blue commented on PARQUET-511:
-----------------------------------
Thanks [~goreckim]! I'll have a look soon. I know we've also considered a maximum number of records to add to a row group for cases where compression is good enough that you never hit the max. That would work as a fix as well (if that's not what you did).

> Integer overflow on counting values in column
> ---------------------------------------------
>
> Key: PARQUET-511
> URL: https://issues.apache.org/jira/browse/PARQUET-511
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.8.1
> Reporter: Michal Gorecki
> Assignee: Michal Gorecki
> Priority: Critical
>
> Parquet will ignore a column if the combined number of elements in the column is larger than the size of an int.
> The issue is that as the column reader is initialized and the rep and def levels are initialized per column, the size of the integer will overflow, causing these values to not be set properly. Then, during read, the level will not match the current level of the reader, and a null value will be provided. Since there is no overflow checking, no exception is thrown, and it appears that the data is corrupted.
> This happened to us with a fairly complex schema, with an array of maps, which contained arrays as well. There were over 4 billion values in all column pages in one row group, which is what triggered the overflow.
> Relevant stack trace:
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 172310 in block 0 in file
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
> at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
> ...
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: INT64 at value 95584934 out of 95530352, 130598 out of 130598 in currentPage. repetition level: 0, definition level: 2
> at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:484)
> at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
> at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
> ... 18 more
> Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
> at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
> at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
> at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong(DictionaryValuesReader.java:121)
> at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:263)
> at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
> ... 21 more

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
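The wraparound mechanism itself, sketched in C++ for illustration (the actual bug lives in parquet-mr's Java column reader, where {{int}} arithmetic wraps the same way):

{code:cpp}
#include <cstdint>
#include <iostream>

int main() {
  // More than 2^32 values across all column pages in one row group:
  int64_t total_value_count = 4300000000LL;
  // Tracking that count in a 32-bit int silently truncates it:
  int32_t wrapped = static_cast<int32_t>(total_value_count);
  std::cout << wrapped << std::endl;  // prints 5032704, not 4300000000
  // Bounds checks against the wrapped count then disagree with the real
  // stream position, so decoding either runs past the RLE/BitPacking
  // stream (as in the stack trace above) or silently returns nulls.
  return 0;
}
{code}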
[jira] [Created] (PARQUET-519) Disable compiler warning suppressions and fix compiler warnings with -O3
Wes McKinney created PARQUET-519:
-------------------------------------

Summary: Disable compiler warning suppressions and fix compiler warnings with -O3
Key: PARQUET-519
URL: https://issues.apache.org/jira/browse/PARQUET-519
Project: Parquet
Issue Type: Improvement
Components: parquet-cpp
Reporter: Wes McKinney

Related to PARQUET-447

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PARQUET-447) Add Debug and Release build types and associated compiler flags
[ https://issues.apache.org/jira/browse/PARQUET-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140131#comment-15140131 ]

Wes McKinney commented on PARQUET-447:
--------------------------------------
See https://github.com/apache/parquet-cpp/pull/45.

> Add Debug and Release build types and associated compiler flags
> ----------------------------------------------------------------
>
> Key: PARQUET-447
> URL: https://issues.apache.org/jira/browse/PARQUET-447
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PARQUET-470) Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux
[ https://issues.apache.org/jira/browse/PARQUET-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139991#comment-15139991 ]

Wes McKinney commented on PARQUET-470:
--------------------------------------
For the time being, Thrift 0.9.0 seems to work reliably on Linux.

> Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux
> ------------------------------------------------------------------------------
>
> Key: PARQUET-470
> URL: https://issues.apache.org/jira/browse/PARQUET-470
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Wes McKinney
>
> Thrift 0.9.3 introduces a {{#include }} include which causes {{tr1/functional}} to be included, causing a compiler conflict with googletest, which has its own portability macros surrounding its use of {{std::tr1::tuple}}. I spent a bunch of time twiddling compiler flags to try to resolve this conflict, but wasn't able to figure it out.
> If this is a Thrift bug, we should report it to Thrift. If it's fixable by compiler flags, then we should figure that out and track the issue here; otherwise users with the latest version of Thrift will be unable to compile the parquet-cpp test suite.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
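One compiler-flag direction that could be tried (the macros below are real googletest portability knobs, but whether they resolve this particular Thrift 0.9.3 clash is exactly the question the issue leaves open):

{code:cpp}
// Define before including gtest so googletest avoids std::tr1::tuple
// entirely; untested against Thrift 0.9.3 here.
#define GTEST_HAS_TR1_TUPLE 0
#define GTEST_USE_OWN_TR1_TUPLE 0
#include <gtest/gtest.h>

TEST(Sanity, Compiles) { EXPECT_EQ(1, 1); }
{code}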