[jira] [Created] (PARQUET-515) Add "Reset" to LevelEncoder and LevelDecoder

2016-02-09 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-515:
-

 Summary: Add "Reset" to LevelEncoder and LevelDecoder
 Key: PARQUET-515
 URL: https://issues.apache.org/jira/browse/PARQUET-515
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti


The rle-encoder and rle-decoder classes have a "Reset" method as a quick way to 
re-initialize the objects. This method resets the encoder and decoder state to work 
on a new buffer without the need to create a new object at every DATA PAGE 
granularity.
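
A minimal sketch of what such a Reset could look like (the class layout and the 
signature below are assumptions for illustration, not the actual parquet-cpp API):

{code:cpp}
#include <cstdint>

class LevelDecoder {
 public:
  // Re-point the decoder at a new data page buffer instead of constructing a
  // fresh object for every page; clears any state left over from the last page.
  void Reset(int16_t max_level, const uint8_t* buffer, int buffer_len) {
    max_level_ = max_level;
    buffer_ = buffer;
    buffer_len_ = buffer_len;
    offset_ = 0;
  }

 private:
  int16_t max_level_ = 0;
  const uint8_t* buffer_ = nullptr;
  int buffer_len_ = 0;
  int offset_ = 0;
};
{code}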



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (PARQUET-505) Column reader: automatically handle large data pages

2016-02-09 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138299#comment-15138299
 ] 

Deepak Majeti edited comment on PARQUET-505 at 2/9/16 3:30 PM:
---

[~wesmckinn] I followed the Impala code.  I am adding a unit test as well.




-- 
regards,
Deepak Majeti



was (Author: mdeepak):
[~wesm] I followed the Impala code.  I am adding a unit test as well.




-- 
regards,
Deepak Majeti


> Column reader: automatically handle large data pages
> 
>
> Key: PARQUET-505
> URL: https://issues.apache.org/jira/browse/PARQUET-505
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
>
> Currently, we only support data pages whose headers are 64K or less 
> (see {{parquet/column/serialized-page.cc}}). Since page headers can 
> essentially be arbitrarily large (in pathological cases) because of the page 
> statistics, if deserializing the page header fails, we should attempt to read 
> a progressively larger amount of file data in an effort to find the end of the 
> page header. 
> As part of this (and to make testing easier!), the maximum data page header 
> size should be configurable. We can write test cases by defining appropriate 
> Statistics structs to yield serialized page headers of whatever size we want.
> On malformed files, we may run past the end of the file; in such cases we 
> should raise a reasonable exception. 
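
A rough sketch of the retry loop described above (the helper functions and 
names are placeholders, not the real parquet-cpp I/O or Thrift calls):

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Placeholder for reading nbytes of the file starting at 'offset'.
static void ReadAt(int64_t /*offset*/, int64_t /*nbytes*/, uint8_t* /*out*/) {}

// Placeholder for the Thrift page-header deserialization; the real call fails
// when the buffer does not hold a complete header. This stub always succeeds.
static bool TryDeserializeHeader(const uint8_t* /*buf*/, int64_t len) { return len > 0; }

// Retry with a progressively larger read until the header deserializes, up to
// a configurable maximum and never past the end of the file.
void ReadPageHeader(int64_t offset, int64_t file_size, int64_t max_header_size) {
  int64_t allowed = std::min(file_size - offset, max_header_size);
  int64_t nbytes = std::min<int64_t>(64 * 1024, allowed);
  while (true) {
    std::vector<uint8_t> buf(nbytes);
    ReadAt(offset, nbytes, buf.data());
    if (TryDeserializeHeader(buf.data(), nbytes)) return;
    if (nbytes == allowed) {
      // Either the header exceeds the configured maximum or the file is
      // malformed and we would run past the end of the file.
      throw std::runtime_error("Unable to deserialize data page header");
    }
    nbytes = std::min(nbytes * 2, allowed);  // read more and try again
  }
}
{code}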



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-517) Ensuring aligned memory on encode / decode hot paths for SSE optimizations

2016-02-09 Thread Wes McKinney (JIRA)
Wes McKinney created PARQUET-517:


 Summary: Ensuring aligned memory on encode / decode hot paths for 
SSE optimizations
 Key: PARQUET-517
 URL: https://issues.apache.org/jira/browse/PARQUET-517
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney


We are using {{std::vector}} in many places for memory allocation; if we want 
to use SSE on this memory we may run into some problems.

A couple of things we should do:

* Add an STL allocator for {{std::vector}} that ensures 16-byte-aligned memory 
(sketched below)
* Check user-provided memory for alignment before calling an SSE-accelerated 
routine (e.g. SSE hash functions for dictionary encoding) and decide whether to 
copy into aligned memory and use SSE, or skip the copy and fall back to non-SSE code.
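
A minimal sketch of such an aligned allocator and alignment check (illustration 
only, assuming a POSIX platform for {{posix_memalign}}; not necessarily the 
allocator parquet-cpp would ship):

{code:cpp}
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Minimal C++11 allocator returning 16-byte-aligned memory.
template <typename T>
struct Aligned16Allocator {
  using value_type = T;

  Aligned16Allocator() = default;
  template <typename U>
  Aligned16Allocator(const Aligned16Allocator<U>&) {}

  T* allocate(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, 16, n * sizeof(T)) != 0) throw std::bad_alloc();
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) { free(p); }
};

template <typename T, typename U>
bool operator==(const Aligned16Allocator<T>&, const Aligned16Allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const Aligned16Allocator<T>&, const Aligned16Allocator<U>&) { return false; }

// Check a user-provided buffer before choosing the SSE path.
inline bool IsAligned16(const void* p) {
  return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}

// Usage: std::vector<uint8_t, Aligned16Allocator<uint8_t>> buffer(4096);
{code}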



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-515) Add "Reset" to LevelEncoder and LevelDecoder

2016-02-09 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139843#comment-15139843
 ] 

Wes McKinney commented on PARQUET-515:
--

Per discussion in https://github.com/apache/parquet-cpp/pull/30, in order for 
this to be properly tested we will need to construct test cases with multiple 
data pages.

> Add "Reset" to LevelEncoder and LevelDecoder
> 
>
> Key: PARQUET-515
> URL: https://issues.apache.org/jira/browse/PARQUET-515
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>
> The rle-encoder and rle-decoder classes have a "Reset" method as a quick way 
> to re-initialize the objects. This method resets the encoder and decoder state to 
> work on a new buffer without the need to create a new object at every DATA 
> PAGE granularity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-518) Review usages of size_t and unsigned integers generally per Google style guide

2016-02-09 Thread Wes McKinney (JIRA)
Wes McKinney created PARQUET-518:


 Summary: Review usages of size_t and unsigned integers generally 
per Google style guide
 Key: PARQUET-518
 URL: https://issues.apache.org/jira/browse/PARQUET-518
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney
Priority: Minor


The Google style guide generally recommends avoiding unsigned integers because 
of the bugs they can silently introduce. 

https://google.github.io/styleguide/cppguide.html#Integer_Types
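
As an illustration of the kind of silent bug meant here (not code from 
parquet-cpp), a reverse loop over an unsigned index never terminates because 
the counter wraps around instead of going negative:

{code:cpp}
#include <cstddef>
#include <vector>

void Example(const std::vector<int>& values) {
  // Intended to walk the vector backwards, but 'i' is unsigned: when i == 0,
  // --i wraps to SIZE_MAX, so 'i >= 0' is always true and the loop never ends
  // (and reads out of bounds). It also misbehaves when 'values' is empty,
  // since values.size() - 1 wraps too. A signed index terminates correctly.
  for (std::size_t i = values.size() - 1; i >= 0; --i) {
    // ... use values[i] ...
  }
}
{code}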




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-511) Integer overflow on counting values in column

2016-02-09 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140121#comment-15140121
 ] 

Ryan Blue commented on PARQUET-511:
---

Thanks [~goreckim]! I'll have a look soon. I know we've also considered a 
maximum number of records to add to a row group for cases where compression is 
good enough that you never hit the max. That would work for a fix as well (if 
that's not what you did).
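
A rough sketch of that mitigation (illustration only; the names and the record 
cap are assumptions, not parquet-mr's actual writer logic):

{code:cpp}
#include <cstdint>

struct RowGroupWriterState {
  int64_t buffered_bytes = 0;
  int64_t buffered_records = 0;
};

// Flush when either the size target is reached or, for highly compressible
// data that never hits the byte target, when a record-count cap is reached,
// keeping per-column value counts safely below the 32-bit overflow point.
bool ShouldFlushRowGroup(const RowGroupWriterState& s,
                         int64_t target_bytes, int64_t max_records) {
  return s.buffered_bytes >= target_bytes || s.buffered_records >= max_records;
}
{code}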

> Integer overflow on counting values in column
> -
>
> Key: PARQUET-511
> URL: https://issues.apache.org/jira/browse/PARQUET-511
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Michal Gorecki
>Assignee: Michal Gorecki
>Priority: Critical
>
> Parquet will ignore a column if the combined number of elements in the column 
> is larger than the maximum value of an int.
> The issue is that when the column reader is initialized and the rep and def 
> levels are initialized per column, the integer holding the value count will 
> overflow, causing these values not to be set properly. Then, during read, the 
> level will not match the current level of the reader, and a null value will be 
> provided. Since there is no overflow checking, no exception is thrown, and it 
> appears that the data is corrupted.
> This happened to us with a fairly complex schema, with an array of maps, 
> which contained arrays as well. There were over 4 billion values across all 
> column pages in one row group, which is what triggered the overflow.
> Relevant stack trace
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 172310 
> in block 0 in file 
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>...
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by:   INT64 at value 95584934 out of 95530352, 130598 
> out of 130598 in currentPage. repetition level: 0, definition level: 2
> at 
> org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:484)
> at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
> at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
> ... 18 more
> Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking 
> stream.
> at 
> org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
> at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
> at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
> at 
> org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong(DictionaryValuesReader.java:121)
> at 
> org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:263)
> at 
> org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
> ... 21 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-519) Disable compiler warning suppressions and fix compiler warnings with -O3

2016-02-09 Thread Wes McKinney (JIRA)
Wes McKinney created PARQUET-519:


 Summary: Disable compiler warning suppressions and fix compiler 
warnings with -O3
 Key: PARQUET-519
 URL: https://issues.apache.org/jira/browse/PARQUET-519
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney


Related to PARQUET-447



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-447) Add Debug and Release build types and associated compiler flags

2016-02-09 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140131#comment-15140131
 ] 

Wes McKinney commented on PARQUET-447:
--

See https://github.com/apache/parquet-cpp/pull/45.

> Add Debug and Release build types and associated compiler flags
> ---
>
> Key: PARQUET-447
> URL: https://issues.apache.org/jira/browse/PARQUET-447
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-470) Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux

2016-02-09 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139991#comment-15139991
 ] 

Wes McKinney commented on PARQUET-470:
--

For the time being, Thrift 0.9.0 seems to work reliably on Linux. 

> Thrift 0.9.3 cannot be used in conjunction with googletest and C++11 on Linux
> -
>
> Key: PARQUET-470
> URL: https://issues.apache.org/jira/browse/PARQUET-470
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> Thrift 0.9.3 introduces a {{#include }} directive which 
> causes {{tr1/functional}} to be included, creating a compiler conflict with 
> googletest, which has its own portability macros surrounding its use of 
> {{std::tr1::tuple}}. I spent a bunch of time twiddling compiler flags to try 
> to resolve this conflict, but wasn't able to figure it out. 
> If this is a Thrift bug, we should report it to Thrift. If it's fixable by 
> compiler flags, then we should figure that out and track the issue here; 
> otherwise users with the latest version of Thrift will be unable to compile 
> the parquet-cpp test suite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)