[jira] [Updated] (PARQUET-1523) [C++] Vectorize comparator interface

2019-02-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1523:

Labels: pull-request-available  (was: )

> [C++] Vectorize comparator interface
> 
>
> Key: PARQUET-1523
> URL: https://issues.apache.org/jira/browse/PARQUET-1523
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>
> The {{parquet::Comparator}} interface yields scalar virtual calls in the 
> innermost loop. In addition to removing the usage of 
> {{PARQUET_TEMPLATE_EXPORT}}, as with other recent patches, I propose 
> refactoring to a vector-based comparison that updates the minimum and 
> maximum elements in a single virtual call.
> cc [~mdeepak] [~xhochy]
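
A conceptual sketch of what such a batched interface could look like. The issue concerns the C++ {{parquet::Comparator}}, so Java is used here only to illustrate the shape of the API; the names below are hypothetical rather than taken from the codebase:

{code:java}
// Hypothetical batched min/max interface: instead of one virtual call per element,
// the running bounds are updated with a single virtual call per value buffer.
interface BatchedLongComparator {
  /** Scans values[offset, offset + length) once and updates minMax[0]/minMax[1] in place. */
  void updateMinMax(long[] values, int offset, int length, long[] minMax);
}

class DefaultLongComparator implements BatchedLongComparator {
  @Override
  public void updateMinMax(long[] values, int offset, int length, long[] minMax) {
    long min = minMax[0];
    long max = minMax[1];
    for (int i = offset; i < offset + length; i++) {
      min = Math.min(min, values[i]);
      max = Math.max(max, values[i]);
    }
    minMax[0] = min;
    minMax[1] = max;
  }
}
{code}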



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Overridden methods of dictionary

2019-02-25 Thread Swapnil Chougule
Hi Folks,

The abstract class Dictionary contains these methods:
public Binary decodeToBinary(int id)
public int decodeToInt(int id)
public long decodeToLong(int id)
public float decodeToFloat(int id)
public double decodeToDouble(int id)
public boolean decodeToBoolean(int id)

These are subsequently overridden in the respective dictionary implementations,
for example:

PlainLongDictionary overrides only the "decodeToLong" method,
PlainIntegerDictionary overrides only the "decodeToInt" method,
and so on.

Can we support type upcasting here? For example:
PlainLongDictionary would override the "decodeToLong" and "decodeToDouble" methods,
PlainIntegerDictionary would override the "decodeToInt", "decodeToLong" and
"decodeToDouble" methods (a sketch follows below).

Type upcasting is a valid use case.
It also needs some changes in ValidTypeMap.java and
SchemaCompatibilityValidator.java for filter predicates.
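
A minimal, self-contained sketch of the idea (these are not the actual parquet-mr
classes; the class and field names below are made up for illustration):

{code:java}
// Simplified stand-in for the Dictionary abstract class: every decodeToX method
// throws by default and concrete dictionaries override only the types they support.
abstract class Dictionary {
  public long decodeToLong(int id) { throw new UnsupportedOperationException(); }
  public double decodeToDouble(int id) { throw new UnsupportedOperationException(); }
}

// A long-backed dictionary that also serves double lookups by widening the stored longs.
class UpcastingLongDictionary extends Dictionary {
  private final long[] values;

  UpcastingLongDictionary(long[] values) {
    this.values = values;
  }

  @Override
  public long decodeToLong(int id) {
    return values[id];
  }

  // Proposed upcasting override: a long can always be widened to a double.
  @Override
  public double decodeToDouble(int id) {
    return (double) decodeToLong(id);
  }
}
{code}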

Can Parquet support this type upcasting feature? I came across such a
scenario in one of my use cases.

Thanks,
Swapnil


Re: Column index testing break down

2019-02-25 Thread Anna Szonyi
Hi dev@,

After a week off, this week we have an excerpt from our internal data
interoperability testing, which tests compatibility between Hive, Spark and
Impala over Avro and Parquet. These test cases are tailor-made to test
specific layouts so that files written using parquet-mr can be read by any
of the above-mentioned components. We have also checked fault-injection
cases.

The test suite is currently private; however, we have made the test classes
corresponding to the following document public:
https://docs.google.com/document/d/1mHYQGXE4oM1zgg83MMc4ho1gmoJMeZcq9MWG99WgL3A

Please find the test cases and their results here:
https://github.com/zivanfi/column-indexes-data-interop-tests-excerpts

Best,
Anna



On Mon, Feb 11, 2019 at 4:57 PM Anna Szonyi  wrote:

> Hi dev@,
>
> Last week we had a twofer: an e2e tool and an integration test validating the
> contract of column indexes/indices (whether all values are between min and max
> and, if set, whether the boundary order is correct). There are some takeaways
> and corrections to be made to the former (like the max->min typo) - thanks
> for the feedback on that!
>
> The next installment is also an integration test that tests the filtering
> logic on files including simple and special cases (user defined function,
> complex filtering, no filtering, etc.).
>
>
> https://github.com/apache/parquet-mr/blob/e7db9e20f52c925a207ea62d6dda6dc4e870294e/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestColumnIndexFiltering.java
>
> Please let me know if you have any questions/comments.
>
> Best,
> Anna
>
>
>
>
>


[jira] [Updated] (PARQUET-1531) Page row count limit causes empty pages to be written from MessageColumnIO

2019-02-25 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1531:
--
Fix Version/s: 1.11.0

> Page row count limit causes empty pages to be written from MessageColumnIO
> --
>
> Key: PARQUET-1531
> URL: https://issues.apache.org/jira/browse/PARQUET-1531
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Matt Cheah
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> This originally manifested as 
> https://issues.apache.org/jira/browse/SPARK-26874 but we realized that this 
> is fundamentally an issue in the way PARQUET-1414's solution interacts with 
> {{MessageColumnIO}}; Spark is just one user of that API.
> In {{MessageColumnIO#endMessage()}}, we first examine whether any fields are 
> missing and fill in the values with null in 
> {{MessageColumnIO#writeNullForMissingFieldsAtCurrentLevel}}. However, this 
> method might not actually write any nulls to the underlying page: 
> {{MessageColumnIO}} can buffer nulls in memory and flush them to the page 
> store lazily.
> Regardless of whether or not nulls are flushed to the page store, 
> {{MessageColumnIO#endMessage}} always calls {{columns#endRecord()}}, which 
> signals to the {{ColumnWriteStore}} that a record was written. At that 
> point, the write store increments the row count for the current page by 1 
> and then checks whether the page needs to be flushed due to hitting the 
> page row count limit.
> The problem is that with the above writing scheme, {{MessageColumnIO}} can 
> cause empty pages to be written to Parquet files, and empty pages are not 
> readable by Parquet readers. Suppose the page row count limit is N and 
> {{MessageColumnIO}} receives N nulls for a column. {{MessageColumnIO}} 
> buffers the nulls in memory and does not necessarily flush them to the 
> writer yet. On the Nth call to {{endMessage()}}, however, the column store 
> will think there are N values in memory and that the page has hit the row 
> count limit, even though no rows have actually been written at all, and the 
> underlying page writer will write an empty page regardless.
> To illustrate the problem, one can run this simple example inserted into 
> Spark's {{ParquetIOSuite}} once Spark has been upgraded to use the master 
> branch of Parquet. Attach a debugger to {{MessageColumnIO#endMessage()}} and 
> trace the logic accordingly - the column writer will push a page with 0 
> values:
> {code:java}
> test("PARQUET-1414 Problems") {
>   // Manually adjust the maximum row count to reproduce the issue on small data
>   sparkContext.hadoopConfiguration.set("parquet.page.row.count.limit", "1")
>   withTempPath { location =>
>     val path = new Path(location.getCanonicalPath + "/parquet-data")
>     val schema = StructType(
>       Array(StructField("timestamps1", ArrayType(TimestampType))))
>     val rows = ListBuffer[Row]()
>     for (j <- 0 until 10) {
>       rows += Row(null.asInstanceOf[Array[java.sql.Timestamp]])
>     }
>     val srcDf = spark.createDataFrame(
>       sparkContext.parallelize(rows, 3),
>       schema,
>       true)
>     srcDf.write.parquet(path.toString)
>     assert(spark.read.parquet(path.toString).collect.size > 0)
>   }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1531) Page row count limit causes empty pages to be written from MessageColumnIO

2019-02-25 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1531:
--
Affects Version/s: 1.11.0

> Page row count limit causes empty pages to be written from MessageColumnIO
> --
>
> Key: PARQUET-1531
> URL: https://issues.apache.org/jira/browse/PARQUET-1531
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Matt Cheah
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> This originally manifested as 
> https://issues.apache.org/jira/browse/SPARK-26874 but we realized that this 
> is fundamentally an issue in the way PARQUET-1414's solution interacts with 
> {{MessageColumnIO}}; Spark is just one user of that API.
> In {{MessageColumnIO#endMessage()}}, we first examine whether any fields are 
> missing and fill in the values with null in 
> {{MessageColumnIO#writeNullForMissingFieldsAtCurrentLevel}}. However, this 
> method might not actually write any nulls to the underlying page: 
> {{MessageColumnIO}} can buffer nulls in memory and flush them to the page 
> store lazily.
> Regardless of whether or not nulls are flushed to the page store, 
> {{MessageColumnIO#endMessage}} always calls {{columns#endRecord()}}, which 
> signals to the {{ColumnWriteStore}} that a record was written. At that 
> point, the write store increments the row count for the current page by 1 
> and then checks whether the page needs to be flushed due to hitting the 
> page row count limit.
> The problem is that with the above writing scheme, {{MessageColumnIO}} can 
> cause empty pages to be written to Parquet files, and empty pages are not 
> readable by Parquet readers. Suppose the page row count limit is N and 
> {{MessageColumnIO}} receives N nulls for a column. {{MessageColumnIO}} 
> buffers the nulls in memory and does not necessarily flush them to the 
> writer yet. On the Nth call to {{endMessage()}}, however, the column store 
> will think there are N values in memory and that the page has hit the row 
> count limit, even though no rows have actually been written at all, and the 
> underlying page writer will write an empty page regardless.
> To illustrate the problem, one can run this simple example inserted into 
> Spark's {{ParquetIOSuite}} once Spark has been upgraded to use the master 
> branch of Parquet. Attach a debugger to {{MessageColumnIO#endMessage()}} and 
> trace the logic accordingly - the column writer will push a page with 0 
> values:
> {code:java}
> test("PARQUET-1414 Problems") {
>   // Manually adjust the maximum row count to reproduce the issue on small data
>   sparkContext.hadoopConfiguration.set("parquet.page.row.count.limit", "1")
>   withTempPath { location =>
>     val path = new Path(location.getCanonicalPath + "/parquet-data")
>     val schema = StructType(
>       Array(StructField("timestamps1", ArrayType(TimestampType))))
>     val rows = ListBuffer[Row]()
>     for (j <- 0 until 10) {
>       rows += Row(null.asInstanceOf[Array[java.sql.Timestamp]])
>     }
>     val srcDf = spark.createDataFrame(
>       sparkContext.parallelize(rows, 3),
>       schema,
>       true)
>     srcDf.write.parquet(path.toString)
>     assert(spark.read.parquet(path.toString).collect.size > 0)
>   }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1540) [C++] Set shared library version for linux and mac builds

2019-02-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1540:

Labels: pull-request-available  (was: )

> [C++] Set shared library version for linux and mac builds
> -
>
> Key: PARQUET-1540
> URL: https://issues.apache.org/jira/browse/PARQUET-1540
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
>
> It looks like this was previously implemented when parquet-cpp was managed as 
> a separate repo (PARQUET-935). It would be good to add this back now that 
> parquet-cpp has been incorporated into the Arrow project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1540) [C++] Set shared library version for linux and mac builds

2019-02-25 Thread Hatem Helal (JIRA)
Hatem Helal created PARQUET-1540:


 Summary: [C++] Set shared library version for linux and mac builds
 Key: PARQUET-1540
 URL: https://issues.apache.org/jira/browse/PARQUET-1540
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Hatem Helal
Assignee: Hatem Helal


It looks like this was previously implemented when parquet-cpp was managed as a 
separate repo (PARQUET-935). It would be good to add this back now that 
parquet-cpp has been incorporated into the Arrow project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (PARQUET-1381) Add merge blocks command to parquet-tools

2019-02-25 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reopened PARQUET-1381:
---

Re-opening this issue because the change was reverted from main.

> Add merge blocks command to parquet-tools
> -
>
> Key: PARQUET-1381
> URL: https://issues.apache.org/jira/browse/PARQUET-1381
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Ekaterina Galieva
>Assignee: Ekaterina Galieva
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> The current implementation of the merge command in parquet-tools doesn't 
> merge row groups; it just places one after the other. Add an API and a 
> command option to be able to merge small blocks into larger ones up to a 
> specified size limit.
> h6. Implementation details:
> Blocks are not reordered, so as not to break possible initial predicate 
> pushdown optimizations.
> Blocks are not divided to fit the upper bound perfectly. 
> This is an intentional performance optimization: it gives an opportunity to 
> form new blocks by copying the full content of smaller blocks by column, not 
> by row.
> h6. Examples:
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [128 | 40], [120]{code}
> Expected output file block sizes:
> {{merge }}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b -l 256 }}
> {code:java}
> [163 | 168 | 120]
> {code}
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [40], [120], [6] {code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 40 | 120 | 6] 
> {code}
> {{merge -b}}
> {code:java}
> [128 | 75 | 126] 
> {code}
> {{merge -b -l 256}}
> {code:java}
> [203 | 126]{code}
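
For illustration only, here is a small sketch of the greedy grouping rule these examples imply (this is not the parquet-tools implementation; the class and method names are made up):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Consecutive blocks are merged greedily until adding the next block would exceed
// the size limit. Blocks are never split, so a single block larger than the limit
// is kept as-is.
public class BlockMergePlan {
  public static List<Long> plan(List<Long> blockSizes, long limit) {
    List<Long> merged = new ArrayList<>();
    long current = 0;
    for (long size : blockSizes) {
      if (current > 0 && current + size > limit) {
        merged.add(current);
        current = 0;
      }
      current += size;
    }
    if (current > 0) {
      merged.add(current);
    }
    return merged;
  }

  public static void main(String[] args) {
    // Mirrors the second example above: [128 | 35], [40], [120], [6] with -l 256 -> [203 | 126]
    System.out.println(plan(List.of(128L, 35L, 40L, 120L, 6L), 256));
  }
}
{code}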



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1381) Add merge blocks command to parquet-tools

2019-02-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776835#comment-16776835
 ] 

ASF GitHub Bot commented on PARQUET-1381:
-

gszadovszky commented on pull request #621: Revert "PARQUET-1381: Add merge 
blocks command to parquet-tools (#512)"
URL: https://github.com/apache/parquet-mr/pull/621
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add merge blocks command to parquet-tools
> -
>
> Key: PARQUET-1381
> URL: https://issues.apache.org/jira/browse/PARQUET-1381
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Ekaterina Galieva
>Assignee: Ekaterina Galieva
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> The current implementation of the merge command in parquet-tools doesn't 
> merge row groups; it just places one after the other. Add an API and a 
> command option to be able to merge small blocks into larger ones up to a 
> specified size limit.
> h6. Implementation details:
> Blocks are not reordered, so as not to break possible initial predicate 
> pushdown optimizations.
> Blocks are not divided to fit the upper bound perfectly. 
> This is an intentional performance optimization: it gives an opportunity to 
> form new blocks by copying the full content of smaller blocks by column, not 
> by row.
> h6. Examples:
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [128 | 40], [120]{code}
> Expected output file block sizes:
> {{merge }}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b -l 256 }}
> {code:java}
> [163 | 168 | 120]
> {code}
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [40], [120], [6] {code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 40 | 120 | 6] 
> {code}
> {{merge -b}}
> {code:java}
> [128 | 75 | 126] 
> {code}
> {{merge -b -l 256}}
> {code:java}
> [203 | 126]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1381) Add merge blocks command to parquet-tools

2019-02-25 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1381:
--
Fix Version/s: (was: 1.11.0)

> Add merge blocks command to parquet-tools
> -
>
> Key: PARQUET-1381
> URL: https://issues.apache.org/jira/browse/PARQUET-1381
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Ekaterina Galieva
>Assignee: Ekaterina Galieva
>Priority: Major
>  Labels: pull-request-available
>
> The current implementation of the merge command in parquet-tools doesn't 
> merge row groups; it just places one after the other. Add an API and a 
> command option to be able to merge small blocks into larger ones up to a 
> specified size limit.
> h6. Implementation details:
> Blocks are not reordered, so as not to break possible initial predicate 
> pushdown optimizations.
> Blocks are not divided to fit the upper bound perfectly. 
> This is an intentional performance optimization: it gives an opportunity to 
> form new blocks by copying the full content of smaller blocks by column, not 
> by row.
> h6. Examples:
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [128 | 40], [120]{code}
> Expected output file block sizes:
> {{merge }}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b -l 256 }}
> {code:java}
> [163 | 168 | 120]
> {code}
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [40], [120], [6] {code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 40 | 120 | 6] 
> {code}
> {{merge -b}}
> {code:java}
> [128 | 75 | 126] 
> {code}
> {{merge -b -l 256}}
> {code:java}
> [203 | 126]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1533) TestSnappy() throws OOM exception with Parquet-1485 change

2019-02-25 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1533.
---
Resolution: Fixed

> TestSnappy() throws OOM exception with Parquet-1485 change 
> ---
>
> Key: PARQUET-1533
> URL: https://issues.apache.org/jira/browse/PARQUET-1533
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
> Environment: Mac OS 10.14.1
>Reporter: Xinli Shang
>Assignee: Gabor Szadovszky
>Priority: Minor
>  Labels: pull-request-available
>
> PARQUET-1485 changed the initial buffer sizes (inputBuffer and outputBuffer) 
> from 0 to 128M in total. This causes the unit test TestSnappy() to fail with 
> an OOM exception. This is on my Mac laptop. 
> To fix the unit test failure, we can increase -Xmx from 512m to 1024m as 
> shown below. However, we need to evaluate whether the increased initial 
> direct memory usage for inputBuffer and outputBuffer will cause OOMs in real 
> Parquet applications that do not have a big enough -Xmx setting. 
> {code:xml}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-surefire-plugin</artifactId>
>   ...
>   <argLine>-Xmx1014m</argLine>
>   ...
> </plugin>
> {code}
> For details of the exception, see the commit page 
> ([https://github.com/apache/parquet-mr/commit/7dcdcdcf0eb5e91618c443d4a84973bf7883d79b]). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1320) Fast clean unused direct memory when decompress

2019-02-25 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1320.
---
Resolution: Duplicate

> Fast clean unused direct memory when decompress
> ---
>
> Key: PARQUET-1320
> URL: https://issues.apache.org/jira/browse/PARQUET-1320
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: zhoukang
>Priority: Major
>  Labels: pull-request-available
>
> When using *NonBlockedDecompressorStream*, which calls
> *SnappyDecompressor.setInput*:
> {code:java}
> public synchronized void setInput(byte[] buffer, int off, int len) {
>   SnappyUtil.validateBuffer(buffer, off, len);
>   if (inputBuffer.capacity() - inputBuffer.position() < len) {
>     ByteBuffer newBuffer = ByteBuffer.allocateDirect(inputBuffer.position() + len);
>     inputBuffer.rewind();
>     newBuffer.put(inputBuffer);
>     inputBuffer = newBuffer;
>   } else {
>     inputBuffer.limit(inputBuffer.position() + len);
>   }
>   inputBuffer.put(buffer, off, len);
> }
> {code}
> If we do not get a full GC of the old generation, we may fail with an off-heap memory leak.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1533) TestSnappy() throws OOM exception with Parquet-1485 change

2019-02-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776811#comment-16776811
 ] 

ASF GitHub Bot commented on PARQUET-1533:
-

gszadovszky commented on pull request #622: PARQUET-1533: TestSnappy() throws 
OOM exception with Parquet-1485 change
URL: https://github.com/apache/parquet-mr/pull/622
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> TestSnappy() throws OOM exception with Parquet-1485 change 
> ---
>
> Key: PARQUET-1533
> URL: https://issues.apache.org/jira/browse/PARQUET-1533
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
> Environment: Mac OS 10.14.1
>Reporter: Xinli Shang
>Assignee: Gabor Szadovszky
>Priority: Minor
>  Labels: pull-request-available
>
> PARQUET-1485 changed the initial buffer sizes (inputBuffer and outputBuffer) 
> from 0 to 128M in total. This causes the unit test TestSnappy() to fail with 
> an OOM exception. This is on my Mac laptop. 
> To fix the unit test failure, we can increase -Xmx from 512m to 1024m as 
> shown below. However, we need to evaluate whether the increased initial 
> direct memory usage for inputBuffer and outputBuffer will cause OOMs in real 
> Parquet applications that do not have a big enough -Xmx setting. 
> {code:xml}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-surefire-plugin</artifactId>
>   ...
>   <argLine>-Xmx1014m</argLine>
>   ...
> </plugin>
> {code}
> For details of the exception, see the commit page 
> ([https://github.com/apache/parquet-mr/commit/7dcdcdcf0eb5e91618c443d4a84973bf7883d79b]). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1539) Clarify CRC checksum in page header

2019-02-25 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1539:
-

Assignee: Boudewijn Braams

> Clarify CRC checksum in page header
> ---
>
> Key: PARQUET-1539
> URL: https://issues.apache.org/jira/browse/PARQUET-1539
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Boudewijn Braams
>Assignee: Boudewijn Braams
>Priority: Major
>  Labels: pull-request-available
>
> Although a page-level CRC field is defined in the Thrift specification, 
> currently neither parquet-cpp nor parquet-mr leverage it. Moreover, the 
> [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607]
>  in the Thrift specification reads ‘32bit crc for the data below’, which is 
> somewhat ambiguous as to what exactly constitutes the ‘data’ that the checksum 
> should be calculated on. To ensure backward- and cross-compatibility of 
> Parquet readers/writers which do want to leverage the CRC checksums, the 
> format should specify exactly how and on what data the checksum should be 
> calculated.
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of the 
> original CRC32 algorithm, each producing different values for the same input. 
> For ease of implementation we propose to use the standard CRC32 algorithm.
> h3. Including page header
> The page header itself could be included in the checksum calculation using an 
> approach similar to what TCP does, whereby the checksum field itself is 
> zeroed out before calculating the checksum that will be inserted there. 
> Evidently, including the page header is better in the sense that it increases 
> the data covered by the checksum. However, from an implementation 
> perspective, not including it is likely easier. Furthermore, given the 
> relatively small size of the page header compared to the page itself, simply 
> not including it will likely be good enough.
> h3. Compressed vs uncompressed
> *Compressed*
>  Pros
>  * Inherently faster, less data to operate on
>  * Potentially better triaging when determining where a corruption may have 
> been introduced, as the checksum is calculated at a later stage
> Cons
>  * We have to trust both the encoding stage and the compression stage
> *Uncompressed*
>  Pros
>  * We only have to trust the encoding stage
>  * Possibly able to detect more corruptions: as data is checksummed at the 
> earliest possible moment, the checksum will be more sensitive to corruption 
> introduced further down the line
> Cons
>  * Inherently slower, more data to operate on, always need to decompress first
>  * Potentially harder triaging, more stages in which corruption could have 
> been introduced
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*, whereby 
> the checksum is to be calculated on the *data only, not including the page 
> header* itself (simple implementation) and the checksum will be calculated on 
> *compressed data* (inherently faster, likely better triaging). 
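
A minimal sketch of the proposed computation, assuming the checksum is taken over the compressed page bytes only; java.util.zip.CRC32 implements the standard CRC32 referred to above, while the class and method names here are made up for illustration:

{code:java}
import java.util.zip.CRC32;

public final class PageChecksum {
  private PageChecksum() {}

  /** Computes the standard CRC32 over the compressed page payload, excluding the page header. */
  public static int crc32(byte[] compressedPageBytes) {
    CRC32 crc = new CRC32();
    crc.update(compressedPageBytes, 0, compressedPageBytes.length);
    // getValue() returns the checksum in the low 32 bits of a long; the page header's
    // crc field is 32 bits wide, so the value is narrowed to an int here.
    return (int) crc.getValue();
  }
}
{code}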



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)