Same Travis failure for several parquet-format PRs

2018-06-22 Thread Nandor Kollar
Hi All,

I recently noticed that Travis builds on parquet-format fail with:

[ERROR] Plugin org.apache.maven.plugins:maven-remote-resources-plugin:1.5
or one of its dependencies could not be resolved: Failed to read artifact
descriptor for
org.apache.maven.plugins:maven-remote-resources-plugin:jar:1.5: Could not
transfer artifact
org.apache.maven.plugins:maven-remote-resources-plugin:pom:1.5 from/to
central (https://repo.maven.apache.org/maven2): Received fatal alert:
protocol_version -> [Help 1]

Since several Travis builds failed for a similar reason on unrelated PRs, I
suspect this might be an infrastructure problem. Does anyone know whom to
contact in this case? Should we open an Apache infra ticket?

Thanks,
Nandor


[jira] [Created] (PARQUET-1335) Logical type names in parquet-mr are not consistent with parquet-format

2018-06-22 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1335:
--

         Summary: Logical type names in parquet-mr are not consistent with parquet-format
             Key: PARQUET-1335
             URL: https://issues.apache.org/jira/browse/PARQUET-1335
         Project: Parquet
      Issue Type: Improvement
      Components: parquet-mr
        Reporter: Nandor Kollar
        Assignee: Nandor Kollar


UTF8 logical type should be called STRING, INT should be called INTEGER.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1335) Logical type names in parquet-mr are not consistent with parquet-format

2018-06-22 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1335:
--
Affects Version/s: 1.11.0

> Logical type names in parquet-mr are not consistent with parquet-format
> ---
>
>              Key: PARQUET-1335
>              URL: https://issues.apache.org/jira/browse/PARQUET-1335
>          Project: Parquet
>       Issue Type: Improvement
>       Components: parquet-mr
> Affects Versions: 1.11.0
>         Reporter: Nandor Kollar
>         Assignee: Nandor Kollar
>         Priority: Minor
>
> UTF8 logical type should be called STRING, INT should be called INTEGER.





[jira] [Commented] (PARQUET-1335) Logical type names in parquet-mr are not consistent with parquet-format

2018-06-22 Thread Gabor Szadovszky (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520285#comment-16520285
 ] 

Gabor Szadovszky commented on PARQUET-1335:
---

Please make sure this is pushed before PARQUET-1253 is released.



[jira] [Commented] (PARQUET-1335) Logical type names in parquet-mr are not consistent with parquet-format

2018-06-22 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520290#comment-16520290
 ] 

Nandor Kollar commented on PARQUET-1335:


[~gszadovszky] Sure, but since the new API is not yet released, this shouldn't
be a backward-incompatible change.

> Logical type names in parquet-mr are not consistent with parquet-format
> ---
>
>              Key: PARQUET-1335
>              URL: https://issues.apache.org/jira/browse/PARQUET-1335
>          Project: Parquet
>       Issue Type: Improvement
>       Components: parquet-mr
> Affects Versions: 1.11.0
>         Reporter: Nandor Kollar
>         Assignee: Nandor Kollar
>         Priority: Minor
>           Labels: pull-request-available
>
> UTF8 logical type should be called STRING, INT should be called INTEGER.





[jira] [Commented] (PARQUET-1335) Logical type names in parquet-mr are not consistent with parquet-format

2018-06-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520288#comment-16520288
 ] 

ASF GitHub Bot commented on PARQUET-1335:
-

nandorKollar opened a new pull request #496: PARQUET-1335: Logical type names 
in parquet-mr are not consistent with parquet-format
URL: https://github.com/apache/parquet-mr/pull/496
 
 
   






[jira] [Updated] (PARQUET-1335) Logical type names in parquet-mr are not consistent with parquet-format

2018-06-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1335:

Labels: pull-request-available  (was: )



Re: Estimated row-group size is significantly higher than the written one

2018-06-22 Thread Ryan Blue
I think you're right about the cause. The current estimate is what is
buffered in memory, so it includes all of the intermediate data for the
last page before it is finalized and compressed.
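
To illustrate the gap between the buffered size and the written size, here is
a minimal Java sketch. It is not parquet-mr code: the class and method names
are hypothetical, and the sizing rule (4 bytes per buffered dictionary index,
then bit-packing at the smallest width that fits the dictionary) is a
simplification of the case described in the quoted message below. Run headers
are ignored, and RLE runs can shrink the page further.

```java
// Hypothetical sketch, not part of parquet-mr.
final class DictionaryIndexSizeSketch {

    // Bytes counted while the page is open: each dictionary index is
    // assumed to be held as a 4-byte int.
    static long bufferedBytes(long valueCount) {
        return valueCount * 4L;
    }

    // Smallest number of bits that can represent indexes 0..dictionarySize-1.
    // Integer.numberOfLeadingZeros(0) is 32, so a 1-entry dictionary gives 0.
    static int bitWidth(int dictionarySize) {
        return 32 - Integer.numberOfLeadingZeros(dictionarySize - 1);
    }

    // Approximate bit-packed payload in bytes, ignoring run headers.
    static long bitPackedBytes(long valueCount, int dictionarySize) {
        return (valueCount * bitWidth(dictionarySize) + 7) / 8;
    }
}
```

For one million values and a 16-entry dictionary, this sketch counts
4,000,000 buffered bytes against roughly 500,000 bit-packed bytes, before any
RLE or compression is applied.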

We could probably get a better estimate by using the amount of buffered
data and how large other pages in a column were after fully encoding and
compressing. So if you have 5 pages compressed and buffered, and another
1000 values, use the compression ratio of the 5 pages to estimate the final
size. We'd probably want to use some overhead value for the header. And,
we'd want to separate the amount of buffered data from our row group size
estimate, which are currently the same thing.
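
A rough Java sketch of that idea follows. It is only an illustration under
stated assumptions, not the parquet-mr implementation: the class, the method,
and the 64-byte header overhead are all hypothetical, and the ratio is taken
per column as compressed bytes divided by raw bytes of the pages already
finished.

```java
// Hypothetical sketch, not part of parquet-mr.
final class RowGroupSizeEstimateSketch {

    // Assumed per-page header overhead in bytes; the value is a placeholder.
    static final long PAGE_HEADER_OVERHEAD = 64;

    /**
     * Estimates the final on-disk size of one column chunk.
     *
     * @param finishedRawBytes        raw bytes of pages already encoded and compressed
     * @param finishedCompressedBytes size of those pages after encoding and compression
     * @param openPageBufferedBytes   bytes buffered in memory for the still-open page
     */
    static long estimateColumnBytes(long finishedRawBytes,
                                    long finishedCompressedBytes,
                                    long openPageBufferedBytes) {
        // Without a finished page there is no ratio to apply, so fall back
        // to the buffered size (ratio 1.0).
        double ratio = finishedRawBytes > 0
                ? (double) finishedCompressedBytes / finishedRawBytes
                : 1.0;
        long openPageEstimate =
                (long) Math.ceil(openPageBufferedBytes * ratio) + PAGE_HEADER_OVERHEAD;
        return finishedCompressedBytes + openPageEstimate;
    }
}
```

The row-group estimate would then be the sum of this value over all columns,
tracked separately from the amount of memory actually buffered.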

rb

On Thu, Jun 21, 2018 at 1:17 AM Gabor Szadovszky wrote:

> Hi All,
>
> One of our customers faced the following issue. parquet.block.size is
> configured to 128M (parquet.writer.max-padding is left at the default
> 8M). On average, 7 row-groups are generated in one block, with sizes of
> ~74M, ~16M, ~12M, ~9M, ~7M, ~5M, ~4M. Increasing the padding to e.g. 60M
> results in only one row-group per block, but it wastes disk space.
> The logs show that parquet-mr thinks the row-group is already close to
> 128M, so it writes the first one, then realizes there is still space left
> before reaching the block size, and so on:
> INFO hadoop.InternalParquetRecordWriter: mem size 134,673,545 > 134,217,728: flushing 484,972 records to disk.
> INFO hadoop.InternalParquetRecordWriter: mem size 59,814,120 > 59,814,925: flushing 99,030 records to disk.
> INFO hadoop.InternalParquetRecordWriter: mem size 43,396,192 > 43,397,248: flushing 71,848 records to disk.
> ...
>
> My guess at the root cause is that there are many dictionary-encoded
> columns with low value variance. When we approximate the row-group size,
> some pages are still open (not yet encoded). If these pages are
> dictionary encoded, we count the dictionary indexes as 4-byte values.
> But if the variance is low, RLE and bit-packing will shrink these pages
> dramatically.
>
> What do you think? Can we make the approximation a bit better?
> Are there any properties that can solve this issue?
>
> Thanks a lot,
> Gabor
>


-- 
Ryan Blue
Software Engineer
Netflix