[DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-11 Thread Benjamin Kietzman
In the context of https://issues.apache.org/jira/browse/ARROW-9318 /
https://github.com/apache/arrow/pull/8023, which port the parquet-mr
design to C++: there is some question whether that design is consistent
with the style and conventions of the C++ implementation of Parquet.

Here is a Gist with a sketched alternative design adapted to C++:
https://gist.github.com/bkietz/f701d32add6f5a2aeed89ce36a443d43

Specific concerns in brief:
- Multiple internal classes are left public in header files, where it would be
  preferred that public classes be kept to a minimum.
- Several levels of explicit synchronization with mutexes are used to guarantee
  that each class is completely thread safe, but this is unnecessary overhead
  given the pattern of use of `parquet-cpp`.


[jira] [Updated] (PARQUET-1945) Add an option to allow auto conversion from empty fields to NULL

2020-11-11 Thread Zheng Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated PARQUET-1945:

Description: 
Right now, the Parquet writer throws an exception:

{{Parquet record is malformed: empty fields are illegal, the field should be 
ommited completely instead}}

when an empty field (array, struct, or map, I guess?) is written.

The suggestion here is to add an option "auto_convert_empty_fields_to_null" 
that converts empty fields to null automatically on write.

The LOC to change is 
[here:|https://sourcegraph.com/github.com/apache/parquet-mr/-/blob/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L328]
{quote}
{{if (emptyField) {}}
{{    throw new ParquetEncodingException("empty fields are illegal, the field should be ommited completely instead");}}
{{}}}
{quote}
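
To make the proposal concrete, here is a minimal, self-contained sketch of how such an option might behave. It is not the parquet-mr implementation: `EmptyFieldSketch`, `SketchWriter`, `EncodingException`, and the constructor flag are hypothetical names introduced only for illustration, and the sketch handles flat (non-nested) fields only. It assumes that "converting to null" means dropping the field entirely, which is what the error message itself asks the caller to do.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyFieldSketch {

    /** Hypothetical stand-in for ParquetEncodingException so the sketch compiles on its own. */
    static class EncodingException extends RuntimeException {
        EncodingException(String message) {
            super(message);
        }
    }

    /** Toy writer for flat (non-nested) fields; it records the events it would emit. */
    static class SketchWriter {
        // Hypothetical flag modelled on the proposed "auto_convert_empty_fields_to_null" option.
        private final boolean autoConvertEmptyFieldsToNull;
        private final List<String> events = new ArrayList<>();
        private boolean emptyField = true;

        SketchWriter(boolean autoConvertEmptyFieldsToNull) {
            this.autoConvertEmptyFieldsToNull = autoConvertEmptyFieldsToNull;
        }

        void startField(String name) {
            emptyField = true;
            events.add("start:" + name);
        }

        void addValue(Object value) {
            emptyField = false;
            events.add("value:" + value);
        }

        void endField(String name) {
            if (emptyField) {
                if (!autoConvertEmptyFieldsToNull) {
                    // Current behaviour described in the issue: reject the empty field.
                    throw new EncodingException(
                        "empty fields are illegal, the field should be ommited completely instead");
                }
                // Proposed behaviour (assumed): drop the pending start event so the
                // field is simply absent, which a reader sees as null.
                events.remove(events.size() - 1);
                return;
            }
            events.add("end:" + name);
        }

        List<String> events() {
            return events;
        }
    }

    public static void main(String[] args) {
        SketchWriter lenient = new SketchWriter(true);
        lenient.startField("id");
        lenient.addValue(1);
        lenient.endField("id");
        lenient.startField("tags"); // nothing is written for this field
        lenient.endField("tags");
        System.out.println(lenient.events()); // prints [start:id, value:1, end:id]

        SketchWriter strict = new SketchWriter(false);
        strict.startField("tags");
        try {
            strict.endField("tags");
        } catch (EncodingException e) {
            System.out.println("strict mode: " + e.getMessage());
        }
    }
}
```

Keeping the check in `endField` mirrors where the quoted `if (emptyField)` test sits; a real change would also have to deal with nested groups, which this sketch deliberately ignores.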
 



> Add an option to allow auto conversion from empty fields to NULL
> 
>
> Key: PARQUET-1945
> URL: https://issues.apache.org/jira/browse/PARQUET-1945
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Zheng Shao
>Priority: Minor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1945) Add an option to allow auto conversion from empty fields to NULL

2020-11-11 Thread Zheng Shao (Jira)
Zheng Shao created PARQUET-1945:
---

 Summary: Add an option to allow auto conversion from empty fields 
to NULL
 Key: PARQUET-1945
 URL: https://issues.apache.org/jira/browse/PARQUET-1945
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Zheng Shao


Right now, the Parquet writer throws an exception:

{{Parquet record is malformed: empty fields are illegal, the field should be 
ommited completely instead}}

when an empty field (array, struct, or map, I guess?) is written.

The suggestion here is to add an option "auto_convert_empty_fields_to_null" 
that converts empty fields to null automatically on write.

The LOC to change is 
[here:|https://sourcegraph.com/github.com/apache/parquet-mr/-/blob/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L328]



{{ if (emptyField) {}}
{{    throw new ParquetEncodingException("empty fields are illegal, the field 
should be ommited completely instead");}}
{{ }}}





[jira] [Commented] (PARQUET-1944) Unable to download transitive dependency hadoop-lzo

2020-11-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230118#comment-17230118
 ] 

ASF GitHub Bot commented on PARQUET-1944:
-

shangxinli commented on pull request #843:
URL: https://github.com/apache/parquet-mr/pull/843#issuecomment-725554202


   LGTM



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Unable to download transitive dependency hadoop-lzo
> ---
>
> Key: PARQUET-1944
> URL: https://issues.apache.org/jira/browse/PARQUET-1944
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Blocker
>
> Seems that the solution for PARQUET-1691 is not complete. We have to exclude 
> `hadoop-lzo` from the transitive dependencies of `elephant-bird-pig` as well. 
> Not sure why we did not recognize this issue before.





[GitHub] [parquet-mr] gszadovszky commented on pull request #843: PARQUET-1944: Unable to download transitive dependency hadoop-lzo

2020-11-11 Thread GitBox


gszadovszky commented on pull request #843:
URL: https://github.com/apache/parquet-mr/pull/843#issuecomment-725334361


   @shangxinli, could you please check it out? It blocks all of our recent PRs.







[jira] [Commented] (PARQUET-1944) Unable to download transitive dependency hadoop-lzo

2020-11-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229885#comment-17229885
 ] 

ASF GitHub Bot commented on PARQUET-1944:
-

gszadovszky opened a new pull request #843:
URL: https://github.com/apache/parquet-mr/pull/843


   Make sure you have checked _all_ steps below.

   ### Jira

   - [ ] My PR addresses the following [PARQUET-1944](https://issues.apache.org/jira/browse/PARQUET-1944) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

   ### Tests

   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

   ### Commits

   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"

   ### Documentation

   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does





Re: travis failures due to maven.twttr.com outage

2020-11-11 Thread Gabor Szadovszky
Hi everyone,

So, the issue seems to be the same as in PARQUET-1691, which we already
resolved. What I do not understand is why we did not discover that the
solution was not complete. Anyway, I created another Jira, PARQUET-1944, to
track the current issue and opened the related pull request.

Cheers,
Gabor

On Mon, Nov 9, 2020 at 11:43 AM Gabor Szadovszky wrote:

> Hi everyone,
>
> We have had Travis failures for a week because maven.twttr.com does not work.
> (See an example at https://api.travis-ci.org/v3/job/741385934/log.txt.)
> The funny part is we should not depend on that repo and the artifact
> failing to be downloaded is not in our dependency tree. (See my fix from
> last year at
> https://github.com/apache/parquet-mr/commit/d1190abee3ff1f1183757f7e400d7b9b4025d95b
> .)
>
> Any idea what causes these issues and how to fix them?
>
> Cheers,
> Gabor
>


[jira] [Created] (PARQUET-1944) Unable to download transitive dependency hadoop-lzo

2020-11-11 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1944:
-

 Summary: Unable to download transitive dependency hadoop-lzo
 Key: PARQUET-1944
 URL: https://issues.apache.org/jira/browse/PARQUET-1944
 Project: Parquet
  Issue Type: Bug
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


Seems that the solution for PARQUET-1691 is not complete. We have to exclude 
`hadoop-lzo` from the transitive dependencies of `elephant-bird-pig` as well. 
Not sure why we did not recognize this issue before.
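
For readers unfamiliar with Maven exclusions, the kind of change described above might look roughly like the fragment below. This is a sketch, not the content of the actual pull request: the group IDs shown are the coordinates these artifacts are commonly published under, and the `${elephant-bird.version}` property is an assumption, so both should be checked against the real `pom.xml`.

```xml
<dependency>
  <groupId>com.twitter.elephantbird</groupId>
  <artifactId>elephant-bird-pig</artifactId>
  <!-- Assumption: the version is managed through a property; substitute the real one. -->
  <version>${elephant-bird.version}</version>
  <exclusions>
    <!-- Stop Maven from resolving hadoop-lzo transitively through this dependency. -->
    <exclusion>
      <groupId>com.hadoop.gplcompression</groupId>
      <artifactId>hadoop-lzo</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Running `mvn dependency:tree -Dincludes=com.hadoop.gplcompression:hadoop-lzo` in the affected module is one way to check whether any other dependency still pulls the artifact in.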


