[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

2020-07-20 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161683#comment-17161683
 ] 

Yuming Wang commented on PARQUET-1739:
--

[~sha...@uber.com] We cannot upgrade Parquet to 1.11 because of the Avro
dependency. Please work on this if you like.

> Make Spark SQL support Column indexes
> -
>
> Key: PARQUET-1739
> URL: https://issues.apache.org/jira/browse/PARQUET-1739
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
>
> Make Spark SQL support Column indexes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-07-20 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161644#comment-17161644
 ] 

Felix Kizhakkel Jose commented on PARQUET-1830:
---

https://issues.apache.org/jira/browse/SPARK-26345, but no one has picked up
that Jira yet.

 

> Vectorized API to support Column Index in Apache Spark
> --
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it
> seems that Apache Spark doesn't support Column Index unless we disable the
> vectorized reader in Spark, which will have other performance implications.
> As per [~zi], parquet-mr should implement a vectorized API. Is it already
> implemented, or is there a pull request for it?
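
For readers who want to try the workaround mentioned in the description, here is a minimal sketch of turning off Spark's vectorized Parquet reader from the command line. The configuration key `spark.sql.parquet.enableVectorizedReader` is a Spark SQL setting; the helper name `spark_submit_args` and the application path `job.py` are hypothetical. Whether column-index filtering actually takes effect with this setting depends on the Spark and Parquet versions in use, so treat this as something to verify rather than a guaranteed fix.

```python
def spark_submit_args(app_path, vectorized_reader=False):
    """Build a spark-submit argument list with the vectorized Parquet reader toggled.

    `app_path` is a placeholder for the user's own Spark application.
    """
    flag = "true" if vectorized_reader else "false"
    return [
        "spark-submit",
        "--conf", f"spark.sql.parquet.enableVectorizedReader={flag}",
        app_path,
    ]


# Print the command line that disables the vectorized reader.
print(" ".join(spark_submit_args("job.py")))
```

As the description notes, disabling the vectorized reader has its own performance cost, so this is a trade-off to measure rather than a default to adopt.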





[jira] [Commented] (PARQUET-1886) CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

2020-07-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161636#comment-17161636
 ] 

ASF GitHub Bot commented on PARQUET-1886:
-

XinDongIntel opened a new pull request #803:
URL: https://github.com/apache/parquet-mr/pull/803


   … for parquet-mr
   
   Make sure you have checked _all_ steps below.

   ### Jira

   - [ PARQUET-1886] CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr
     - https://issues.apache.org/jira/browse/PARQUET-1886

   ### Tests

   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

   ### Commits

   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"

   ### Documentation

   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr
> ---
>
> Key: PARQUET-1886
> URL: https://issues.apache.org/jira/browse/PARQUET-1886
> Project: Parquet
> Issue Type: Wish
> Components: parquet-mr
> Reporter: XinDong
> Priority: Major
>






[GitHub] [parquet-mr] XinDongIntel opened a new pull request #803: PARQUET-1886 CompressionCodec Provider-aware Compression Codec Lookup…

2020-07-20 Thread GitBox


XinDongIntel opened a new pull request #803:
URL: https://github.com/apache/parquet-mr/pull/803






[jira] [Created] (PARQUET-1886) CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

2020-07-20 Thread XinDong (Jira)
XinDong created PARQUET-1886:


 Summary: CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr
 Key: PARQUET-1886
 URL: https://issues.apache.org/jira/browse/PARQUET-1886
 Project: Parquet
 Issue Type: Wish
 Components: parquet-mr
 Reporter: XinDong








[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-07-20 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161573#comment-17161573
 ] 

Xinli Shang commented on PARQUET-1830:
--

[~FelixKJose] Do we have a Spark task created for implementing the short-term
solution?

> Vectorized API to support Column Index in Apache Spark
> --
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it
> seems that Apache Spark doesn't support Column Index unless we disable the
> vectorized reader in Spark, which will have other performance implications.
> As per [~zi], parquet-mr should implement a vectorized API. Is it already
> implemented, or is there a pull request for it?





[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

2020-07-20 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161566#comment-17161566
 ] 

Xinli Shang commented on PARQUET-1739:
--

[~yumwang], can you share whether the implementation to skip Parquet pages is
done in Spark, as [~gszadovszky] asked in SPARK-26346? If it isn't, I will
start looking into it.

> Make Spark SQL support Column indexes
> -
>
> Key: PARQUET-1739
> URL: https://issues.apache.org/jira/browse/PARQUET-1739
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
>
> Make Spark SQL support Column indexes.





[jira] [Commented] (PARQUET-1684) [parquet-protobuf] default protobuf field values are stored as nulls

2020-07-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161484#comment-17161484
 ] 

ASF GitHub Bot commented on PARQUET-1684:
-

dossett commented on pull request #702:
URL: https://github.com/apache/parquet-mr/pull/702#issuecomment-661286742


   cc @gszadovszky it looks like you are driving 1.11.1 (apologies if that is 
not the case)





> [parquet-protobuf] default protobuf field values are stored as nulls
> 
>
> Key: PARQUET-1684
> URL: https://issues.apache.org/jira/browse/PARQUET-1684
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.10.0, 1.11.0
> Reporter: George Haddad
> Assignee: Priyank Bagrecha
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.12.0
>
> When the source is a protobuf3 message and the target file is Parquet, all
> the default values are stored in the output Parquet file as `null` instead of
> the actual type's default value.
> For example, if the field is of type `int32`, `double` or `enum` and it
> hasn't been set, the Parquet value is `null` instead of `0`. When the
> field's type is `string` and it hasn't been set, the Parquet value is
> `null` instead of an empty string.
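
The behaviour the issue asks for can be sketched as follows. This is only an illustration in Python, not the parquet-protobuf implementation: the table `PROTO3_DEFAULTS`, the function `value_to_store`, and the use of `None` to mean "field was never set" are all assumptions made for the example. The default values themselves follow the proto3 rules (numeric types default to zero, enums to the value numbered 0, strings to the empty string).

```python
# Proto3 default values for a few scalar field types.
PROTO3_DEFAULTS = {
    "int32": 0,
    "double": 0.0,
    "enum": 0,      # enum fields default to the value numbered 0
    "string": "",
    "bool": False,
}


def value_to_store(field_type, value):
    """Return the value the issue expects to be written for a proto3 field.

    `None` is a hypothetical marker for "field was never set"; the reported
    bug corresponds to writing that marker through as a Parquet null.
    """
    if value is None:
        return PROTO3_DEFAULTS[field_type]
    return value


# An unset int32 becomes 0 and an unset string becomes "", not null.
print(value_to_store("int32", None), repr(value_to_store("string", None)))
```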





[GitHub] [parquet-mr] dossett commented on pull request #702: PARQUET-1684: dont store default protobuf values as null for proto3

2020-07-20 Thread GitBox


dossett commented on pull request #702:
URL: https://github.com/apache/parquet-mr/pull/702#issuecomment-661286742


   cc @gszadovszky it looks like you are driving 1.11.1 (apologies if that is 
not the case)







[jira] [Commented] (PARQUET-1684) [parquet-protobuf] default protobuf field values are stored as nulls

2020-07-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161481#comment-17161481
 ] 

ASF GitHub Bot commented on PARQUET-1684:
-

dossett commented on pull request #702:
URL: https://github.com/apache/parquet-mr/pull/702#issuecomment-661282325


   Can it be considered for 1.11.1? I see a release candidate is out.





> [parquet-protobuf] default protobuf field values are stored as nulls
> 
>
> Key: PARQUET-1684
> URL: https://issues.apache.org/jira/browse/PARQUET-1684
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.10.0, 1.11.0
> Reporter: George Haddad
> Assignee: Priyank Bagrecha
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.12.0
>
> When the source is a protobuf3 message and the target file is Parquet, all
> the default values are stored in the output Parquet file as `null` instead of
> the actual type's default value.
> For example, if the field is of type `int32`, `double` or `enum` and it
> hasn't been set, the Parquet value is `null` instead of `0`. When the
> field's type is `string` and it hasn't been set, the Parquet value is
> `null` instead of an empty string.





[GitHub] [parquet-mr] dossett commented on pull request #702: PARQUET-1684: dont store default protobuf values as null for proto3

2020-07-20 Thread GitBox


dossett commented on pull request #702:
URL: https://github.com/apache/parquet-mr/pull/702#issuecomment-661282325


   Can it be considered for 1.11.1? I see a release candidate is out.







[jira] [Commented] (PARQUET-1885) [parquet-protobuf] Pass descriptor to ProtoWriteSupport constructor

2020-07-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161473#comment-17161473
 ] 

ASF GitHub Bot commented on PARQUET-1885:
-

mauliksoneji opened a new pull request #802:
URL: https://github.com/apache/parquet-mr/pull/802


   addresses https://issues.apache.org/jira/browse/PARQUET-1885





> [parquet-protobuf] Pass descriptor to ProtoWriteSupport constructor 
> 
>
> Key: PARQUET-1885
> URL: https://issues.apache.org/jira/browse/PARQUET-1885
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0, 1.10.1
> Reporter: Maulik Soneji
> Priority: Major
>
> Currently, the ProtoWriteSupport class looks up the Descriptor by calling the
> `Protobufs.getMessageDescriptor` function, which searches for the descriptor on
> the classpath. There is no way to pass a descriptor as an argument to the
> ProtoWriteSupport constructor.
> In our use of the parquet-mr library, we rely on a descriptor that
> is not available on the classpath.
> I will be happy to work on adding this support to the parquet-mr library.





[GitHub] [parquet-mr] mauliksoneji opened a new pull request #802: PARQUET-1885: Pass descriptor to ProtoWriteSupport constructor

2020-07-20 Thread GitBox


mauliksoneji opened a new pull request #802:
URL: https://github.com/apache/parquet-mr/pull/802


   addresses https://issues.apache.org/jira/browse/PARQUET-1885







[jira] [Updated] (PARQUET-1885) [parquet-protobuf] Pass descriptor to ProtoWriteSupport constructor

2020-07-20 Thread Maulik Soneji (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maulik Soneji updated PARQUET-1885:
---
Affects Version/s: 1.11.0, 1.10.1

> [parquet-protobuf] Pass descriptor to ProtoWriteSupport constructor 
> 
>
> Key: PARQUET-1885
> URL: https://issues.apache.org/jira/browse/PARQUET-1885
> Project: Parquet
> Issue Type: Improvement
> Affects Versions: 1.11.0, 1.10.1
> Reporter: Maulik Soneji
> Priority: Major
>
> Currently, the ProtoWriteSupport class looks up the Descriptor by calling the
> `Protobufs.getMessageDescriptor` function, which searches for the descriptor on
> the classpath. There is no way to pass a descriptor as an argument to the
> ProtoWriteSupport constructor.
> In our use of the parquet-mr library, we rely on a descriptor that
> is not available on the classpath.
> I will be happy to work on adding this support to the parquet-mr library.





[jira] [Created] (PARQUET-1885) [parquet-protobuf] Pass descriptor to ProtoWriteSupport constructor

2020-07-20 Thread Maulik Soneji (Jira)
Maulik Soneji created PARQUET-1885:
--

 Summary: [parquet-protobuf] Pass descriptor to ProtoWriteSupport constructor
 Key: PARQUET-1885
 URL: https://issues.apache.org/jira/browse/PARQUET-1885
 Project: Parquet
 Issue Type: Improvement
 Reporter: Maulik Soneji


Currently, the ProtoWriteSupport class looks up the Descriptor by calling the
`Protobufs.getMessageDescriptor` function, which searches for the descriptor on
the classpath. There is no way to pass a descriptor as an argument to the
ProtoWriteSupport constructor.

In our use of the parquet-mr library, we rely on a descriptor that is
not available on the classpath.
I will be happy to work on adding this support to the parquet-mr library.
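
The shape of the requested change can be sketched in a few lines. This is a Python stand-in written for illustration, not the parquet-mr Java API: `WriteSupportSketch`, `lookup_descriptor`, and the `_CLASSPATH` dictionary (standing in for a classpath lookup) are all hypothetical names. The idea is only that an explicitly supplied descriptor takes precedence, with the existing lookup kept as the fallback.

```python
# Hypothetical stand-in for descriptors discoverable on the classpath.
_CLASSPATH = {"com.example.Known": "descriptor-for-Known"}


def lookup_descriptor(class_name):
    """Mimic a classpath lookup; raises LookupError if nothing is found."""
    try:
        return _CLASSPATH[class_name]
    except KeyError:
        raise LookupError(f"no descriptor on classpath for {class_name}") from None


class WriteSupportSketch:
    """Accept a descriptor directly, or fall back to the lookup by class name."""

    def __init__(self, class_name=None, descriptor=None):
        if descriptor is None:
            if class_name is None:
                raise ValueError("either class_name or descriptor is required")
            descriptor = lookup_descriptor(class_name)
        self.descriptor = descriptor


# A descriptor that is not on the "classpath" can still be used when passed in.
print(WriteSupportSketch(descriptor="runtime-built-descriptor").descriptor)
```

Keeping the lookup as a fallback means existing callers that rely on the classpath would not need to change.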





[jira] [Updated] (PARQUET-1885) [parquet-protobuf] Pass descriptor to ProtoWriteSupport constructor

2020-07-20 Thread Maulik Soneji (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maulik Soneji updated PARQUET-1885:
---
Component/s: parquet-mr

> [parquet-protobuf] Pass descriptor to ProtoWriteSupport constructor 
> 
>
> Key: PARQUET-1885
> URL: https://issues.apache.org/jira/browse/PARQUET-1885
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0, 1.10.1
>Reporter: Maulik Soneji
>Priority: Major
>
> Currently, the ProtoWriteSupport class checks for Descriptor by calling 
> `Protobufs.getMessageDescriptor` function which checks for descriptor in the 
> classpath. There is no way to pass descriptor as an argument to the 
> ProtoWriteSupport constructor.
> In our approach to using parquet-mr library, we are using a descriptor that 
> is not available in the classpath.
> I will be happy to work on adding this support to the parquet-mr library.





Re: How to incrementally store timeseries in Parquet files for efficient retrieval?

2020-07-20 Thread Tim Armstrong
The usual solution is to partition the data based on the criteria you want
to filter by. E.g. for Hive tables, you would partition by date and have a
separate directory per date.
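
As a rough illustration of the per-date layout, the sketch below builds Hive-style `date=YYYY-MM-DD` directory names and lists the directories a six-month query would need to touch. The root name `stocks` and the helper names are assumptions for the example; no Parquet library is involved, it only shows how a date filter maps to a set of directories.

```python
from datetime import date, timedelta


def partition_dir(root, day):
    """Hive-style partition directory for one day, e.g. stocks/date=2020-07-01."""
    return f"{root}/date={day.isoformat()}"


def dirs_for_range(root, start, end):
    """Directories a reader must visit for an inclusive date range."""
    n_days = (end - start).days
    return [partition_dir(root, start + timedelta(days=i)) for i in range(n_days + 1)]


# Only the directories inside the requested range are listed; the rest
# of the table is never opened.
wanted = dirs_for_range("stocks", date(2020, 1, 1), date(2020, 6, 30))
print(len(wanted), wanted[0], wanted[-1])
```

If one directory per day produces too many small files, the same idea works with a coarser key such as month or year.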

If you have a relatively modern version of Parquet, stats and page indices
will allow the reader to filter out files based on ranges of values in the
file after reading the file footers. Reading the footer takes longer than
not reading the file at all, but is much faster than reading the whole file.
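
The range-based filtering described above can be sketched as a simple overlap test. This is an assumption-laden illustration, not a Parquet reader: `FileStats` is a hypothetical record holding the minimum and maximum of the date column as it might be recovered from a file footer, and dates are ISO strings so that plain string comparison orders them correctly.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file min/max of the date column, as read from a footer."""
    path: str
    min_value: str
    max_value: str


def files_to_read(stats, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [s.path for s in stats if s.max_value >= lo and s.min_value <= hi]


stats = [
    FileStats("part0_stocks.parquet", "2020-07-01", "2020-07-01"),
    FileStats("part1_stocks.parquet", "2020-07-02", "2020-07-02"),
]
# Only the file whose range covers 2 Jul survives the filter.
print(files_to_read(stats, "2020-07-02", "2020-07-31"))
```

The footers still have to be read to obtain the ranges, which is why directory-level partitioning remains cheaper when it applies.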

On Sat, Jul 18, 2020 at 8:21 AM Yash Ganthe wrote:

> I would like to store the stock price of a large number of companies in a
> parquet file in the form of a timeseries.
> If I gather the data at the end of 1 Jul, I would be writing a file such
> as:
> 1 Jul 2020, Company1,35
> 1 Jul 2020, Company2,46
> 
>
> On 2 Jul, I would receive the new prices and would write them in "append"
> mode as:
> 2 Jul 2020, Company1,37
> 2 Jul 2020, Company2,43
> ...
>
> This will result in 2 partition files being created for the same parquet
> file:
> stocks.parquet/
> part0_stocks.parquet written on 1 Jul
> part1_stocks.parquet written on 2 Jul
>
> If this continues for years, I will have a large number of partition files
> created, one per day.
> If a client application wants to fetch the timeseries for 6 months, it will
> have to read many files to gather the data, which may be inefficient.
>
> Is there a better way to store timeseries data in parquet?
>


[jira] [Commented] (PARQUET-14) Pig and Hive cannot read repeated groups written with parquet-protobuf

2020-07-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-14?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161135#comment-17161135
 ] 

ASF GitHub Bot commented on PARQUET-14:
---

NathanHowell closed pull request #14:
URL: https://github.com/apache/parquet-mr/pull/14


   





> Pig and Hive cannot read repeated groups written with parquet-protobuf
> --
>
> Key: PARQUET-14
> URL: https://issues.apache.org/jira/browse/PARQUET-14
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Reporter: Nathan Howell
> Priority: Major
>
> parquet-hive and parquet-pig make assumptions about list schemas that are not 
> compatible with the more compact schemas generated by parquet-protobuf. This 
> bug was discussed in more detail on 
> https://github.com/Parquet/parquet-mr/issues/354





[GitHub] [parquet-mr] NathanHowell closed pull request #14: PARQUET-14: Support lifted coercions of file schemas into compatible read schemas.

2020-07-20 Thread GitBox


NathanHowell closed pull request #14:
URL: https://github.com/apache/parquet-mr/pull/14


   


