date:20210201



[ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276914#comment-17276914
 ] 

ASF GitHub Bot commented on PARQUET-1950:
-

ggershinsky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771442874


   to add the parquet encryption angle to this discussion. This feature adds 
protection of confidentiality and integrity of parquet files (when they have 
columns with sensitive data). These security layers will make it difficult to 
support many of the legacy features mentioned above, like external chunks or 
merging multiple files into a single master file (this interferes with 
definition of file integrity). Reading encrypted data is also difficult before 
file writing is finished. All of these are not impossible, but challenging, and 
would require an explicit scaffolding plus some Thrift format changes. If there 
is a strong demand for using encryption with these legacy features, despite 
them being deprecated (or with some of the mentioned new features), we can plan 
this for future versions of parquet-format, parquet-mr etc.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Define core features / compliance level
> ---
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Parquet format is getting more and more features while the different 
> implementations cannot keep the pace and left behind with some features 
> implemented and some are not. In many cases it is also not clear if the 
> related feature is mature enough to be used widely or more an experimental 
> one.
> These are huge issues that makes hard ensure interoperability between the 
> different implementations.
> The following idea came up in a 
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
>  Create a now document in the parquet-format repository that lists the "core 
> features". This document is versioned by the parquet-format releases. This 
> way a certain version of "core features" defines a level of compatibility 
> between the different implementations. This version number can be written to 
> a new field (e.g. complianceLevel) in the footer. If an implementation writes 
> a file with a version in the field it must implement all the related "core 
> features" (read and write) and must not use any other features at write 
> because it makes the data unreadable by another implementation if only the 
> same level of "core features" are implemented.
> For example if we have encoding A listed in the version 1 "core features" but 
> encoding B is not then at "complianceLevel = 1" we can use encoding A but we 
> cannot use encoding B because it would make the related data unreadable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-format] ggershinsky commented on pull request #164: PARQUET-1950: Define core features



ggershinsky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771442874


   to add the parquet encryption angle to this discussion. This feature adds 
protection of confidentiality and integrity of parquet files (when they have 
columns with sensitive data). These security layers will make it difficult to 
support many of the legacy features mentioned above, like external chunks or 
merging multiple files into a single master file (this interferes with 
definition of file integrity). Reading encrypted data is also difficult before 
file writing is finished. All of these are not impossible, but challenging, and 
would require an explicit scaffolding plus some Thrift format changes. If there 
is a strong demand for using encryption with these legacy features, despite 
them being deprecated (or with some of the mentioned new features), we can plan 
this for future versions of parquet-format, parquet-mr etc.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Updated] (PARQUET-1967) Upgrade Zstd-jni to 1.4.8-3

2021-02-01 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/PARQUET-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated PARQUET-1967:
---
Summary: Upgrade Zstd-jni to 1.4.8-3  (was: Upgrade Zstd-jni to 1.4.8-2)

> Upgrade Zstd-jni to 1.4.8-3
> ---
>
> Key: PARQUET-1967
> URL: https://issues.apache.org/jira/browse/PARQUET-1967
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2021-02-01 Thread Nicholas Chammas (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276862#comment-17276862
 ] 

Nicholas Chammas commented on PARQUET-41:
-

Thanks for the link [~yumwang]. That 
[README|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop#readme] 
is what I was looking for.

Are these docs published on the [documentation 
site|http://parquet.apache.org/documentation/latest/] anywhere, or is the 
README file on GitHub the canonical reference?

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Junjie Chen
>Priority: Major
>  Labels: filter2, pull-request-available
> Fix For: format-2.7.0, 1.12.0
>
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics



[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276854#comment-17276854
 ] 

Yuming Wang commented on PARQUET-41:


[~nchammas] You can check the related configuration parameters: 
[https://github.com/apache/parquet-mr/tree/master/parquet-hadoop 
|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop]This is an 
example:
{code:scala}
val numRows = 1024 * 1024 * 15
val df = spark.range(numRows).selectExpr(
  "id",
  "cast(id as string) as s",
  "cast(id as timestamp) as ts",
  "cast(cast(id as timestamp) as date) as td",
  "cast(id as decimal) as dec")
val benchmark = new org.apache.spark.benchmark.Benchmark(
  "Benchmark bloom filter write",
  numRows,
  minNumIters = 5)

benchmark.addCase("default") { _ =>
  withSQLConf() {
df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for ts column") { _ =>
  withSQLConf(
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> 
"false",
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" 
-> "true") {
df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for ts and dec column") { _ =>
  withSQLConf(
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> 
"false",
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" 
-> "true",
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#dec" 
-> "true") {
df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for all column") { _ =>
  withSQLConf(
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> 
"true") {
df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}
benchmark.run()
{code}

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Junjie Chen
>Priority: Major
>  Labels: filter2, pull-request-available
> Fix For: format-2.7.0, 1.12.0
>
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1950) Define core features / compliance level



[ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276852#comment-17276852
 ] 

ASF GitHub Bot commented on PARQUET-1950:
-

timarmstrong commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771353955


   +1 to @emkornfield's comment - the intent of this is to establish a clear 
baseline about what is supported widely in practice - there are a bunch of 
Parquet features that are in the standard but are hard to use in practice 
because they don't have read support from other implementatoins. I think it 
should ultimately make it easier to get adoption on new features cause the 
status of each feature will be clearer.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Define core features / compliance level
> ---
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Parquet format is getting more and more features while the different 
> implementations cannot keep the pace and left behind with some features 
> implemented and some are not. In many cases it is also not clear if the 
> related feature is mature enough to be used widely or more an experimental 
> one.
> These are huge issues that makes hard ensure interoperability between the 
> different implementations.
> The following idea came up in a 
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
>  Create a now document in the parquet-format repository that lists the "core 
> features". This document is versioned by the parquet-format releases. This 
> way a certain version of "core features" defines a level of compatibility 
> between the different implementations. This version number can be written to 
> a new field (e.g. complianceLevel) in the footer. If an implementation writes 
> a file with a version in the field it must implement all the related "core 
> features" (read and write) and must not use any other features at write 
> because it makes the data unreadable by another implementation if only the 
> same level of "core features" are implemented.
> For example if we have encoding A listed in the version 1 "core features" but 
> encoding B is not then at "complianceLevel = 1" we can use encoding A but we 
> cannot use encoding B because it would make the related data unreadable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-format] timarmstrong commented on pull request #164: PARQUET-1950: Define core features



timarmstrong commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771353955


   +1 to @emkornfield's comment - the intent of this is to establish a clear 
baseline about what is supported widely in practice - there are a bunch of 
Parquet features that are in the standard but are hard to use in practice 
because they don't have read support from other implementatoins. I think it 
should ultimately make it easier to get adoption on new features cause the 
status of each feature will be clearer.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2021-02-01 Thread Nicholas Chammas (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276842#comment-17276842
 ] 

Nicholas Chammas commented on PARQUET-41:
-

Where is the user documentation for all the bloom filter-related functionality 
that will be released as part of parquet-mr 1.12? I'm thinking of user settings 
like {{parquet.filter.bloom.enabled}} and {{parquet.bloom.filter.*}}, along 
with anything else a user might care about.

For example, if a Spark user wants to use or configure bloom filters on their 
Parquet data, what documentation should they reference?

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Junjie Chen
>Priority: Major
>  Labels: filter2, pull-request-available
> Fix For: format-2.7.0, 1.12.0
>
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1969) Test by GithubAction



[ 
https://issues.apache.org/jira/browse/PARQUET-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276836#comment-17276836
 ] 

ASF GitHub Bot commented on PARQUET-1969:
-

wangyum commented on pull request #860:
URL: https://github.com/apache/parquet-mr/pull/860#issuecomment-771332722


   Tested by 
https://github.com/wangyum/parquet-mr/runs/1811695243?check_suite_focus=true



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Test by GithubAction
> 
>
> Key: PARQUET-1969
> URL: https://issues.apache.org/jira/browse/PARQUET-1969
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-mr] wangyum commented on pull request #860: PARQUET-1969: Test by GithubAction



wangyum commented on pull request #860:
URL: https://github.com/apache/parquet-mr/pull/860#issuecomment-771332722


   Tested by 
https://github.com/wangyum/parquet-mr/runs/1811695243?check_suite_focus=true



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-1969) Test by GithubAction



[ 
https://issues.apache.org/jira/browse/PARQUET-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276833#comment-17276833
 ] 

Yuming Wang commented on PARQUET-1969:
--

Travis has been broken for several days. I have tested GithubAction: 
https://github.com/wangyum/parquet-mr/actions/runs/529590762

> Test by GithubAction
> 
>
> Key: PARQUET-1969
> URL: https://issues.apache.org/jira/browse/PARQUET-1969
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1950) Define core features / compliance level



[ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276830#comment-17276830
 ] 

ASF GitHub Bot commented on PARQUET-1950:
-

emkornfield commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771331050


   @raduteo the main driver for this PR is there has been a lot of confusion as 
what is defined as needing core support.  I think once we finish this PR I'm 
not fully opposed to the idea of supporting this field but I think we need to 
go into greater detail in the specification of what supporting the individual 
files actually means (and i think willing to help both Java and C++ support 
both can go a long way to convincing people that it should become a core 
feature).  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Define core features / compliance level
> ---
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Parquet format is getting more and more features while the different 
> implementations cannot keep the pace and left behind with some features 
> implemented and some are not. In many cases it is also not clear if the 
> related feature is mature enough to be used widely or more an experimental 
> one.
> These are huge issues that makes hard ensure interoperability between the 
> different implementations.
> The following idea came up in a 
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
>  Create a now document in the parquet-format repository that lists the "core 
> features". This document is versioned by the parquet-format releases. This 
> way a certain version of "core features" defines a level of compatibility 
> between the different implementations. This version number can be written to 
> a new field (e.g. complianceLevel) in the footer. If an implementation writes 
> a file with a version in the field it must implement all the related "core 
> features" (read and write) and must not use any other features at write 
> because it makes the data unreadable by another implementation if only the 
> same level of "core features" are implemented.
> For example if we have encoding A listed in the version 1 "core features" but 
> encoding B is not then at "complianceLevel = 1" we can use encoding A but we 
> cannot use encoding B because it would make the related data unreadable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-format] emkornfield commented on pull request #164: PARQUET-1950: Define core features



emkornfield commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771331050


   @raduteo the main driver for this PR is there has been a lot of confusion as 
what is defined as needing core support.  I think once we finish this PR I'm 
not fully opposed to the idea of supporting this field but I think we need to 
go into greater detail in the specification of what supporting the individual 
files actually means (and i think willing to help both Java and C++ support 
both can go a long way to convincing people that it should become a core 
feature).  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-1967) Upgrade Zstd-jni to 1.4.8-2



[ 
https://issues.apache.org/jira/browse/PARQUET-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276761#comment-17276761
 ] 

ASF GitHub Bot commented on PARQUET-1967:
-

dongjoon-hyun commented on pull request #859:
URL: https://github.com/apache/parquet-mr/pull/859#issuecomment-771263507


   Do I need to rebase this PR to see the green build?
   > Let's wait for a green build.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade Zstd-jni to 1.4.8-2
> ---
>
> Key: PARQUET-1967
> URL: https://issues.apache.org/jira/browse/PARQUET-1967
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-mr] dongjoon-hyun commented on pull request #859: PARQUET-1967: Upgrade Zstd-jni to 1.4.8-2



dongjoon-hyun commented on pull request #859:
URL: https://github.com/apache/parquet-mr/pull/859#issuecomment-771263507


   Do I need to rebase this PR to see the green build?
   > Let's wait for a green build.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Updated invitation: Parquet Sync @ Monthly from 9am to 10am on the fourth Tuesday (PST) (dev@parquet.apache.org)

2021-02-01 Thread shangx

BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:REQUEST
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T02
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T02
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20210223T09
DTEND;TZID=America/Los_Angeles:20210223T10
RRULE:FREQ=MONTHLY;BYDAY=4TU
DTSTAMP:20210201T174437Z
ORGANIZER;CN=sha...@uber.com:mailto:sha...@uber.com
UID:1mslidkh8edtiorvhelvtuslp8_r20210223t170...@google.com
ATTENDEE;CUTYPE=RESOURCE;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE;C
 N=SEA | 1191 2nd Ave-8th-Blakely (7) [Zoom];X-NUM-GUESTS=0:mailto:uber.com_
 53454131313931326e6441766530387468426c616b656c793756432d343836313237@resour
 ce.calendar.google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=sha...@uber.com;X-NUM-GUESTS=0:mailto:sha...@uber.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=Matthew Turner;X-NUM-GUESTS=0:mailto:matthew.m.tur...@outlook.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=wesmck...@gmail.com;X-NUM-GUESTS=0:mailto:wesmck...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=gabor.szadovs...@cloudera.com;X-NUM-GUESTS=0:mailto:gabor.szadovszky@cl
 oudera.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=gg5...@gmail.com;X-NUM-GUESTS=0:mailto:gg5...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=gwali...@gmail.com;X-NUM-GUESTS=0:mailto:gwali...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=emkornfi...@gmail.com;X-NUM-GUESTS=0:mailto:emkornfi...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN="Lekshmi Narayanan, Arun Balajiee";X-NUM-GUESTS=0:mailto:arl...@pitt.ed
 u
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=altekruseja...@gmail.com;X-NUM-GUESTS=0:mailto:altekruseja...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=dev@parquet.apache.org;X-NUM-GUESTS=0:mailto:dev@parquet.apache.org
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=Robert Kruszewski;X-NUM-GUESTS=0:mailto:robe...@palantir.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=Ivan Gavryliuk;X-NUM-GUESTS=0:mailto:i...@isolineltd.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Ryan Blue;X-NUM-GUESTS=0:mailto:rb...@netflix.com
X-MICROSOFT-CDO-OWNERAPPTID:-488130593
CREATED:20200210T155820Z
DESCRIPTION:Xinli shang is inviting you to a scheduled Zoom meeting.Join Zoom Meeting - password is requiredhttps://uber.zoom.us/
 j/3523778975" id="ow4314" __is_owner="true">https://uber.zoom.us/j/35237789
 75Meeting ID: 352 377 8975Password: 030115One tap m
 obile+16699006833\,\,3523778975# US (San Jose)+16468769923\,\,35237
 78975# US (New York)Dial by your location\;\;\;&
 nbsp\;\;\;\;\;+1 669 900 6833 US (San Jose)\;&
 nbsp\;\;\;\;\;\;\;+1 646 876 9923 US (New Yor
 k)\;\;\;\;\;\;\;\;877 369 0926 
 US Toll-free\;\;\;\;\;\;\;\;855
  880 1246 US Toll-freeMeeting ID: 352 377 8975Find your local numbe
 r: https://uber.zoom.us/u/aZKZunOZ9;>https://uber.zoom.us/u/aZKZun
 OZ9Join by SIP35237
 78...@zoomcrc.comJoin by H.323162.255.37.11 (US West)16
 2.255.36.11 (US East)221.122.88.195 (China)115.114.131.7 (India Mum
 bai)115.114.115.7 (India Hyderabad)213.19.144.110 (EMEA)103.122
 .166.55 (Australia)209.9.211.110 (Hong Kong)64.211.144.160 (Brazil)
 69.174.57.160 (Canada)207.226.132.110 (Japan)Meeting ID: 352 37
 7 8975\n\n-::~:~::~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~
 :~:~:~:~:~:~:~::~:~::-\nPlease do not edit this section of the description.
 \n\nView your event at https://calendar.google.com/calendar/event?action=VI
 EW=MW1zbGlka2g4ZWR0aW9ydmhlbHZ0dXNscDhfUjIwMjEwMjIzVDE3MDAwMCBkZXZAcGFy
 cXVldC5hcGFjaGUub3Jn=MTUjc2hhbmd4QHViZXIuY29tNWViMDc4NTgwNDE2YTViYzkzOG
 U0NzVhMmJiZjgzZjU4ZDBhYjc4NA=America%2FLos_Angeles=en=0.\n-::~:~:
 :~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~::~
 :~::-
LAST-MODIFIED:20210201T174435Z
LOCATION:https://uber.zoom.us/j/3523778975\, SEA | 1191 2nd Ave-8th-Blakely
  (7) [Zoom]
SEQUENCE:1
STATUS:CONFIRMED
SUMMARY:Parquet Sync 
TRANSP:OPAQUE
END:VEVENT
END:VCALENDAR


invite.ics
Description: application/ics

Updated invitation: Parquet Sync @ Monthly from 9am to 10am on the fourth Tuesday from Tue Jan 26 to Mon Feb 22 (PST) (dev@parquet.apache.org)

2021-02-01 Thread shangx

BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:REQUEST
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T02
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T02
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20210126T09
DTEND;TZID=America/Los_Angeles:20210126T10
RRULE:FREQ=MONTHLY;UNTIL=20210223T075959Z;BYDAY=4TU
DTSTAMP:20210201T174437Z
ORGANIZER;CN=sha...@uber.com:mailto:sha...@uber.com
UID:1mslidkh8edtiorvhelvtuslp8_r20210126t170...@google.com
ATTENDEE;CUTYPE=RESOURCE;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE;C
 N=SEA | 1191 2nd Ave-8th-Blakely (7) [Zoom];X-NUM-GUESTS=0:mailto:uber.com_
 53454131313931326e6441766530387468426c616b656c793756432d343836313237@resour
 ce.calendar.google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=sha...@uber.com;X-NUM-GUESTS=0:mailto:sha...@uber.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=Matthew Turner;X-NUM-GUESTS=0:mailto:matthew.m.tur...@outlook.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=DECLINED;RSVP=TRUE
 ;CN=wesmck...@gmail.com;X-NUM-GUESTS=0:mailto:wesmck...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=gabor.szadovs...@cloudera.com;X-NUM-GUESTS=0:mailto:gabor.szadovszky@cl
 oudera.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=gg5...@gmail.com;X-NUM-GUESTS=0:mailto:gg5...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=gwali...@gmail.com;X-NUM-GUESTS=0:mailto:gwali...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=emkornfi...@gmail.com;X-NUM-GUESTS=0:mailto:emkornfi...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN="Lekshmi Narayanan, Arun Balajiee";X-NUM-GUESTS=0:mailto:arl...@pitt.ed
 u
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=altekruseja...@gmail.com;X-NUM-GUESTS=0:mailto:altekruseja...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=dev@parquet.apache.org;X-NUM-GUESTS=0:mailto:dev@parquet.apache.org
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=TENTATIVE;RSVP=TRU
 E;CN=Robert Kruszewski;X-NUM-GUESTS=0:mailto:robe...@palantir.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=Ivan Gavryliuk;X-NUM-GUESTS=0:mailto:i...@isolineltd.com
X-MICROSOFT-CDO-OWNERAPPTID:-432234048
CREATED:20200210T155820Z
DESCRIPTION:Xinli shang is inviting you to a scheduled Zoom meeting.Join Zoom Meeting - password is requiredhttps://uber.zoom.us/
 j/3523778975" id="ow4314" __is_owner="true">https://uber.zoom.us/j/35237789
 75Meeting ID: 352 377 8975Password: 030115One tap m
 obile+16699006833\,\,3523778975# US (San Jose)+16468769923\,\,35237
 78975# US (New York)Dial by your location\;\;\;&
 nbsp\;\;\;\;\;+1 669 900 6833 US (San Jose)\;&
 nbsp\;\;\;\;\;\;\;+1 646 876 9923 US (New Yor
 k)\;\;\;\;\;\;\;\;877 369 0926 
 US Toll-free\;\;\;\;\;\;\;\;855
  880 1246 US Toll-freeMeeting ID: 352 377 8975Find your local numbe
 r: https://uber.zoom.us/u/aZKZunOZ9;>https://uber.zoom.us/u/aZKZun
 OZ9Join by SIP35237
 78...@zoomcrc.comJoin by H.323162.255.37.11 (US West)16
 2.255.36.11 (US East)221.122.88.195 (China)115.114.131.7 (India Mum
 bai)115.114.115.7 (India Hyderabad)213.19.144.110 (EMEA)103.122
 .166.55 (Australia)209.9.211.110 (Hong Kong)64.211.144.160 (Brazil)
 69.174.57.160 (Canada)207.226.132.110 (Japan)Meeting ID: 352 37
 7 8975\n\n-::~:~::~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~
 :~:~:~:~:~:~:~::~:~::-\nPlease do not edit this section of the description.
 \n\nView your event at https://calendar.google.com/calendar/event?action=VI
 EW=MW1zbGlka2g4ZWR0aW9ydmhlbHZ0dXNscDhfUjIwMjEwMTI2VDE3MDAwMCBkZXZAcGFy
 cXVldC5hcGFjaGUub3Jn=MTUjc2hhbmd4QHViZXIuY29tZGZmZTZlYTJlMGNkOTY5NDc1ND
 g4NjBlYmM2NTg5ZTJiNGI2YWRhYQ=America%2FLos_Angeles=en=0.\n-::~:~:
 :~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~:~::~
 :~::-
LAST-MODIFIED:20210201T174435Z
LOCATION:https://uber.zoom.us/j/3523778975\, SEA | 1191 2nd Ave-8th-Blakely
  (7) [Zoom]
SEQUENCE:1
STATUS:CONFIRMED
SUMMARY:Parquet Sync 
TRANSP:OPAQUE
END:VEVENT
END:VCALENDAR


invite.ics
Description: application/ics

[jira] [Commented] (PARQUET-1950) Define core features / compliance level



[ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276735#comment-17276735
 ] 

ASF GitHub Bot commented on PARQUET-1950:
-

raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827


   @gszadovszky and @emkornfield it's highly coincidental that I was just 
looking into cleaning up apache/arrow#8130 when I noticed this thread.
   External column chunks support is one of the key features that attracted me 
to parquet in the first place and I would like the chance to lobby for keeping 
it and actually expanding its adoption - I already have the complete PR 
mentioned above and I can help with supporting it across other implementations.
   There are a few major domains where I see this as valuable component:
   1. Allowing concurrent read to fully flushed row groups while parquet file 
is still being appended to. A slight variant of this is allowing subsequent row 
group appends to a parquet file without impacting potential readers.
   2. Being able to aggregate multiple data sets in a master parquet file: One 
scenario if cumulative recordings like stock prices that get collected daily 
and need to be presented as one unified historical file, another the case of 
enrichment where we want to add new columns to an existing data set.
   3. Allowing for bi-temporal changes to parquet file: External columns chunks 
allows one to apply small corrections by simply creating delta files and new 
footers that simply swap out the chunks that require changes and point to the 
new ones.
   
   If the above use cases are addressed by other parquet overlays or they don't 
line up with the intended usage of parquet I can look elsewhere but it seems 
like huge opportunity and the development cost for supporting it are quite 
minor by comparison  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Define core features / compliance level
> ---
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Parquet format is getting more and more features while the different 
> implementations cannot keep the pace and left behind with some features 
> implemented and some are not. In many cases it is also not clear if the 
> related feature is mature enough to be used widely or more an experimental 
> one.
> These are huge issues that makes hard ensure interoperability between the 
> different implementations.
> The following idea came up in a 
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
>  Create a now document in the parquet-format repository that lists the "core 
> features". This document is versioned by the parquet-format releases. This 
> way a certain version of "core features" defines a level of compatibility 
> between the different implementations. This version number can be written to 
> a new field (e.g. complianceLevel) in the footer. If an implementation writes 
> a file with a version in the field it must implement all the related "core 
> features" (read and write) and must not use any other features at write 
> because it makes the data unreadable by another implementation if only the 
> same level of "core features" are implemented.
> For example if we have encoding A listed in the version 1 "core features" but 
> encoding B is not then at "complianceLevel = 1" we can use encoding A but we 
> cannot use encoding B because it would make the related data unreadable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features



raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827


   @gszadovszky and @emkornfield it's highly coincidental that I was just 
looking into cleaning up apache/arrow#8130 when I noticed this thread.
   External column chunks support is one of the key features that attracted me 
to parquet in the first place and I would like the chance to lobby for keeping 
it and actually expanding its adoption - I already have the complete PR 
mentioned above and I can help with supporting it across other implementations.
   There are a few major domains where I see this as valuable component:
   1. Allowing concurrent read to fully flushed row groups while parquet file 
is still being appended to. A slight variant of this is allowing subsequent row 
group appends to a parquet file without impacting potential readers.
   2. Being able to aggregate multiple data sets in a master parquet file: One 
scenario if cumulative recordings like stock prices that get collected daily 
and need to be presented as one unified historical file, another the case of 
enrichment where we want to add new columns to an existing data set.
   3. Allowing for bi-temporal changes to parquet file: External columns chunks 
allows one to apply small corrections by simply creating delta files and new 
footers that simply swap out the chunks that require changes and point to the 
new ones.
   
   If the above use cases are addressed by other parquet overlays or they don't 
line up with the intended usage of parquet I can look elsewhere but it seems 
like huge opportunity and the development cost for supporting it are quite 
minor by comparison  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Xinli Shang (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276664#comment-17276664
 ] 

Xinli Shang commented on PARQUET-1968:
--

Sure, will connect with you shortly. 

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1967) Upgrade Zstd-jni to 1.4.8-2



[ 
https://issues.apache.org/jira/browse/PARQUET-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276646#comment-17276646
 ] 

ASF GitHub Bot commented on PARQUET-1967:
-

dongjoon-hyun commented on pull request #859:
URL: https://github.com/apache/parquet-mr/pull/859#issuecomment-771151698


   Thank you for reviews.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade Zstd-jni to 1.4.8-2
> ---
>
> Key: PARQUET-1967
> URL: https://issues.apache.org/jira/browse/PARQUET-1967
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-mr] dongjoon-hyun commented on pull request #859: PARQUET-1967: Upgrade Zstd-jni to 1.4.8-2



dongjoon-hyun commented on pull request #859:
URL: https://github.com/apache/parquet-mr/pull/859#issuecomment-771151698


   Thank you for reviews.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-1950) Define core features / compliance level



[ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276640#comment-17276640
 ] 

ASF GitHub Bot commented on PARQUET-1950:
-

emkornfield commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771141210


   > Because non of the ideas of external column chunks nor the summary files 
were spread across the different implementations (because of the lack of 
specification) I think we should not include the usage of the field file_path 
in this document or even explicitly specify that this field is not supported.
   
   Being explicit seems reasonable to me if others are OK with it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Define core features / compliance level
> ---
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Parquet format is getting more and more features while the different 
> implementations cannot keep the pace and left behind with some features 
> implemented and some are not. In many cases it is also not clear if the 
> related feature is mature enough to be used widely or more an experimental 
> one.
> These are huge issues that makes hard ensure interoperability between the 
> different implementations.
> The following idea came up in a 
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
>  Create a now document in the parquet-format repository that lists the "core 
> features". This document is versioned by the parquet-format releases. This 
> way a certain version of "core features" defines a level of compatibility 
> between the different implementations. This version number can be written to 
> a new field (e.g. complianceLevel) in the footer. If an implementation writes 
> a file with a version in the field it must implement all the related "core 
> features" (read and write) and must not use any other features at write 
> because it makes the data unreadable by another implementation if only the 
> same level of "core features" are implemented.
> For example if we have encoding A listed in the version 1 "core features" but 
> encoding B is not then at "complianceLevel = 1" we can use encoding A but we 
> cannot use encoding B because it would make the related data unreadable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-format] emkornfield commented on pull request #164: PARQUET-1950: Define core features



emkornfield commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771141210


   > Because non of the ideas of external column chunks nor the summary files 
were spread across the different implementations (because of the lack of 
specification) I think we should not include the usage of the field file_path 
in this document or even explicitly specify that this field is not supported.
   
   Being explicit seems reasonable to me if others are OK with it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Ryan Blue (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276548#comment-17276548
 ] 

Ryan Blue commented on PARQUET-1968:


Thank you! I'm not sure why it was no longer on my calendar. I have the invite 
now and I plan to attend the sync on the 23rd. If you'd like, we can also set 
up a time to talk about this integration specifically, since it may take a 
while.

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Xinli Shang (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276533#comment-17276533
 ] 

Xinli Shang commented on PARQUET-1968:
--

Hi [~rdblue]. We didn't discuss it in last week's Parquet sync meeting since 
you were not there.  The next Parquet sync is Feb 23th 9:00am. I just added you 
explicitly with your Netflix email account. Hopefully, you can join. 

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Ryan Blue (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276526#comment-17276526
 ] 

Ryan Blue commented on PARQUET-1968:


I would really like to see a new Parquet API that can support some of the 
additional features we needed for Iceberg. I proposed adopting Iceberg's filter 
expressions a year or two ago, so I'm glad to see that the idea has some 
support from other PMC members. This is one reason why the API is in a separate 
module. I think we were planning to talk about this at the next Parquet sync, 
although I'm not sure when that will be.

FYI [~sha...@uber.com].

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1969) Test by GithubAction



[ 
https://issues.apache.org/jira/browse/PARQUET-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276423#comment-17276423
 ] 

Gabor Szadovszky commented on PARQUET-1969:
---

[~yumwang], maybe it's only me who don't know much about github actions but 
could you please describe why it is better than the already existing 
configuration for Travis?

> Test by GithubAction
> 
>
> Key: PARQUET-1969
> URL: https://issues.apache.org/jira/browse/PARQUET-1969
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate



[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276421#comment-17276421
 ] 

Gabor Szadovszky commented on PARQUET-1968:
---

This one sounds great. Meanwhile, we were talking about the filtering APIs 
between Iceberg and Parquet with [~rdblue]. It seems that Iceberg's API already 
contains this feature and it seems to be more clear and usable than the one 
implemented in Parquet. It might be a good idea to separate this filtering API 
in Iceberg and use/implement it in Parquet. (See 
https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/expressions/Expression.java
 for Iceberg's API.)

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1966) Fix build with JDK11 for JDK8



[ 
https://issues.apache.org/jira/browse/PARQUET-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276413#comment-17276413
 ] 

ASF GitHub Bot commented on PARQUET-1966:
-

gszadovszky commented on pull request #858:
URL: https://github.com/apache/parquet-mr/pull/858#issuecomment-770954301


   Thanks, @dossett. I am not sure about the Travis status either. I've 
restarted the build, let's see what happens.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix build with JDK11 for JDK8
> -
>
> Key: PARQUET-1966
> URL: https://issues.apache.org/jira/browse/PARQUET-1966
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Blocker
>
> However the target is set to 1.8 it seems to be not enough as of building 
> with JDK11 it fails at runtime with the following exception: 
> {code:java}
> ava.lang.NoSuchMethodError: 
> java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
> at 
> org.apache.parquet.bytes.CapacityByteArrayOutputStream.write(CapacityByteArrayOutputStream.java:197)
> at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeOrAppendBitPackedRun(RunLengthBitPackingHybridEncoder.java:193)
> at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeInt(RunLengthBitPackingHybridEncoder.java:179)
> at 
> org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.getBytes(DictionaryValuesWriter.java:167)
> at 
> org.apache.parquet.column.values.fallback.FallbackValuesWriter.getBytes(FallbackValuesWriter.java:74)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:60)
> at 
> org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:387)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:235)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:222)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
> at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:307)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:465)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:148)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
> {code}
> To reproduce execute the following.
> {code}
> export JAVA_HOME={the path to the JDK11 home}
> mvn clean install -Djvm={the path to the JRE8 java executable}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-mr] gszadovszky commented on pull request #858: PARQUET-1966: Fix build with JDK11 for JDK8



gszadovszky commented on pull request #858:
URL: https://github.com/apache/parquet-mr/pull/858#issuecomment-770954301


   Thanks, @dossett. I am not sure about the Travis status either. I've 
restarted the build, let's see what happens.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-1969) Test by GithubAction



[ 
https://issues.apache.org/jira/browse/PARQUET-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276403#comment-17276403
 ] 

ASF GitHub Bot commented on PARQUET-1969:
-

wangyum opened a new pull request #860:
URL: https://github.com/apache/parquet-mr/pull/860


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Test by GithubAction
> 
>
> Key: PARQUET-1969
> URL: https://issues.apache.org/jira/browse/PARQUET-1969
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-mr] wangyum opened a new pull request #860: PARQUET-1969: Test by GithubAction



wangyum opened a new pull request #860:
URL: https://github.com/apache/parquet-mr/pull/860


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Created] (PARQUET-1969) Test by GithubAction

Yuming Wang created PARQUET-1969:


 Summary: Test by GithubAction
 Key: PARQUET-1969
 URL: https://issues.apache.org/jira/browse/PARQUET-1969
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (PARQUET-1968) FilterApi support In predicate

Yuming Wang created PARQUET-1968:


 Summary: FilterApi support In predicate
 Key: PARQUET-1968
 URL: https://issues.apache.org/jira/browse/PARQUET-1968
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.0
Reporter: Yuming Wang


FilterApi should support native In predicate.

Spark:

https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605

Impala:

https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1966) Fix build with JDK11 for JDK8



[ 
https://issues.apache.org/jira/browse/PARQUET-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276323#comment-17276323
 ] 

ASF GitHub Bot commented on PARQUET-1966:
-

dossett commented on pull request #858:
URL: https://github.com/apache/parquet-mr/pull/858#issuecomment-770862284


   That's a nice use of profiles.  I'm (non-binding) +1. I don't see any logs 
for the travis build failures, don't know if they've expired or maybe the tests 
just failed to launch for some reason.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix build with JDK11 for JDK8
> -
>
> Key: PARQUET-1966
> URL: https://issues.apache.org/jira/browse/PARQUET-1966
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Blocker
>
> However the target is set to 1.8 it seems to be not enough as of building 
> with JDK11 it fails at runtime with the following exception: 
> {code:java}
> ava.lang.NoSuchMethodError: 
> java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
> at 
> org.apache.parquet.bytes.CapacityByteArrayOutputStream.write(CapacityByteArrayOutputStream.java:197)
> at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeOrAppendBitPackedRun(RunLengthBitPackingHybridEncoder.java:193)
> at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeInt(RunLengthBitPackingHybridEncoder.java:179)
> at 
> org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.getBytes(DictionaryValuesWriter.java:167)
> at 
> org.apache.parquet.column.values.fallback.FallbackValuesWriter.getBytes(FallbackValuesWriter.java:74)
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:60)
> at 
> org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:387)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:235)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:222)
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
> at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:307)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:465)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:148)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
> {code}
> To reproduce execute the following.
> {code}
> export JAVA_HOME={the path to the JDK11 home}
> mvn clean install -Djvm={the path to the JRE8 java executable}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-mr] dossett commented on pull request #858: PARQUET-1966: Fix build with JDK11 for JDK8



dossett commented on pull request #858:
URL: https://github.com/apache/parquet-mr/pull/858#issuecomment-770862284


   That's a nice use of profiles.  I'm (non-binding) +1. I don't see any logs 
for the travis build failures, don't know if they've expired or maybe the tests 
just failed to launch for some reason.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters



[ 
https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276312#comment-17276312
 ] 

Yuming Wang commented on PARQUET-1805:
--

Thank you [~gszadovszky] [~junjie] This is what I want:
{code:sql}
set parquet.bloom.filter.enabled=false;
set parquet.bloom.filter.enabled#ts=true;
set parquet.bloom.filter.enabled#dec=true;
{code}
Benchmark and benchmark result:
{code:scala}
val numRows = 1024 * 1024 * 15
val df = spark.range(numRows).selectExpr(
  "id",
  "cast(id as string) as s",
  "cast(id as timestamp) as ts",
  "cast(cast(id as timestamp) as date) as td",
  "cast(id as decimal) as dec")
val benchmark = new org.apache.spark.benchmark.Benchmark(
  "Benchmark bloom filter write",
  numRows,
  minNumIters = 5)

benchmark.addCase("default") { _ =>
  withSQLConf() {
df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for ts column") { _ =>
  withSQLConf(
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> 
"false",
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" 
-> "true") {
df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for ts and dec column") { _ =>
  withSQLConf(
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> 
"false",
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" 
-> "true",
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#dec" 
-> "true") {
df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for all column") { _ =>
  withSQLConf(
org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> 
"true") {
df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}
benchmark.run()
{code}
{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark bloom filter write: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative

default5207   5314  
72  3.0 331.1   1.0X
Build bloom filter for ts column   5808   6065 
245  2.7 369.2   0.9X
Build bloom filter for ts and dec column   6685   6776  
79  2.4 425.0   0.8X
Build bloom filter for all column  9077   9889 
629  1.7 577.1   0.6X
{noformat}

cc [~dongjoon]

> Refactor the configuration for bloom filters
> 
>
> Key: PARQUET-1805
> URL: https://issues.apache.org/jira/browse/PARQUET-1805
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Refactor the hadoop configuration for bloom filters according to PARQUET-1784.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Resolved] (PARQUET-1899) [C++] Deprecated ReadBatchSpaced in parquet/column_reader

2021-02-01 Thread Antoine Pitrou (Jira)



 [ 
https://issues.apache.org/jira/browse/PARQUET-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-1899.
-
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 8015
[https://github.com/apache/arrow/pull/8015]

> [C++] Deprecated ReadBatchSpaced in parquet/column_reader
> -
>
> Key: PARQUET-1899
> URL: https://issues.apache.org/jira/browse/PARQUET-1899
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> This method is not used any place outside of unit tests and doesn't space 
> elements properly in the context of deeply nested structures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1950) Define core features / compliance level



[ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276280#comment-17276280
 ] 

ASF GitHub Bot commented on PARQUET-1950:
-

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r567767040



##
File path: CoreFeatures.md
##
@@ -0,0 +1,178 @@
+
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+ Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not 
listed
+here.
+
+ Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is becasue `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* 
[PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both

[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features



gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r567767040



##
File path: CoreFeatures.md
##
@@ -0,0 +1,178 @@
+
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+ Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not 
listed
+here.
+
+ Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is becasue `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* 
[PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+*

[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters



[ 
https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276213#comment-17276213
 ] 

Gabor Szadovszky commented on PARQUET-1805:
---

Oh, I got it, thanks [~junjie]. I've felt it was more logical this way. The 
"major" configuration is for all columns and the "column specific" one is to 
configure otherwise. Since the "major" one is false by default you only need to 
enable the bloom filters for the columns one-by-one. You don't even need to set 
`parquet.bloom.filter.enabled` but the columns specific ones only. We've tried 
to describe this in the 
[README|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md].

> Refactor the configuration for bloom filters
> 
>
> Key: PARQUET-1805
> URL: https://issues.apache.org/jira/browse/PARQUET-1805
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Refactor the hadoop configuration for bloom filters according to PARQUET-1784.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1950) Define core features / compliance level



[ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276204#comment-17276204
 ] 

ASF GitHub Bot commented on PARQUET-1950:
-

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r567688108



##
File path: CoreFeatures.md
##
@@ -0,0 +1,178 @@
+
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+ Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not 
listed
+here.
+
+ Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is becasue `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* 
[PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both

[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features



gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r567688108



##
File path: CoreFeatures.md
##
@@ -0,0 +1,178 @@
+
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+ Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not 
listed
+here.
+
+ Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is becasue `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* 
[PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN

Review comment:
   The parts describing how the parquet-mr implementations work were not 
meant to be part of the final document. As I don't know too much about other 
implementations

[jira] [Comment Edited] (PARQUET-1805) Refactor the configuration for bloom filters

2021-02-01 Thread Junjie Chen (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276198#comment-17276198
 ] 

Junjie Chen edited comment on PARQUET-1805 at 2/1/21, 9:43 AM:
---

I think what [~yumwang] concern is we enable all columns' bloom filter when 
{{parquet.bloom.filter.enabled}} is set to true. That behaviour is a bit odd 
consider if we have a table with a heap of columns. We could change to use 
{{parquet.bloom.filter.enabled#column.path}} to enable the bloom filter for the 
specific column after setting {{parquet.bloom.filter.enabled}}.


was (Author: junjie):
I think what [~yumwang] concern is we enable all columns' bloom filter when 
{{parquet.bloom.filter.enabled}} is set to true. That behaviour is a bit odd, 
we could change to use {{parquet.bloom.filter.enabled#column.path}} to enable 
the bloom filter for the specific column after setting 
{{parquet.bloom.filter.enabled}}.

> Refactor the configuration for bloom filters
> 
>
> Key: PARQUET-1805
> URL: https://issues.apache.org/jira/browse/PARQUET-1805
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Refactor the hadoop configuration for bloom filters according to PARQUET-1784.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters

2021-02-01 Thread Junjie Chen (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276198#comment-17276198
 ] 

Junjie Chen commented on PARQUET-1805:
--

I think what [~yumwang] concern is we enable all columns' bloom filter when 
{{parquet.bloom.filter.enabled}} is set to true. That behaviour is a bit odd, 
we could change to use {{parquet.bloom.filter.enabled#column.path}} to enable 
the bloom filter for the specific column after setting 
{{parquet.bloom.filter.enabled}}.

> Refactor the configuration for bloom filters
> 
>
> Key: PARQUET-1805
> URL: https://issues.apache.org/jira/browse/PARQUET-1805
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Refactor the hadoop configuration for bloom filters according to PARQUET-1784.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1950) Define core features / compliance level



[ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276187#comment-17276187
 ] 

ASF GitHub Bot commented on PARQUET-1950:
-

gszadovszky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-770716726


   @emkornfield, in parquet-mr there was another reason to use the `file_path` 
in the footer. The feature is called _summary files_. The idea was to have a 
separate file containing a summarized footer of several parquet files so you 
might do filtering and pruning without even checking a file's own footer. As 
far as I know this implementation exists in parquet-mr only and there are no 
specification for it in parquet-format.
   This feature is more or less abandoned meaning during the development of 
some newer features (e.g. column indexes, bloom filters) the related parts 
might not updated properly. There were a couple of discussions about this topic 
in the dev list: 
[here](https://lists.apache.org/thread.html/fb232d024d3ca0f3900b76fb884b55fad11dffafb182d6f336b37a69%40%3Cdev.parquet.apache.org%3E)
 and 
[here](https://lists.apache.org/thread.html/r2e539c50c1cc818304de2b7dc28a4109aaa529955a42664e3073f811%40%3Cdev.parquet.apache.org%3E).
   
   Because non of the ideas of _external column chunks_ nor the _summary files_ 
were spread across the different implementations (because of the lack of 
specification) I think we should not include the usage of the field `file_path` 
in this document or even explicitly specify that this field is not supported.
   
   I am open to specify such features properly and after the required 
demonstration we may include them in a later version of the core features. 
However, I think these requirements (e.g. snapshot API, summary files) are not 
necessarily needed by all of our clients or already implemented in some ways 
(e.g. storing statistics in HMS, Iceberg).
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Define core features / compliance level
> ---
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Parquet format is getting more and more features while the different 
> implementations cannot keep the pace and left behind with some features 
> implemented and some are not. In many cases it is also not clear if the 
> related feature is mature enough to be used widely or more an experimental 
> one.
> These are huge issues that makes hard ensure interoperability between the 
> different implementations.
> The following idea came up in a 
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
>  Create a now document in the parquet-format repository that lists the "core 
> features". This document is versioned by the parquet-format releases. This 
> way a certain version of "core features" defines a level of compatibility 
> between the different implementations. This version number can be written to 
> a new field (e.g. complianceLevel) in the footer. If an implementation writes 
> a file with a version in the field it must implement all the related "core 
> features" (read and write) and must not use any other features at write 
> because it makes the data unreadable by another implementation if only the 
> same level of "core features" are implemented.
> For example if we have encoding A listed in the version 1 "core features" but 
> encoding B is not then at "complianceLevel = 1" we can use encoding A but we 
> cannot use encoding B because it would make the related data unreadable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [parquet-format] gszadovszky commented on pull request #164: PARQUET-1950: Define core features

gszadovszky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-770716726

@emkornfield, in parquet-mr there was another reason to use the `file_path`
in the footer. The feature is called _summary files_. The idea was to have a
separate file containing a summarized footer of several parquet files so you
might do filtering and pruning without even checking a file's own footer. As
far as I know this implementation exists in parquet-mr only and there are no
specification for it in parquet-format.
This feature is more or less abandoned meaning during the development of
some newer features (e.g. column indexes, bloom filters) the related parts
might not updated properly. There were a couple of discussions about this topic
in the dev list:
[here](https://lists.apache.org/thread.html/fb232d024d3ca0f3900b76fb884b55fad11dffafb182d6f336b37a69%40%3Cdev.parquet.apache.org%3E)
and
[here](https://lists.apache.org/thread.html/r2e539c50c1cc818304de2b7dc28a4109aaa529955a42664e3073f811%40%3Cdev.parquet.apache.org%3E).

Because non of the ideas of _external column chunks_ nor the _summary files_
were spread across the different implementations (because of the lack of
specification) I think we should not include the usage of the field `file_path`
in this document or even explicitly specify that this field is not supported.

I am open to specify such features properly and after the required
demonstration we may include them in a later version of the core features.
However, I think these requirements (e.g. snapshot API, summary files) are not
necessarily needed by all of our clients or already implemented in some ways
(e.g. storing statistics in HMS, Iceberg).

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters