[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-31 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853433#comment-16853433
 ] 

Wes McKinney commented on ARROW-1983:
-

Yes. I don't think it is necessary to resolve all of this in a single patch, so 
we can open a follow-up JIRA to implement the optimization to read a row group 
given a _metadata file. There is some other complexity there, such as how to 
open the file path (you need a FileSystem handle; see the filesystem API work 
that is in progress).

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-31 Thread Rick Zamora (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853430#comment-16853430
 ] 

Rick Zamora commented on ARROW-1983:


Right, I see what you are saying. You can pass a list of files to 
pq.ParquetDataset (obtained by calling read_metadata on the metadata file), but 
the footer metadata will be unnecessarily parsed a second time. For dask, this 
is probably not much of an issue, because each worker will only be dealing with 
a subset of the global dataset. In many other cases this is clearly 
undesirable.

 



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-31 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853383#comment-16853383
 ] 

Wes McKinney commented on ARROW-1983:
-

Well, one issue is how to use the _metadata file to read data from the files it 
lists without having to parse those files' respective metadata again. I 
think this may require a little bit of refactoring in the Parquet C++ library.



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-31 Thread Rick Zamora (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853378#comment-16853378
 ] 

Rick Zamora commented on ARROW-1983:


I submitted a PR to perform the metadata aggregation and metadata-only file 
write ([https://github.com/apache/arrow/pull/4405]). I just synchronized with 
the master branch, so hopefully I can address any suggestions/concerns people 
have relatively quickly.

Are there any additional features that we need for "utilizing" the metadata 
file within pyarrow.parquet itself? I believe the existing read_metadata function 
should be sufficient for the needs of dask.



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-27 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849187#comment-16849187
 ] 

Joris Van den Bossche commented on ARROW-1983:
--

I think so, yes (at least when reading, it returns a single FileMetadata 
instance with all row groups).

Besides the "append" operation, we also need a "write" method for such a 
FileMetadata instance (I suppose this only needs some work on the python/cython 
side, since this is just writing a parquet file without actual data, although 
I didn't check the C++ side). There is currently a {{write_metadata}}, but that 
requires an *arrow* schema, and not a *parquet* schema. 
Regarding the public API, I suppose we can modify {{write_metadata}} to also 
accept a parquet schema, so that we don't have to add an extra function. That 
will need some changes under the hood in {{ParquetWriter}} to be able to accept 
a given FileMetadata object.



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-27 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849176#comment-16849176
 ] 

Wes McKinney commented on ARROW-1983:
-

I see, that was not clear to me. So what we actually need is a 
{{FileMetaData::AppendRowGroups}} method (that merges one metadata into 
another), is that correct?



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-27 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849118#comment-16849118
 ] 

Joris Van den Bossche commented on ARROW-1983:
--

{quote}Correspondingly, please also write a function that parses a 
multiple-metadata file like{quote}

We already have {{read_metadata}} (on the python side, it actually uses the 
normal parquet file reading), that does read such {{_metadata}} files. Are you 
still looking for something else?

To my understanding, it is not really a "multiple-metadata file", but a file 
with a single FileMetadata where the row groups of all metadata objects are 
combined.
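
To make the "single FileMetadata with combined row groups" idea concrete, here 
is a minimal sketch in plain Python. The classes {{ToyRowGroup}} and 
{{ToyFileMetaData}} and the helper {{combine}} are hypothetical stand-ins 
invented for illustration; they are not the pyarrow or parquet-cpp API, and the 
schema check is an assumption about what a merge would reasonably require.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyRowGroup:
    # Hypothetical stand-in for one row group entry in a footer.
    file_path: str
    num_rows: int


@dataclass
class ToyFileMetaData:
    # Hypothetical stand-in for a file footer: one schema plus row groups.
    schema: Tuple[str, ...]
    row_groups: List[ToyRowGroup] = field(default_factory=list)

    def append_row_groups(self, other: "ToyFileMetaData") -> None:
        """Merge the row groups of `other` into this object, keeping order."""
        if other.schema != self.schema:
            raise ValueError("schemas differ; cannot merge row groups")
        self.row_groups.extend(other.row_groups)


def combine(pieces: List[ToyFileMetaData]) -> ToyFileMetaData:
    """Build one metadata object holding the row groups of all pieces."""
    if not pieces:
        raise ValueError("need at least one metadata object")
    merged = ToyFileMetaData(schema=pieces[0].schema)
    for piece in pieces:
        merged.append_row_groups(piece)
    return merged


# Two dataset pieces written separately, then merged for a _metadata file.
piece_a = ToyFileMetaData(("x", "y"), [ToyRowGroup("part.0.parquet", 10)])
piece_b = ToyFileMetaData(
    ("x", "y"),
    [ToyRowGroup("part.1.parquet", 20), ToyRowGroup("part.1.parquet", 5)],
)
merged = combine([piece_a, piece_b])
```

The merged object still has exactly one schema, which is why this is not a 
"multiple-metadata file": only the list of row groups grows.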



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-27 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849106#comment-16849106
 ] 

Wes McKinney commented on ARROW-1983:
-

Probably the most flexible thing for writing would be a function that appends a 
metadata to an OutputStream

{code:java}
Status AppendFileMetaData(const FileMetaData& metadata,
                          arrow::io::OutputStream* out);
{code}

Correspondingly, please also write a function that parses a multiple-metadata 
file like

{code}
Status ParseMetaDataFile(arrow::io::InputStream* input,
                         std::vector<std::shared_ptr<FileMetaData>>* out);
{code}

Lastly, AFAIK we are not able to instantiate {{ParquetFileReader}} given 
previously-read {{FileMetaData}} -- probably can do that in a separate JIRA



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-16 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841668#comment-16841668
 ] 

Joris Van den Bossche commented on ARROW-1983:
--

> We also need to make sure that the file path is being set in the metadata, 
> otherwise the {{_metadata}} file is useless

Indeed, I opened ARROW-5349 for that earlier today. I assume this is separate 
from the {{WriteMultipleMetadata}} you mention above, as it is the user (e.g. 
Dask) who knows where the files that correspond to the different metadata 
objects are put?
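
As an illustration of why the caller has to supply the path, here is a small 
sketch of how a writer such as Dask might derive the value to store for each 
piece. It assumes the Parquet convention that a column chunk's file_path is 
relative to the file holding the metadata; the function name 
{{relative_file_path}} and the example paths are hypothetical.

```python
import posixpath


def relative_file_path(dataset_root: str, data_file: str) -> str:
    """Return data_file relative to the directory that holds _metadata."""
    rel = posixpath.relpath(data_file, start=dataset_root)
    # Refuse files outside the dataset directory: a reader resolving the
    # path against the location of _metadata could not be expected to find them.
    if rel == ".." or rel.startswith("../"):
        raise ValueError("data file is outside the dataset root")
    return rel


# Hypothetical layout: _metadata lives directly in /data/ds.
example = relative_file_path("/data/ds", "/data/ds/year=2019/part.0.parquet")
```

Only the code that chose the output location can compute this value, which is 
why it cannot be filled in by the function that merges the metadata objects.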





[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-16 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841625#comment-16841625
 ] 

Wes McKinney commented on ARROW-1983:
-

Right. This is a relatively straightforward C++ function to write – Pearu 
actually already had partially implemented it in one of the patch iterations. 
The API would be something like
{code:java}
Status WriteMultipleMetadata(
    const std::vector<std::shared_ptr<FileMetaData>>& metadatas,
    arrow::io::OutputStream* out);
{code}
Does someone want to write it (I mean, I can do it, but it would be good for 
other people to get some experience with the Parquet codebase)? We also need to 
make sure that the file path is being set in the metadata, otherwise the 
{{_metadata}} file is useless



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-16 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841424#comment-16841424
 ] 

Joris Van den Bossche commented on ARROW-1983:
--

> what would be the desired interface for writing and reading the dataset 
> metadata instances?

For reading, I think the existing {{parquet.read_metadata}} is sufficient? 
(that already works with such metadata files generated by other libraries).

For writing, the current {{parquet.write_metadata}} expects a pyarrow schema, 
so for that we indeed need a new method, or adapt the existing one to also 
accept a parquet FileMetaData object.

 

But for writing, I think the main part that is missing is a way to combine the 
list of metadata objects into a single FileMetaData object?

 



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-16 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841417#comment-16841417
 ] 

Joris Van den Bossche commented on ARROW-1983:
--

Copying the questions that [~pearu] mentioned above / asked on github here:

{quote}2. After collecting all file metadata of dataset pieces (possibly from 
different dask processes), what would be the desired interface for writing and 
reading the dataset metadata instances? Here's an initial proposal for Python:

{code:python}
def write_dataset_metadata(metadata_list, where):
    ...

def read_dataset_metadata(where, memory_map=False):
    ...
    return metadata_list
{code}

3. Clearly, each file metadata instance in `metadata_list` repeats information 
that is common to all metadata instances. Would it make sense, and be useful, to 
introduce a new `DatasetMetadata` class (in C++, also exposed to Python) that 
would collect the metadata information of dataset pieces in a more compact way 
and would also provide I/O methods? (This also addresses question 
2.){quote}





[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-04 Thread Pearu Peterson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833141#comment-16833141
 ] 

Pearu Peterson commented on ARROW-1983:
---

For this issue, the questions raised in 
https://github.com/apache/arrow/pull/4236#issuecomment-489017226 need to be 
addressed.



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-04 Thread Pearu Peterson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833140#comment-16833140
 ] 

Pearu Peterson commented on ARROW-1983:
---

ARROW-5258 provides a way to collect the file metadata objects created by the 
write_to_dataset function.



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-01 Thread Pearu Peterson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831195#comment-16831195
 ] 

Pearu Peterson commented on ARROW-1983:
---

Arrow [PR 4236|https://github.com/apache/arrow/pull/4236] implements another 
approach for collecting the file metadata instances of dataset pieces: 
`write_to_dataset` can take a `metadata_collector` keyword argument, a list that 
will be filled with the file metadata instances.



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-25 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826179#comment-16826179
 ] 

Martin Durant commented on ARROW-1983:
--

I don't know about deprecated, and I wouldn't hold my breath over a Parquet 
2.0... the metadata file is very useful in the context of Dask, so that we can 
know what files exist in the dataset without doing a glob, and we can do simple 
filtering on max/min values without having to touch every file, which can be 
very slow on remote storage.
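
The filtering described above can be sketched in a few lines of plain Python. 
This is only a toy model: {{RowGroupStats}} and {{files_to_read}} are 
hypothetical names, and the single integer min/max pair stands in for the 
per-column statistics a real _metadata file would carry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    # Hypothetical stand-in for one row group's entry in a _metadata file.
    file_path: str
    min_value: int
    max_value: int


def files_to_read(row_groups: List[RowGroupStats], lo: int, hi: int) -> List[str]:
    """Return the files that may hold values in the closed range [lo, hi]."""
    selected: List[str] = []
    for rg in row_groups:
        # A row group can be skipped when its [min, max] range misses [lo, hi].
        overlaps = rg.max_value >= lo and rg.min_value <= hi
        if overlaps and rg.file_path not in selected:
            selected.append(rg.file_path)
    return selected


row_groups = [
    RowGroupStats("part.0.parquet", 0, 99),
    RowGroupStats("part.1.parquet", 100, 199),
    RowGroupStats("part.1.parquet", 200, 299),
    RowGroupStats("part.2.parquet", 300, 399),
]
needed = files_to_read(row_groups, 150, 250)
```

The decision is made from the summary alone, so no data file has to be opened 
(and no directory listing is needed) before the selection is known.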

The spec certainly allows for a metadata file even if it could be seen as more 
of a convention from Hive: a column chunk allows for a file_path attribute, 
indicating that the data itself is in another file (which, actually, doesn't 
need to be a proper parquet file, but in practice always is).

You could indeed construct the metadata by repeatedly appending to a file, 
since you will know that everything in the file is the thrift footer block 
except for the PAR1 bytes. However, you will not preserve row-group order in 
general, which might be important, and writes had better be atomic. 
Furthermore, remote storage will usually *not allow* small appends from 
multiple processes in this way.
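
For reference, the layout being relied on here is: the file starts and ends 
with the 4-byte magic PAR1, and the last 8 bytes are a little-endian 32-bit 
footer length followed by the magic. A minimal sketch for locating the footer 
in an in-memory file image follows; {{footer_span}} is a hypothetical helper, 
and the footer bytes in the example are placeholders, not valid Thrift.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"


def footer_span(data: bytes) -> Tuple[int, int]:
    """Return (start, end) offsets of the Thrift footer in a Parquet file image."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file image")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


# A metadata-only file image: magic, footer, footer length, magic.
placeholder_footer = b"\x00" * 10  # stands in for Thrift-encoded FileMetaData
image = MAGIC + placeholder_footer + struct.pack("<I", len(placeholder_footer)) + MAGIC
span = footer_span(image)
```

In a metadata-only file the footer starts right after the leading magic, which 
is the property the append idea depends on.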



[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-25 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826178#comment-16826178
 ] 

Martin Durant commented on ARROW-1983:
--

I don't know about deprecated, and I wouldn't hold my breath over a Parquet 
2.0... the metadata file is very useful in the context of Dask, so that we can 
know what files exist in the dataset without doing a glob, and we can do simple 
filtering on max/min values without having to touch every file, which can be 
very slow on remote storage.

The spec certainly allows for a metadata file even if it could be seen as more 
of a convention from Hive: a column chunk allows for a file_path attribute, 
indicating that the data itself is in another file (which, actually, doesn't 
need to be a proper parquet file, but in practice always is).

You could indeed construct the metadata by repeatedly appending to a file, 
since you will know that everything in the file is the thrift footer block 
except for the PAR1 bytes. However, you will not preserve row-group order in 
general, which might be important, and writes had better be atomic. 
Furthermore, remote storage will usually *not allow* small appends from 
multiple processes in this way.
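
For reference, a sketch of the byte framing this relies on, assuming the layout from the parquet-format specification: the 4-byte magic PAR1, then (in a metadata-only file) the serialized footer, then the footer length as a 4-byte little-endian integer, then PAR1 again. The footer is treated as opaque bytes here, and the helper names `frame_metadata_only` and `extract_footer` are made up for the illustration.

```python
import struct

MAGIC = b"PAR1"


def frame_metadata_only(footer: bytes) -> bytes:
    """Wrap an already-serialized footer as a file that holds no data pages."""
    return MAGIC + footer + struct.pack("<I", len(footer)) + MAGIC


def extract_footer(buf: bytes) -> bytes:
    """Recover the footer bytes from a Parquet-framed buffer."""
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not a Parquet-framed buffer")
    (length,) = struct.unpack("<I", buf[-8:-4])
    start = len(buf) - 8 - length
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return buf[start:len(buf) - 8]


# Placeholder bytes standing in for a thrift-serialized FileMetaData.
footer = b"\x15\x02\x19\x00"
framed = frame_metadata_only(footer)
recovered = extract_footer(framed)
```

With no data pages in the file, the footer starts directly after the leading 4 bytes, which is the property the comment above refers to.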

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-25 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826171#comment-16826171
 ] 

Martin Durant commented on ARROW-1983:
--

I don't know about "deprecated" (or whether a Parquet 2.0 is really a thing), 
but...
It is definitely useful for the Dask case of wanting to know what files exist 
and the max/mins for basic filtering, without having to touch all the files 
before making that decision.

The metadata file was indeed taken on by Hive, and could be seen as a 
convention, but it is certainly allowed by the spec since the column_chunk 
structure allows for a path, and so for a row-group to have its data in another 
file. You can indeed just append to a file as you go, since in the case that 
there is no actual data, you know exactly how much space is taken up before the 
start of the thrift block (4 bytes: PAR1). However, the writes must be atomic, 
and will in general *not work* for remote storage, where small appends don't 
happen as you might assume.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823112#comment-16823112
 ] 

Wes McKinney commented on ARROW-1983:
-

I'm just returning from vacation and catching up on e-mail etc. This effort is 
a priority for me, so I will review the discussion and PR and give feedback as 
soon as I can.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-17 Thread Pearu Peterson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820504#comment-16820504
 ] 

Pearu Peterson commented on ARROW-1983:
---

Arrow [PR 4166|https://github.com/apache/arrow/pull/4166] implements 
approach 1 above.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-17 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820090#comment-16820090
 ] 

Martin Durant commented on ARROW-1983:
--

> If readers would be able to use metadata from a separate file (not sure if 
> parquet format would allow it), duplicating metadata storage in both 
> approaches could be avoided.

Yes, this is exactly what should happen: the metadata file contains all of the 
metadata (with paths pointing to the actual data files) and the data files 
contain the metadata with only their own row-groups specified. Each row group 
def appears thus in two places, and the schema is repeated many times. This 
duplication is desirable to allow viewing the data as a complete set, or each 
file independently.

If reading via the separate metadata file, the reader does not need to touch 
the footers of the data files at all, since it already has everything it needs.
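
A small sketch of the reader side under this layout, using a toy model and not a real Parquet reader. `RowGroupEntry`, `files_in_dataset` and `per_file_view` are hypothetical names; each entry stands for a row group in the combined metadata together with the path of the data file that holds it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RowGroupEntry:
    file_path: str  # data file that holds this row group
    num_rows: int


def files_in_dataset(combined: List[RowGroupEntry]) -> List[str]:
    """List the data files in first-seen order, with no directory listing."""
    seen: Dict[str, None] = {}
    for rg in combined:
        seen.setdefault(rg.file_path, None)
    return list(seen)


def per_file_view(combined: List[RowGroupEntry]) -> Dict[str, List[RowGroupEntry]]:
    """Group the combined row groups by file, mirroring each file's own footer."""
    view = defaultdict(list)
    for rg in combined:
        view[rg.file_path].append(rg)
    return dict(view)


combined = [
    RowGroupEntry("a.parquet", 100),
    RowGroupEntry("a.parquet", 50),
    RowGroupEntry("b.parquet", 70),
]
files = files_in_dataset(combined)
rows_per_file = {path: sum(rg.num_rows for rg in rgs)
                 for path, rgs in per_file_view(combined).items()}
```

The per-file view is the same information each data file carries in its own footer, which is the duplication described above.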

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.14.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-17 Thread Pearu Peterson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820084#comment-16820084
 ] 

Pearu Peterson commented on ARROW-1983:
---

There seem to be two options to write a separate metadata file from Arrow:
 # Following Wes's comment "On the C++ side we would expose an API to append row 
group metadata into a common file.", introduce the second sink argument to 
ParquetFileWriter::Open that will be used for collecting FileMetaData content 
during the writing. [~wesmckinn], can you confirm that this would be the right 
approach?
 # Introduce a flag to ParquetFileWriter that when enabled will cause skipping 
all data writes and would write only FileMetaData content. As a result, one 
would need to call the dataset write twice, one for writing data (and metadata) 
as currently, and the second time for writing metadata-only (writes would be 
collected to a single file).

Comparing the two approaches, approach 2 is simpler but suboptimal as the 
writing process is executed twice. In both cases, the metadata would have 
duplicated storage (in data files as currently, and in the separate metadata 
file). If readers would be able to use metadata from a separate file (not sure 
if parquet format would allow it), duplicating metadata storage in both 
approaches could be avoided.
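
To make approach 1 concrete, here is a toy Python sketch of a writer that takes a second sink and records each row group's metadata there while writing data. It does not reflect the actual ParquetFileWriter interface: `ToyWriter`, `RowGroupMeta` and the list used as the metadata sink are all assumptions for illustration.

```python
import io
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RowGroupMeta:
    file_path: str
    num_rows: int
    offset: int   # byte offset of the row group inside its data file
    nbytes: int


@dataclass
class ToyWriter:
    path: str
    data_sink: io.BytesIO
    metadata_sink: Optional[List[RowGroupMeta]] = None  # shared across writers
    own_row_groups: List[RowGroupMeta] = field(default_factory=list)

    def write_row_group(self, payload: bytes, num_rows: int) -> RowGroupMeta:
        """Write data once; record its metadata for the file and the shared sink."""
        offset = self.data_sink.tell()
        self.data_sink.write(payload)
        meta = RowGroupMeta(self.path, num_rows, offset, len(payload))
        self.own_row_groups.append(meta)       # goes into this file's footer
        if self.metadata_sink is not None:
            self.metadata_sink.append(meta)    # goes into the common file
        return meta


shared: List[RowGroupMeta] = []
w0 = ToyWriter("part.0.parquet", io.BytesIO(), shared)
w1 = ToyWriter("part.1.parquet", io.BytesIO(), shared)
w0.write_row_group(b"aaaa", num_rows=4)
w0.write_row_group(b"bb", num_rows=2)
w1.write_row_group(b"ccc", num_rows=3)
```

The data is written a single time, and the shared list ends up with every row group in write order, which is what a separate metadata file would be built from.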

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.14.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-15 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818370#comment-16818370
 ] 

Martin Durant commented on ARROW-1983:
--

> Note that the Parquet format has three different metadata structures

No, this is incorrect; unfortunately the term "metadata" is used with multiple 
meanings here. 

- All parquet files contain FileMetaData in the file footer, which may include 
one or more key-value pairs, and includes other important things like the schema
- If the file contains any row-groups or references to row-groups in other 
files, it will also contain ColumnMetaData (and possibly more key-value pairs); 
this is all *within* the FileMetaData structure
- the special file `_metadata` may exist, which contains *only* FileMetaData, 
and any row-groups have only links to other files and no data within the file.
- the special file `_common_metadata` may exist, which also only contains a 
FileMetaData structure, but has no row group components at all. 
- ordinary data files should have the same common metadata (schema, 
key-values), so you can load any one of them, but they contain only the 
row-groups of that one file and no links to any others.
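
The distinctions in the list above can be sketched as a small classifier over a toy footer model. `Footer`, `RowGroup` and `classify` are hypothetical names, and the model is simplified: in the actual format the file path is recorded on each column chunk, not on the row group.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RowGroup:
    file_path: Optional[str] = None  # None: the data lives in this same file


@dataclass
class Footer:
    schema: Tuple[str, ...]
    row_groups: List[RowGroup] = field(default_factory=list)


def classify(footer: Footer) -> str:
    """Name the kind of file a footer belongs to, following the list above."""
    if not footer.row_groups:
        return "_common_metadata"   # schema and key-values only
    if all(rg.file_path is not None for rg in footer.row_groups):
        return "_metadata"          # row groups are only links to other files
    if all(rg.file_path is None for rg in footer.row_groups):
        return "data file"          # row groups of this one file, no links
    return "mixed"


schema = ("x", "y")
kinds = [
    classify(Footer(schema)),
    classify(Footer(schema, [RowGroup("part.0.parquet"), RowGroup("part.1.parquet")])),
    classify(Footer(schema, [RowGroup(), RowGroup()])),
]
```

All three kinds carry the same structure with the same schema; they differ only in which row groups are listed and where the data is.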

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.14.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-15 Thread Pearu Peterson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818343#comment-16818343
 ] 

Pearu Peterson commented on ARROW-1983:
---

Note that the Parquet format has three different metadata structures, see 
[https://github.com/apache/parquet-format#metadata] .

The "_metadata" corresponds to `FileMetaData.key_value_metadata` (in the 
parquet-format specification) while the "statistics" (which are of interest to 
Dask, if I understand it correctly) correspond to 
`ColumnMetaData.key_value_metadata`.
Yes, Arrow can read all this information and more. My basic questions are:
 # What information needs to be collected? Note that some information is 
internal to parquet files and would never be needed, hence it would just be a 
waste of space to collect it, especially when the Datasets become huge (as 
would be expected in Dask applications).
 # Where should this information be gathered for easy and efficient access?

 

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.14.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-15 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818334#comment-16818334
 ] 

Martin Durant commented on ARROW-1983:
--

A convention, yes, but not in the parquet standard as such. I believe spark 
might have started this, although producing them is now optional. You 
typically have `_metadata`, containing schema, references to all the row-groups 
and information about them such as column statistics, and `_common_metadata`, 
which contains only the schema (and is therefore much smaller).

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.14.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-15 Thread Matthew Rocklin (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818319#comment-16818319
 ] 

Matthew Rocklin commented on ARROW-1983:


My understanding is that there is already a standard around using a "_metadata" 
file that presumably is expected to have certain data laid out in a certain 
way.  It may be that [~mdurant] can provide a nice reference to the 
expectations.

It also looks like PyArrow has a nice reader for this information.  If I open 
up a Parquet Dataset that has a `_metadata` file I find that my object has all 
of the right information, so that might also be a good place to look.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.14.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-04-14 Thread Pearu Peterson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817308#comment-16817308
 ] 

Pearu Peterson commented on ARROW-1983:
---

Currently, ParquetDataset metadata has the following approximate data structure 
(type-specs are shown only for the relevant attributes):
{noformat}
ParquetDataset:
  list pieces
  list paths
  fs
  common_metadata, common_metadata_path
  metadata, metadata_path

ParquetDatasetPiece:
  str path
  get_metadata() -> FileMetaData
  partition_keys

FileMetaData:
  list row_groups
  ParquetSchema schema
  dict metadata = {b‘pandas’: }
  int num_rows, num_columns
  str format_version, created_by

RowGroupMetaData:
  list columns
  int num_rows, total_byte_size

ColumnChunkMetaData:
  str physical_type, encodings, path_in_schema, compression
  int num_values, total_uncompressed_size, total_compressed_size, 
data_page_offset, index_page_offset, dictionary_page_offset
  RowGroupStatistics statistics

RowGroupStatistics:
  bool has_min_max
  int min, max, null_count, distinct_count, num_values
  str physical_type{noformat}
If only the data in RowGroupStatistics is relevant for this issue (please 
confirm), then the statistics data could be collected into a single Parquet 
file, say `_statistics`, containing the following columns:
{noformat}
, , , {noformat}
[~mrocklin], would the information in `_statistics` be sufficient for Dask needs?
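
One way to picture such a flattened table is sketched below with plain Python structures. The row layout (path, row-group index, column, min, max, null count) is an assumption chosen for this illustration and is not the column list from the comment above. The field names `has_min_max`, `min`, `max` and `null_count` follow the RowGroupStatistics listing, and `flatten_statistics` is a hypothetical helper.

```python
from typing import Dict, Iterator, List, Tuple

# dataset[path] is a list of row groups; each row group maps a column name
# to a dict of statistics with the fields shown in RowGroupStatistics.
Dataset = Dict[str, List[Dict[str, dict]]]


def flatten_statistics(dataset: Dataset) -> Iterator[Tuple]:
    """Yield one row per (file, row group, column) that has min/max statistics."""
    for path in sorted(dataset):
        for rg_index, row_group in enumerate(dataset[path]):
            for column in sorted(row_group):
                s = row_group[column]
                if not s.get("has_min_max", False):
                    continue  # nothing usable for filtering
                yield (path, rg_index, column, s["min"], s["max"], s.get("null_count"))


dataset = {
    "part.1.parquet": [{"x": {"has_min_max": True, "min": 10, "max": 19, "null_count": 0}}],
    "part.0.parquet": [
        {"x": {"has_min_max": True, "min": 0, "max": 9, "null_count": 1},
         "y": {"has_min_max": False}},
    ],
}
rows = list(flatten_statistics(dataset))
```

Each row is small and fixed-width, so the table stays compact even when the per-file footers hold much more detail.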

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.14.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-03-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792681#comment-16792681
 ] 

Wes McKinney commented on ARROW-1983:
-

This will have to get done for 0.14. We're basically out of time for 0.13 and 
only fixing bugs now

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robbie Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.14.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-03-07 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787367#comment-16787367
 ] 

Wes McKinney commented on ARROW-1983:
-

No timeline. I'm not sure who is going to do the work; I will not be able to in 
time for 0.13

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robbie Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.13.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-03-01 Thread Matthew Rocklin (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782163#comment-16782163
 ] 

Matthew Rocklin commented on ARROW-1983:


Hi all, thought I would check in here.  I'll likely start planning work around 
Dask Parquet reader/writer functionality soon, and am curious if there is any 
timeline on this issue.  "Nope" is a totally fine answer, just looking for 
information for planning purposes. 

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robbie Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.13.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-01-24 Thread Matthew Rocklin (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751500#comment-16751500
 ] 

Matthew Rocklin commented on ARROW-1983:


In https://github.com/dask/dask/issues/4410 we learn that metadata information 
can grow to be large in the case where there are many columns and many 
partitions.  There is some value to ensuring that the metadata results are 
somewhat compact in memory, though I also wouldn't spend a ton of effort 
optimizing here.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robert Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.13.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-12-31 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731472#comment-16731472
 ] 

Wes McKinney commented on ARROW-1983:
-

Velocity on these things should pick up in 2019 since the Ursa Labs team is 
growing. The "Arrow Dataset" project extends beyond Parquet (where Parquet is 
one storage format). Ideally this work will happen in Q1 2019. Handling the 
"_metadata" file is lower-hanging fruit, so that can likely get done a lot sooner.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robert Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.13.0
>
>


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-12-31 Thread Matthew Rocklin (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731467#comment-16731467
 ] 

Matthew Rocklin commented on ARROW-1983:


>  I'm planning to move more of the multifile dataset handling into C++ because 
> we also need it in Ruby and R, so would make sense to maintain one 
> implementation for the 3 languages

Makes sense to me.  No pressure, but is there a time in particular when you're 
planning to do this?  This will help me with planning on the Dask side.  I'm 
also happy to help with things on the Python Arrow side near term if they come 
up.  

For context see https://github.com/dask/dask/pull/4336#issuecomment-450686100

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robert Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.13.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.
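
To make the row-group filtering described in the issue concrete, here is a minimal sketch in plain Python. It assumes the summary records a min/max pair per row group for a single column. `RowGroupStats` and `select_row_groups` are hypothetical names used only for illustration; they are not pyarrow or parquet-cpp APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical summary entry for one row group (not a pyarrow type)."""
    file_path: str   # data file that holds the row group
    row_group: int   # index of the row group within that file
    min_value: int   # smallest value of the filtered column
    max_value: int   # largest value of the filtered column


def select_row_groups(summary: List[RowGroupStats], lo: int, hi: int) -> List[RowGroupStats]:
    """Keep row groups whose [min, max] range overlaps the closed range [lo, hi]."""
    return [rg for rg in summary if rg.max_value >= lo and rg.min_value <= hi]


# Assumed contents of a summary covering two data files.
summary = [
    RowGroupStats("part-0.parquet", 0, 0, 99),
    RowGroupStats("part-0.parquet", 1, 100, 199),
    RowGroupStats("part-1.parquet", 0, 200, 299),
]

# Only row groups that may contain values in [150, 250] need to be read.
needed = select_row_groups(summary, 150, 250)
```

The point of the sketch is that the decision uses only the summary; no data file is opened until a row group has been selected.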





[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-12-31 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731464#comment-16731464
 ] 

Wes McKinney commented on ARROW-1983:
-

Mechanically this isn't a huge change. On the C++ side we would expose an API 
to append row group metadata into a common file. This can then be used from 
the Python side. I'm planning to move more of the multifile dataset handling 
into C++ because we also need it in Ruby and R, so it would make sense to 
maintain one implementation for the three languages.
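
As a rough illustration of "appending row group metadata into a common file", the following plain-Python sketch models the idea. It is not the proposed C++ API: `FileMeta`, `RowGroupMeta`, and `append_from` are hypothetical names, and the schema check is an assumption about what such an API would likely enforce.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RowGroupMeta:
    """Hypothetical row group entry."""
    num_rows: int
    file_path: str = ""  # relative path of the data file; empty until known


@dataclass
class FileMeta:
    """Hypothetical stand-in for the metadata of one Parquet file."""
    schema: Tuple[str, ...]
    row_groups: List[RowGroupMeta] = field(default_factory=list)

    def append_from(self, other: "FileMeta", file_path: str) -> None:
        """Append the other file's row groups, tagging each with its file path."""
        if other.schema != self.schema:
            raise ValueError("schemas differ; cannot combine metadata")
        for rg in other.row_groups:
            self.row_groups.append(RowGroupMeta(rg.num_rows, file_path))


schema = ("id", "value")
common = FileMeta(schema)  # the combined metadata for the whole dataset
common.append_from(FileMeta(schema, [RowGroupMeta(10), RowGroupMeta(5)]), "part-0.parquet")
common.append_from(FileMeta(schema, [RowGroupMeta(7)]), "part-1.parquet")
total_rows = sum(rg.num_rows for rg in common.row_groups)
```

Recording the file path on each appended row group is what would let a reader go from the common file back to the data file that holds the row group.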

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robert Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.13.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.





[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-12-29 Thread Matthew Rocklin (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730844#comment-16730844
 ] 

Matthew Rocklin commented on ARROW-1983:


> If I understand correctly, we need to combine all of the row group metadata 
> for all files in a directory.

Yes.  Ideally when writing a row group we would get some metadata object in 
memory. We would then collect all of those objects and hand them to some 
`write_metadata` function afterwards.

> When a new file is written, does this file have to be updated?
 
Yes, or it can be removed/invalidated.
 
As a side note, this is probably one of a small number of issues that stop Dask 
Dataframe from using PyArrow by default. Metadata files with full row group 
information are especially valuable for us, particularly with remote/cloud 
storage. (I'm going through Dask's parquet handling now.)
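
The collect-then-write workflow and the invalidation question above can be sketched with the standard library alone. Everything here is an assumption made for illustration: `write_part`, `write_metadata`, and `metadata_is_stale` are hypothetical helpers, the data files are plain text, and the sidecar is JSON, whereas a real `_metadata` file is a Parquet footer.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_part(root: Path, name: str, rows: List[int]) -> Dict:
    """Write one data file and return an in-memory metadata object for it."""
    (root / name).write_text("\n".join(map(str, rows)))
    return {"file_path": name, "num_rows": len(rows), "min": min(rows), "max": max(rows)}


def write_metadata(root: Path, collected: List[Dict]) -> None:
    """Write all collected metadata objects to a single `_metadata` sidecar."""
    (root / "_metadata").write_text(json.dumps(collected))


def metadata_is_stale(root: Path) -> bool:
    """Return True if the data files on disk differ from those listed in `_metadata`."""
    listed = {entry["file_path"] for entry in json.loads((root / "_metadata").read_text())}
    on_disk = {p.name for p in root.iterdir() if not p.name.startswith("_")}
    return on_disk != listed


root = Path(tempfile.mkdtemp())

# Collect one metadata object per written file, then hand them all over at once.
collected = [write_part(root, "part-0.txt", [1, 2, 3]), write_part(root, "part-1.txt", [4, 5])]
write_metadata(root, collected)
stale_before = metadata_is_stale(root)

# Writing a new file makes the sidecar stale until it is rewritten (or removed).
collected.append(write_part(root, "part-2.txt", [6]))
stale_after = metadata_is_stale(root)
write_metadata(root, collected)
```

Listing the directory to detect staleness is exactly the cost the summary file is meant to avoid on remote storage, so in practice the writer would be responsible for updating or removing the file.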

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robert Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.13.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.





[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-11-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687186#comment-16687186
 ] 

Wes McKinney commented on ARROW-1983:
-

If I understand correctly, we need to combine all of the row group metadata for 
all files in a directory. When a new file is written, does this file have to be 
updated?

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robert Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.13.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.





[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-10-01 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633762#comment-16633762
 ] 

Wes McKinney commented on ARROW-1983:
-

It seems more work is needed here. Moving to 0.12.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robert Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.12.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.





[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-07-12 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541803#comment-16541803
 ] 

Robert Gruener commented on ARROW-1983:
---

[~xhochy] I made this dependent task PARQUET-1348

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.11.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.





[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-07-11 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540231#comment-16540231
 ] 

Robert Gruener commented on ARROW-1983:
---

This looks like it would need changes in parquet-cpp, as the 
[arrow writer only takes a Schema|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.h#L116] 
and not the FileMetaData object, which contains the row group information.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.11.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.


