[jira] [Commented] (ARROW-2801) [Python] Implement split_row_groups for ParquetDataset

2018-12-05 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710584#comment-16710584
 ] 

Robert Gruener commented on ARROW-2801:
---

I might have time to finish this up next week. I actually already have it 
implemented for the case where the _metadata file is not used, but I still need 
to write the unit tests.
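
For reference, the footer-reading variant amounts to roughly the following (a 
minimal sketch, not the actual patch; it assumes a local filesystem and the 
ParquetDatasetPiece API of this era, and the helper name is mine):

{code}
import pyarrow.parquet as pq

def split_pieces_by_row_group(dataset):
    """Expand each dataset piece into one piece per row group."""
    new_pieces = []
    for piece in dataset.pieces:
        # Read this file's footer to learn how many row groups it holds.
        num_row_groups = pq.ParquetFile(piece.path).metadata.num_row_groups
        for rg in range(num_row_groups):
            new_pieces.append(pq.ParquetDatasetPiece(
                piece.path, row_group=rg, partition_keys=piece.partition_keys))
    return new_pieces
{code}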

> [Python] Implement split_row_groups for ParquetDataset
> -
>
> Key: ARROW-2801
> URL: https://issues.apache.org/jira/browse/ARROW-2801
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Minor
>  Labels: parquet, pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently the split_row_groups argument in ParquetDataset raises a 
> NotImplementedError. An easy and efficient way to implement this is by using 
> the summary metadata file instead of opening every footer file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level

2018-09-05 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604756#comment-16604756
 ] 

Robert Gruener commented on ARROW-1796:
---

That sounds good to me. I would like to point out that it would be nice to 
apply this at the ParquetDataset level as well, extending the filter parameter 
that already exists to handle both Hive partitions and row-group-level 
filtering: 
[https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L777]. It 
could do this by using the summary _metadata file or by reading all footers.
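
A rough sketch of what the row-group-level part could look like on top of the 
per-row-group statistics (the filter tuples mirror the existing {{filters}} 
format; the helper name and the conservative handling are mine):

{code}
def keep_row_group(rg_meta, filters):
    """Return False only when statistics prove no row in the group can match."""
    for i in range(rg_meta.num_columns):
        col = rg_meta.column(i)
        if not col.is_stats_set or col.statistics is None:
            continue                              # no stats -> keep the group
        stats = col.statistics
        for name, op, value in filters:
            if name != col.path_in_schema:
                continue
            if op in ('=', '==') and not (stats.min <= value <= stats.max):
                return False
            if op in ('<', '<=') and stats.min > value:
                return False
            if op in ('>', '>=') and stats.max < value:
                return False
    return True
{code}

Each candidate row group's metadata would come either from the summary 
_metadata file or from the individual footers, as noted above.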

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.11.0
>
>
> We can build upon the API defined in {{fastparquet}} for defining RowGroup 
> filters: 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 
> and translate them into the C++ enums we will define in 
> https://issues.apache.org/jira/browse/PARQUET-1158 . This should enable us to 
> provide the user with a simple predicate pushdown API that we can extend in 
> the background from RowGroup to Page level later on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-08-20 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener reassigned ARROW-1983:
-

Assignee: Robert Gruener

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Robert Gruener
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.11.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to a new function that then passes them on as C++ objects to {{parquet-cpp}}, 
> which generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2800) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-03 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568244#comment-16568244
 ] 

Robert Gruener commented on ARROW-2800:
---

Is there a way to move this ticket under the PARQUET project? (I do not see a 
move option.) Otherwise I can make a new ticket for it there.

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: ARROW-2800
> URL: https://issues.apache.org/jira/browse/ARROW-2800
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: 0.11.0
>
>
> I have a dataset generated by Spark which shows it has statistics for the 
> string column when using the java parquet-mr code (shown by using 
> `parquet-tools meta`), however reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I made this here rather 
> than a PARQUET JIRA since I wanted to be sure this is actually an issue and 
> there wasn't a ticket already made there (I couldn't find one but I wanted to 
> be sure). Either way I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2800) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-02 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener reassigned ARROW-2800:
-

Assignee: Robert Gruener

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: ARROW-2800
> URL: https://issues.apache.org/jira/browse/ARROW-2800
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: 0.11.0
>
>
> I have a dataset generated by Spark which shows it has statistics for the 
> string column when using the java parquet-mr code (shown by using 
> `parquet-tools meta`), however reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I made this here rather 
> than a PARQUET JIRA since I wanted to be sure this is actually an issue and 
> there wasn't a ticket already made there (I couldn't find one but I wanted to 
> be sure). Either way I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2800) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-02 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567531#comment-16567531
 ] 

Robert Gruener commented on ARROW-2800:
---

OK, so I was wondering why parquet-mr 1.10.0 can read the old corrupt statistics 
but parquet-cpp cannot, and I found the relevant java code: 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L633]
Basically, parquet-cpp does not consider the case of min and max being the same 
(which would actually be nice to have for our use case).

This is something that can be fixed in parquet-cpp. I should be able to 
implement it so it has parity with the java implementation.
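
To make the parity target concrete, here is a heavily simplified Python 
rendering of the java check linked above; it is my reading of the logic, not 
the parquet-mr code itself, and the function name and arguments are 
illustrative:

{code}
# Hedged, simplified rendering of parquet-mr's decision whether to trust
# pre-PARQUET-251 statistics. Numeric columns were never affected by the
# signed-byte comparison bug, and for affected (binary) columns the stats
# are still safe when min == max -- the carve-out parquet-cpp was missing.
def should_use_statistics(writer_is_trusted, physical_type, stat_min, stat_max):
    if physical_type not in ('BINARY', 'FIXED_LEN_BYTE_ARRAY'):
        return True                     # e.g. INT64, DOUBLE: always usable
    if writer_is_trusted:
        return True                     # written by a fixed parquet-mr version
    return stat_min == stat_max         # old writer: only safe when equal
{code}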

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: ARROW-2800
> URL: https://issues.apache.org/jira/browse/ARROW-2800
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: 0.11.0
>
>
> I have a dataset generated by Spark which shows it has statistics for the 
> string column when using the java parquet-mr code (shown by using 
> `parquet-tools meta`), however reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I made this here rather 
> than a PARQUET JIRA since I wanted to be sure this is actually an issue and 
> there wasn't a ticket already made there (I couldn't find one but I wanted to 
> be sure). Either way I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2800) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-02 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567515#comment-16567515
 ] 

Robert Gruener commented on ARROW-2800:
---

Nevermind, I found 
[https://github.com/apache/parquet-cpp/blob/master/src/parquet/metadata.cc#L125]

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: ARROW-2800
> URL: https://issues.apache.org/jira/browse/ARROW-2800
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: 0.11.0
>
>
> I have a dataset generated by Spark which shows it has statistics for the 
> string column when using the java parquet-mr code (shown by using 
> `parquet-tools meta`), however reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I made this here rather 
> than a PARQUET JIRA since I wanted to be sure this is actually an issue and 
> there wasn't a ticket already made there (I couldn't find one but I wanted to 
> be sure). Either way I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2800) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-02 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567478#comment-16567478
 ] 

Robert Gruener commented on ARROW-2800:
---

So I have dug into the code here a bit out of curiosity and see that indeed, 
when a column order is set, it will use the new statistics: 
[https://github.com/apache/parquet-cpp/blob/853abb96e95bd440b7a175a01909e70fe760665b/src/parquet/metadata.cc#L53]

The column order is passed in from InitColumnOrders(), which looks to see if the 
column_orders field is set on the FileMetaData and, if so, adds the appropriate 
type: 
[https://github.com/apache/parquet-cpp/blob/853abb96e95bd440b7a175a01909e70fe760665b/src/parquet/metadata.cc#L380]

However, in the file I posted, which was written with parquet 1.8.3, column_orders 
is not set on the FileMetaData (that field does not yet exist in the 1.8.3 
format). I have confirmed this by running the java parquet-mr code on the 
included file, and indeed `fileMetadata.isSetColumn_orders()` returns false. 
Therefore I would expect it to have an undefined column order and to fall back to 
the older parquet statistics. I am not saying this is the preferred behavior; I 
am just trying to wrap my head around the code and what is going on there.
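
For anyone following along, the asymmetry is easy to see from Python with the 
attached file (path per the description above):

{code}
from pyarrow import parquet as pq

# Print which columns expose statistics: with this file the int and float
# columns do, while the string column comes back with is_stats_set == False.
meta = pq.read_metadata('/tmp/metadata.parquet')
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.is_stats_set, col.statistics)
{code}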

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: ARROW-2800
> URL: https://issues.apache.org/jira/browse/ARROW-2800
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: 0.11.0
>
>
> I have a dataset generated by Spark which shows it has statistics for the 
> string column when using the java parquet-mr code (shown by using 
> `parquet-tools meta`), however reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I made this here rather 
> than a PARQUET JIRA since I wanted to be sure this is actually an issue and 
> there wasn't a ticket already made there (I couldn't find one but I wanted to 
> be sure). Either way I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2800) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-07-26 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558352#comment-16558352
 ] 

Robert Gruener commented on ARROW-2800:
---

Ah, I see. Though why can parquet-cpp read the statistics for the int and 
double fields in that case?

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: ARROW-2800
> URL: https://issues.apache.org/jira/browse/ARROW-2800
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: 0.11.0
>
>
> I have a dataset generated by Spark which shows it has statistics for the 
> string column when using the java parquet-mr code (shown by using 
> `parquet-tools meta`), however reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I made this here rather 
> than a PARQUET JIRA since I wanted to be sure this is actually an issue and 
> there wasn't a ticket already made there (I couldn't find one but I wanted to 
> be sure). Either way I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2911) [Python] Parquet binary statistics that end in '\0' truncate last byte

2018-07-25 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556161#comment-16556161
 ] 

Robert Gruener commented on ARROW-2911:
---

This is likely related to ARROW-2800?

> [Python] Parquet binary statistics that end in '\0' truncate last byte
> --
>
> Key: ARROW-2911
> URL: https://issues.apache.org/jira/browse/ARROW-2911
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0, 0.10.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.11.0
>
>
> This is due to an intermediate step treating them as c-strings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS

2018-07-25 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener resolved ARROW-2842.
---
Resolution: Invalid

I have not been able to reproduce this reliably. It was likely due to an HDFS 
connection issue and not an issue with pyarrow.

> [Python] Cannot read parquet files with row group size of 1 From HDFS
> -
>
> Key: ARROW-2842
> URL: https://issues.apache.org/jira/browse/ARROW-2842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
> Attachments: single-row.parquet
>
>
> This might be a bug in parquet-cpp; I need to spend a bit more time tracking 
> this down, but basically, given a file with a single row on HDFS, reading it 
> with pyarrow yields this error:
> ```
> TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the 
> stream
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
>  @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
>  @ parquet::SerializedFile::ParseMetaData()
>  @ 
> parquet::ParquetFileReader::Contents::Open(std::unique_ptr  std::default_delete >, 
> parquet::ReaderProperties const&, std::shared_ptr 
> const&)
>  @ 
> parquet::ParquetFileReader::Open(std::unique_ptr std::default_delete >, parquet::ReaderProperties 
> const&, std::shared_ptr const&)
>  @ parquet::arrow::OpenFile(std::shared_ptr 
> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
> std::shared_ptr const&, 
> std::unique_ptr std::default_delete >*)
>  @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
> _object*)
> ```
> The following code causes it:
> ```
> import pyarrow
> import pyarrow.parquet as pq
>  
> fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in 
> namenode information
> file_object = fs.open('single-row.parquet') # update for hdfs path of file
> pq.read_metadata(file_object) # this works
> parquet_file = pq.ParquetFile(file_object)
> parquet_file.read_row_group(0) # throws error
> ```
>  
> I am working on writing a unit test for this. Note that I am using libhdfs3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS

2018-07-12 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener updated ARROW-2842:
--
Description: 
This might be a bug in parquet-cpp; I need to spend a bit more time tracking 
this down, but basically, given a file with a single row on HDFS, reading it with 
pyarrow yields this error:

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```

import pyarrow

import pyarrow.parquet as pq

 

fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in 
namenode information

file_object = fs.open('single-row.parquet') # update for hdfs path of file

pq.read_metadata(file_object) # this works

parquet_file = pq.ParquetFile(file_object)

parquet_file.read_row_group(0) # throws error

```

 

I am working on writing a unit test for this. Note that I am using libhdfs3.

  was:
This might be a bug in parquet-cpp, I need to spend a bit more time tracking 
this down but basically given a file with a single row on hdfs, reading it with 
pyarrow yields this error

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
"10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```

import pyarrow

import pyarrow.parquet as pq

 

fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in 
namenode information

file_object = fs.open('single-row.parquet') # update for hdfs path of file

pq.read_metadata(file_object) # this works

parquet_file = pq.ParquetFile(file_object)

parquet_file.read_row_group(0) # throws error

```

 

I am working on writing a unit test for this. Note that I am using libhdfs3.


> [Python] Cannot read parquet files with row group size of 1 From HDFS
> -
>
> Key: ARROW-2842
> URL: https://issues.apache.org/jira/browse/ARROW-2842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
> Attachments: single-row.parquet
>
>
> This might be a bug in parquet-cpp; I need to spend a bit more time tracking 
> this down, but basically, given a file with a single row on HDFS, reading it 
> with pyarrow yields this error:
> ```
> TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the 
> stream
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
>  @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
>  @ parquet::SerializedFile::ParseMetaData()
>  @ 
> parquet::ParquetFileReader::Contents::Open(std::unique_ptr  std::default_delete >, 
> parquet::ReaderProperties const&, std::shared_ptr 
> const&)
>  @ 
> parquet::ParquetFileReader::Open(std::unique_ptr std::default_delete >, parquet::ReaderProperties 
> const&, std::shared_ptr const&)
>  @ parquet::arrow::OpenFile(std::shared_ptr 
> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
> std::shared_ptr const&, 
> std::unique_ptr std::default_delete >*)
>  @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
> _object*)
> ```
> The following code causes it:
> ```
> import pyarrow
> import pyarrow.parquet as pq
>  
> fs = pyarrow.hdfs.connect('my-namenode-url', 

[jira] [Updated] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS

2018-07-12 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener updated ARROW-2842:
--
Description: 
This might be a bug in parquet-cpp; I need to spend a bit more time tracking 
this down, but basically, given a file with a single row on HDFS, reading it with 
pyarrow yields this error:

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
"10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```

import pyarrow

import pyarrow.parquet as pq

 

fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in 
namenode information

file_object = fs.open('single-row.parquet') # update for hdfs path of file

pq.read_metadata(file_object) # this works

parquet_file = pq.ParquetFile(file_object)

parquet_file.read_row_group(0) # throws error

```

 

I am working on writing a unit test for this. Note that I am using libhdfs3.

  was:
This might be a bug in parquet-cpp, I need to spend a bit more time tracking 
this down but basically given a file with a single row on hdfs, reading it with 
pyarrow yields this error

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
"10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```

import pyarrow

import pyarrow.parquet as pq

 

fs = pyarrow.hdfs.connect() # fill in namenode information

file_object = fs.open('single-row.parquet') # update for hdfs path of file

pq.read_metadata(file_object) # this works

parquet_file = pq.ParquetFile(file_object)

parquet_file.read_row_group(0) # throws error

```

 

I am working on writing a unit test for this


> [Python] Cannot read parquet files with row group size of 1 From HDFS
> -
>
> Key: ARROW-2842
> URL: https://issues.apache.org/jira/browse/ARROW-2842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
> Attachments: single-row.parquet
>
>
> This might be a bug in parquet-cpp; I need to spend a bit more time tracking 
> this down, but basically, given a file with a single row on HDFS, reading it 
> with pyarrow yields this error:
> ```
> TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
> "10.103.182.28:50010": End of the stream
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
>  @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
>  @ parquet::SerializedFile::ParseMetaData()
>  @ 
> parquet::ParquetFileReader::Contents::Open(std::unique_ptr  std::default_delete >, 
> parquet::ReaderProperties const&, std::shared_ptr 
> const&)
>  @ 
> parquet::ParquetFileReader::Open(std::unique_ptr std::default_delete >, parquet::ReaderProperties 
> const&, std::shared_ptr const&)
>  @ parquet::arrow::OpenFile(std::shared_ptr 
> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
> std::shared_ptr const&, 
> std::unique_ptr std::default_delete >*)
>  @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
> _object*)
> ```
> The following code causes it:
> ```
> import pyarrow
> import pyarrow.parquet as pq
>  
> fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # 

[jira] [Created] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS

2018-07-12 Thread Robert Gruener (JIRA)
Robert Gruener created ARROW-2842:
-

 Summary: [Python] Cannot read parquet files with row group size of 
1 From HDFS
 Key: ARROW-2842
 URL: https://issues.apache.org/jira/browse/ARROW-2842
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Gruener
 Attachments: single-row.parquet

This might be a bug in parquet-cpp; I need to spend a bit more time tracking 
this down, but basically, given a file with a single row on HDFS, reading it with 
pyarrow yields this error:

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
"10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```

import pyarrow

import pyarrow.parquet as pq

 

fs = pyarrow.hdfs.connect() # fill in namenode information

file_object = fs.open('single-row.parquet') # update for hdfs path of file

pq.read_metadata(file_object) # this works

parquet_file = pq.ParquetFile(file_object)

parquet_file.read_row_group(0) # throws error

```

 

I am working on writing a unit test for this



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-07-12 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541803#comment-16541803
 ] 

Robert Gruener commented on ARROW-1983:
---

[~xhochy] I created PARQUET-1348 as a dependent task for this.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.11.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to a new function that then passes them on as C++ objects to {{parquet-cpp}}, 
> which generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-07-11 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540231#comment-16540231
 ] 

Robert Gruener commented on ARROW-1983:
---

This looks like it would need changes in parquet-cpp, as the [arrow writer only 
takes a 
Schema|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.h#L116]
 and not the FileMetaData object, which contains the row group information.
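
For context, a hedged sketch of the end-to-end flow this ticket is after, using 
the metadata-collector APIs that later pyarrow releases expose 
(pq.write_metadata, FileMetaData.set_file_path); at the time of this comment 
those pieces did not exist yet, and the helper name is mine:

{code}
import os
import pyarrow.parquet as pq

def write_dataset_metadata(dataset_root, parquet_paths):
    """Combine each file's footer into a dataset-level _metadata file."""
    collected, schema = [], None
    for path in parquet_paths:
        md = pq.read_metadata(path)
        # Row-group file paths inside _metadata are relative to the dataset root.
        md.set_file_path(os.path.relpath(path, dataset_root))
        collected.append(md)
        if schema is None:
            schema = pq.read_schema(path)
    # write_metadata appends every collected row group into one footer.
    pq.write_metadata(schema, os.path.join(dataset_root, '_metadata'),
                      metadata_collector=collected)
{code}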

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.11.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to a new function that then passes them on as C++ objects to {{parquet-cpp}}, 
> which generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2801) [Python] Implement split_row_groups for ParquetDataset

2018-07-06 Thread Robert Gruener (JIRA)
Robert Gruener created ARROW-2801:
-

 Summary: [Python] Implement split_row_groups for ParquetDataset
 Key: ARROW-2801
 URL: https://issues.apache.org/jira/browse/ARROW-2801
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Robert Gruener


Currently the split_row_groups argument in ParquetDataset raises a 
NotImplementedError. An easy and efficient way to implement this is by using the 
summary metadata file instead of opening every footer file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2656) [Python] Improve ParquetManifest creation time

2018-07-06 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener reassigned ARROW-2656:
-

Assignee: Robert Gruener

> [Python] Improve ParquetManifest creation time 
> ---
>
> Key: ARROW-2656
> URL: https://issues.apache.org/jira/browse/ARROW-2656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When a parquet dataset is highly partitioned, the time to call the 
> constructor for 
> [ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]
>  takes a significant amount of time since it serially visits directories to 
> find all parquet files. In a dataset with thousands of partition values this 
> can take several minutes from a personal laptop.
> A quick win to vastly improve this performance would be to use a ThreadPool 
> to have calls to {{_visit_level}} happen concurrently to prevent wasting a 
> ton of time waiting on I/O.
> An even faster option could be to allow for optional indexing of dataset 
> metadata in something like the {{common_metadata}}. This could contain all 
> files in the manifest and their row_group information. This would also allow 
> for 
> [split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
>  to be implemented efficiently without needing to open every parquet file in 
> the dataset to retrieve the metadata which is quite time consuming for large 
> datasets. The main problem with the indexing approach is that it requires 
> immutability of the dataset, which doesn't seem too unreasonable. This 
> specific implementation seems related to 
> https://issues.apache.org/jira/browse/ARROW-1983 however that only covers the 
> write portion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2018-07-06 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535115#comment-16535115
 ] 

Robert Gruener commented on ARROW-1956:
---

Can this not already be done using the filters argument on a dataset? 
https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L713
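
For example (the partition column name is illustrative):

{code}
import pyarrow.parquet as pq

# Only pieces whose Hive partition matches the filter are read.
dataset = pq.ParquetDataset('datadir', filters=[('year', '=', 2017)])
table = dataset.read()
{code}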

> [Python] Support reading specific partitions from a partitioned parquet 
> dataset
> ---
>
> Key: ARROW-1956
> URL: https://issues.apache.org/jira/browse/ARROW-1956
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Affects Versions: 0.8.0
> Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: parquet
> Fix For: 0.11.0
>
> Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned parquet dataset.  This 
> is very useful in case of large datasets.  I have attached a small script 
> that creates a dataset and shows what is expected when reading (quoting 
> salient points below).
> # There is no way to read specific partitions in Pandas
> # In pyarrow I tried to achieve the goal by providing a list of 
> files/directories to ParquetDataset, but it didn't work: 
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
> I also couldn't find a way to easily write partitioned parquet files.  In the 
> end I did it by hand by creating the directory hierarchies, and writing the 
> individual files myself (similar to the implementation in the attached 
> script).  Again, in PySpark I can do 
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2763) [Python] Make parquet _metadata file accessible from ParquetDataset

2018-06-29 Thread Robert Gruener (JIRA)
Robert Gruener created ARROW-2763:
-

 Summary: [Python] Make parquet _metadata file accessible from 
ParquetDataset
 Key: ARROW-2763
 URL: https://issues.apache.org/jira/browse/ARROW-2763
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Gruener


Currently when creating a ParquetDataset, it gives you access to the 
_common_metadata file but not the _metadata file.

We access the _metadata file to get row group information for the dataset without 
opening each file's footer.
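
For reference, this is roughly what we do by hand today once the file is 
available (a minimal sketch; the dataset path is illustrative):

{code}
import pyarrow.parquet as pq

# The summary _metadata footer already carries every row group in the dataset.
meta = pq.read_metadata('/path/to/dataset/_metadata')
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(rg.num_rows, rg.total_byte_size)
{code}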



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2761) Support set filter operators on Hive partitioned Parquet files

2018-06-28 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526544#comment-16526544
 ] 

Robert Gruener commented on ARROW-2761:
---

https://github.com/apache/arrow/pull/2188

> Support set filter operators on Hive partitioned Parquet files
> --
>
> Key: ARROW-2761
> URL: https://issues.apache.org/jira/browse/ARROW-2761
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Gruener
>Priority: Minor
>  Labels: features, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Pyarrow supports many operators on Hive partitioned parquet files. It should 
> add support for set operations similar to 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L335



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2761) Support set filter operators on Hive partitioned Parquet files

2018-06-28 Thread Robert Gruener (JIRA)
Robert Gruener created ARROW-2761:
-

 Summary: Support set filter operators on Hive partitioned Parquet 
files
 Key: ARROW-2761
 URL: https://issues.apache.org/jira/browse/ARROW-2761
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Gruener


Pyarrow supports many operators on Hive partitioned parquet files. It should 
add support for set operations similar to 
https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L335
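
Something along these lines (a hedged sketch of the desired API; column names 
and the container type for the value set are illustrative):

{code}
import pyarrow.parquet as pq

# 'in' / 'not in' would prune Hive partitions against a set of values,
# complementing the existing scalar comparison operators.
dataset = pq.ParquetDataset(
    'dataset_root',
    filters=[('country', 'in', {'US', 'CA'}),
             ('year', 'not in', {2015})])
table = dataset.read()
{code}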



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2656) [Python] Improve ParquetManifest creation time

2018-06-27 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525236#comment-16525236
 ] 

Robert Gruener commented on ARROW-2656:
---

I have opened [https://github.com/apache/arrow/pull/2185], which is a quick win 
that gives a rather large performance boost for creating a parquet manifest on 
a partitioned dataset that lives in HDFS.

I think using the summary file will be another improvement to make on top of 
this, but this is extremely useful in the meantime.
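
For the record, the concurrency idea boils down to something like the following 
breadth-first variant (a simplified sketch, not the actual patch; {{fs.ls}} and 
{{fs.isdir}} stand in for the filesystem calls ParquetManifest makes):

{code}
from concurrent.futures import ThreadPoolExecutor

def collect_parquet_files(fs, root, max_workers=16):
    """Walk partition directories level by level, listing each level in parallel."""
    files = []

    def list_one(path):
        dirs, leaves = [], []
        for entry in fs.ls(path):
            (dirs if fs.isdir(entry) else leaves).append(entry)
        return dirs, leaves

    level = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while level:
            next_level = []
            # All directories at this depth are listed concurrently, so the
            # per-directory round trips to HDFS overlap instead of serializing.
            for dirs, leaves in pool.map(list_one, level):
                next_level.extend(dirs)
                files.extend(f for f in leaves if f.endswith('.parquet'))
            level = next_level
    return files
{code}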

> [Python] Improve ParquetManifest creation time 
> ---
>
> Key: ARROW-2656
> URL: https://issues.apache.org/jira/browse/ARROW-2656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a parquet dataset is highly partitioned, the time to call the 
> constructor for 
> [ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]
>  takes a significant amount of time since it serially visits directories to 
> find all parquet files. In a dataset with thousands of partition values this 
> can take several minutes from a personal laptop.
> A quick win to vastly improve this performance would be to use a ThreadPool 
> to have calls to {{_visit_level}} happen concurrently to prevent wasting a 
> ton of time waiting on I/O.
> An even faster option could be to allow for optional indexing of dataset 
> metadata in something like the {{common_metadata}}. This could contain all 
> files in the manifest and their row_group information. This would also allow 
> for 
> [split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
>  to be implemented efficiently without needing to open every parquet file in 
> the dataset to retrieve the metadata which is quite time consuming for large 
> datasets. The main problem with the indexing approach is that it requires 
> immutability of the dataset, which doesn't seem too unreasonable. This 
> specific implementation seems related to 
> https://issues.apache.org/jira/browse/ARROW-1983 however that only covers the 
> write portion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2656) [Python] Improve ParquetManifest creation time

2018-05-31 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener updated ARROW-2656:
--
Description: 
When a parquet dataset is highly partitioned, the time to call the constructor 
for 
[ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]
 takes a significant amount of time since it serially visits directories to 
find all parquet files. In a dataset with thousands of partition values this 
can take several minutes from a personal laptop.

A quick win to vastly improve this performance would be to use a ThreadPool to 
have calls to {{_visit_level}} happen concurrently to prevent wasting a ton of 
time waiting on I/O.

An even faster option could be to allow for optional indexing of dataset 
metadata in something like the {{common_metadata}}. This could contain all 
files in the manifest and their row_group information. This would also allow 
for 
[split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
 to be implemented efficiently without needing to open every parquet file in 
the dataset to retrieve the metadata which is quite time consuming for large 
datasets. The main problem with the indexing approach is that it requires 
immutability of the dataset, which doesn't seem too unreasonable. This specific 
implementation seems related to 
https://issues.apache.org/jira/browse/ARROW-1983 however that only covers the 
write portion.

  was:
When a parquet dataset is highly partitioned, the time to call the constructor 
for 
[ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]takes
 a significant amount of time since it serially visits directories to find all 
parquet files. In a dataset with thousands of partition values this can take 
several minutes from a personal laptop.

A quick win to vastly improve this performance would be to use a ThreadPool to 
have calls to {{_visit_level}} happen concurrently to prevent wasting a ton of 
time waiting on I/O.

An even faster option could be to allow for optional indexing of dataset 
metadata in something like the {{common_metadata}}. This could contain all 
files in the manifest and their row_group information. This would also allow 
for 
[split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
 to be implemented efficiently without needing to open every parquet file in 
the dataset to retrieve the metadata which is quite time consuming for large 
datasets. The main problem with the indexing approach are it requires 
immutability of the dataset, which doesn't seem too unreasonable. This specific 
implementation seems related to 
https://issues.apache.org/jira/browse/ARROW-1983 however that only covers the 
write portion.


> [Python] Improve ParquetManifest creation time 
> ---
>
> Key: ARROW-2656
> URL: https://issues.apache.org/jira/browse/ARROW-2656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
>
> When a parquet dataset is highly partitioned, the time to call the 
> constructor for 
> [ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]
>  takes a significant amount of time since it serially visits directories to 
> find all parquet files. In a dataset with thousands of partition values this 
> can take several minutes from a personal laptop.
> A quick win to vastly improve this performance would be to use a ThreadPool 
> to have calls to {{_visit_level}} happen concurrently to prevent wasting a 
> ton of time waiting on I/O.
> An even faster option could be to allow for optional indexing of dataset 
> metadata in something like the {{common_metadata}}. This could contain all 
> files in the manifest and their row_group information. This would also allow 
> for 
> [split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
>  to be implemented efficiently without needing to open every parquet file in 
> the dataset to retrieve the metadata which is quite time consuming for large 
> datasets. The main problem with the indexing approach is that it requires 
> immutability of the dataset, which doesn't seem too unreasonable. This 
> specific implementation seems related to 
> https://issues.apache.org/jira/browse/ARROW-1983 however that only covers the 
> write portion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2656) [Python] Improve ParquetManifest creation time

2018-05-31 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener updated ARROW-2656:
--
Description: 
When a parquet dataset is highly partitioned, the time to call the constructor 
for 
[ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]takes
 a significant amount of time since it serially visits directories to find all 
parquet files. In a dataset with thousands of partition values this can take 
several minutes from a personal laptop.

A quick win to vastly improve this performance would be to use a ThreadPool to 
have calls to {{_visit_level}} happen concurrently to prevent wasting a ton of 
time waiting on I/O.

An even faster option could be to allow for optional indexing of dataset 
metadata in something like the {{common_metadata}}. This could contain all 
files in the manifest and their row_group information. This would also allow 
for 
[split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
 to be implemented efficiently without needing to open every parquet file in 
the dataset to retrieve the metadata which is quite time consuming for large 
datasets. The main problem with the indexing approach is that it requires 
immutability of the dataset, which doesn't seem too unreasonable. This specific 
implementation seems related to 
https://issues.apache.org/jira/browse/ARROW-1983 however that only covers the 
write portion.

  was:
When a parquet dataset is highly partitioned, the time to call the constructor 
for 
[ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588][
 
|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]takes
 a significant amount of time since it serially visits directories to find all 
parquet files. In a dataset with thousands of partition values this can take 
several minutes from a personal laptop.

A quick win to vastly improve this performance would be to use a ThreadPool to 
have calls to {{_visit_level}} happen concurrently to prevent wasting a ton of 
time waiting on I/O.

An even faster option could be to allow for optional indexing of dataset 
metadata in something like the {{common_metadata}}. This could contain all 
files in the manifest and their row_group information. This would also allow 
for 
[split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
 to be implemented efficiently without needing to open every parquet file in 
the dataset to retrieve the metadata, which is quite time-consuming for large 
datasets. The main problem with the indexing approach is that it requires 
immutability of the dataset, which doesn't seem too unreasonable. This specific 
implementation seems related to 
https://issues.apache.org/jira/browse/ARROW-1983; however, that only covers the 
write portion.


> [Python] Improve ParquetManifest creation time 
> ---
>
> Key: ARROW-2656
> URL: https://issues.apache.org/jira/browse/ARROW-2656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
>
> When a parquet dataset is highly partitioned, the constructor for 
> [ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588] 
> takes a significant amount of time since it serially visits directories to 
> find all parquet files. In a dataset with thousands of partition values this 
> can take several minutes from a personal laptop.
> A quick win to vastly improve this performance would be to use a ThreadPool 
> so that calls to {{_visit_level}} happen concurrently rather than wasting a 
> ton of time waiting on I/O.
> An even faster option could be to allow for optional indexing of dataset 
> metadata in something like the {{common_metadata}}. This could contain all 
> files in the manifest and their row_group information. This would also allow 
> for 
> [split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
>  to be implemented efficiently without needing to open every parquet file in 
> the dataset to retrieve the metadata, which is quite time-consuming for large 
> datasets. The main problem with the indexing approach is that it requires 
> immutability of the dataset, which doesn't seem too unreasonable. This 
> specific implementation seems related to 
> https://issues.apache.org/jira/browse/ARROW-1983; however, that only covers 
> the write portion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2656) [Python] Improve ParquetManifest creation time

2018-05-31 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener updated ARROW-2656:
--
Summary: [Python] Improve ParquetManifest creation time   (was: [Python] 
Improve ParquetManifest creation time for highly )

> [Python] Improve ParquetManifest creation time 
> ---
>
> Key: ARROW-2656
> URL: https://issues.apache.org/jira/browse/ARROW-2656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
>
> When a parquet dataset is highly partitioned, the constructor for 
> [ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588] 
> takes a significant amount of time since it serially visits directories to 
> find all parquet files. In a dataset with thousands of partition values this 
> can take several minutes from a personal laptop.
> A quick win to vastly improve this performance would be to use a ThreadPool 
> so that calls to {{_visit_level}} happen concurrently rather than wasting a 
> ton of time waiting on I/O.
> An even faster option could be to allow for optional indexing of dataset 
> metadata in something like the {{common_metadata}}. This could contain all 
> files in the manifest and their row_group information. This would also allow 
> for 
> [split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
>  to be implemented efficiently without needing to open every parquet file in 
> the dataset to retrieve the metadata, which is quite time-consuming for large 
> datasets. The main problem with the indexing approach is that it requires 
> immutability of the dataset, which doesn't seem too unreasonable. This 
> specific implementation seems related to 
> https://issues.apache.org/jira/browse/ARROW-1983; however, that only covers 
> the write portion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2656) [Python] Improve ParquetManifest creation time for highly

2018-05-31 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497147#comment-16497147
 ] 

Robert Gruener commented on ARROW-2656:
---

I will attempt to get some code as a benchmark; however, I believe you cannot 
create partitioned parquet datasets through pyarrow right now, which will make 
it necessary to do so through Spark.
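
For reference, a minimal sketch of producing such a partitioned dataset with 
PySpark for benchmarking (column names, sizes, and the output path are made up):

{code:python}
# Illustrative only: write a hive-partitioned parquet dataset with PySpark to
# benchmark ParquetManifest construction. Names, sizes, and paths are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('manifest-benchmark').getOrCreate()

df = spark.range(10 * 1000 * 1000).selectExpr(
    'id',
    'id % 1000 AS part_key',  # ~1000 partition values to stress directory listing
)
df.write.mode('overwrite').partitionBy('part_key').parquet('/tmp/partitioned_dataset')
{code}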

> [Python] Improve ParquetManifest creation time for highly 
> --
>
> Key: ARROW-2656
> URL: https://issues.apache.org/jira/browse/ARROW-2656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
>
> When a parquet dataset is highly partitioned, the constructor for 
> [ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588] 
> takes a significant amount of time since it serially visits directories to 
> find all parquet files. In a dataset with thousands of partition values this 
> can take several minutes from a personal laptop.
> A quick win to vastly improve this performance would be to use a ThreadPool 
> so that calls to {{_visit_level}} happen concurrently rather than wasting a 
> ton of time waiting on I/O.
> An even faster option could be to allow for optional indexing of dataset 
> metadata in something like the {{common_metadata}}. This could contain all 
> files in the manifest and their row_group information. This would also allow 
> for 
> [split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
>  to be implemented efficiently without needing to open every parquet file in 
> the dataset to retrieve the metadata, which is quite time-consuming for large 
> datasets. The main problem with the indexing approach is that it requires 
> immutability of the dataset, which doesn't seem too unreasonable. This 
> specific implementation seems related to 
> https://issues.apache.org/jira/browse/ARROW-1983; however, that only covers 
> the write portion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2656) [Python] Improve ParquetManifest creation time for highly

2018-05-31 Thread Robert Gruener (JIRA)
Robert Gruener created ARROW-2656:
-

 Summary: [Python] Improve ParquetManifest creation time for highly 
 Key: ARROW-2656
 URL: https://issues.apache.org/jira/browse/ARROW-2656
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Gruener


When a parquet dataset is highly partitioned, the constructor for 
[ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588] 
takes a significant amount of time since it serially visits directories to find 
all parquet files. In a dataset with thousands of partition values this can take 
several minutes from a personal laptop.

A quick win to vastly improve this performance would be to use a ThreadPool so 
that calls to {{_visit_level}} happen concurrently rather than wasting a ton of 
time waiting on I/O.

An even faster option could be to allow for optional indexing of dataset 
metadata in something like the {{common_metadata}}. This could contain all 
files in the manifest and their row_group information. This would also allow 
for 
[split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
 to be implemented efficiently without needing to open every parquet file in 
the dataset to retrieve the metadata, which is quite time-consuming for large 
datasets. The main problem with the indexing approach is that it requires 
immutability of the dataset, which doesn't seem too unreasonable. This specific 
implementation seems related to 
https://issues.apache.org/jira/browse/ARROW-1983; however, that only covers the 
write portion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)