[ 
https://issues.apache.org/jira/browse/ARROW-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073116#comment-17073116
 ] 

Joris Van den Bossche edited comment on ARROW-2882 at 4/14/20, 7:46 AM:
------------------------------------------------------------------------

This kind of partitioning scheme is already implemented in the Datasets 
project, and exposed in the python bindings.

You can do:

{code}
import pyarrow.dataset as ds

ds.dataset("root_directory/",
           partitioning=ds.partitioning(field_names=["year", "month", "day"]))
{code}

to name the successive directory levels of the file path. Alternatively, you can 
pass an actual schema, which also specifies the data type of each field instead 
of letting the types be inferred from the file path.




> [C++][Python] Support AWS Firehose partition_scheme implementation for 
> Parquet datasets
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-2882
>                 URL: https://issues.apache.org/jira/browse/ARROW-2882
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: Pablo Javier Takara
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>             Fix For: 2.0.0
>
>
> I'd like to be able to read a ParquetDataset generated by AWS Firehose.
> The only implementation at the time of writing was the partition scheme 
> created by Hive (year=2018/month=01/day=11).
> The AWS Firehose partition scheme is a little bit different (2018/01/11).
>  
> Thanks


