[ https://issues.apache.org/jira/browse/ARROW-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056284#comment-17056284 ]

Joris Van den Bossche commented on ARROW-8039:
----------------------------------------------

> We might focus this by saying that the objective is to satisfy the .read() 
> method of ParquetDataset and to at least support the filters argument to the 
> init method (with the bonus feature that you can filter on any column, not 
> just partition keys, as an incentive to use the new code).

If that is the goal, I think this should be trivial. Which isn't to say it's 
not useful! Being able to run part of the tests with it might discover issues. 
I did something similar for the read_table function at 
https://github.com/apache/arrow/pull/6303 (the utility code there to convert 
old-format filters to the new expressions might be useful here as well). In 
case this issue has not yet been started, I could also add this to that PR 
tomorrow.
 
This would also stress-test the manifest / dataset discovery part (which has a 
custom Python implementation, so it would be useful to compare that to what the 
datasets API does), though I'm not sure the tests for this are very extensive.

> This would exclude supporting object attributes like "pieces", which we could 
> address separately for dask et al..

Yes, but these are the hard parts (and the parts that dask uses extensively). 
So it's mostly for those parts that we will need to decide whether we want to 
try to create an API-compatible shim, or rather provide the necessary features 
to enable migrating to the new API.


> [C++][Python][Dataset] Assemble a minimal ParquetDataset shim
> -------------------------------------------------------------
>
>                 Key: ARROW-8039
>                 URL: https://issues.apache.org/jira/browse/ARROW-8039
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++ - Dataset, Python
>    Affects Versions: 0.16.0
>            Reporter: Ben Kietzman
>            Assignee: Ben Kietzman
>            Priority: Major
>             Fix For: 0.17.0
>
>
> Assemble a minimal ParquetDataset shim backed by {{pyarrow.dataset.*}}. 
> Replace the existing ParquetDataset with the shim by default, allowing opt-out 
> for users who need the current ParquetDataset.
> This is mostly exploratory, to see which of the Python tests fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)