[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398227#comment-17398227 ]

Weston Pace commented on ARROW-12358:
-------------------------------------

Do you clear your append-only dataset after step 4?  In other words, is it 
just a temporary staging area (which stays rather small), or do you want to 
keep the duplicate rows in your base dataset?

So, to check my understanding: I think what you are describing is a 
materialized view with incremental refresh.  Does that sound right?

In other words, the results of a query (in your case, roughly a "group by 
date" where you take the latest row instead of aggregating anything) are 
saved off as a view, so you don't have to recompute the query each time.  
You want to update the view when new data arrives, but you should only have 
to read the new data when computing the update.

Some thoughts on your current approach...

* If you can perform the update often enough then the append-only table 
ought to still be cached in memory in the kernel disk cache, so reading the 
newly added data should be fast.  If you don't need to keep this data then 
write it in IPC format (perhaps even to a tmpfs mount) and the access should 
be even faster; a minimal sketch follows this list.  There are some 
durability concerns here, but Arrow doesn't generally make durability 
guarantees anyway.
* For steps 2 & 3 there is more and more work being done in the C++ layer to 
add compute and relational algebra to Arrow itself.  Eventually the hope is 
to support a sort of low-level query IR, which is currently being discussed 
on the ML.  This may ease some of the work here, but the learning curve is 
pretty steep at the moment.  You could scan the append-only dataset to get 
the minimum date value, then create a second dataset which is a scan of the 
old view filtered to date >= min_date.  Then you could union these two 
datasets, apply an order by date, and drop duplicates; the second sketch 
after this list shows the shape of it.  This would allow you to do 
everything in Arrow.  However, it's currently all in progress: "order by" 
was just added (ARROW-13540), and there is no drop-duplicates yet that I am 
aware of, although there may be a way to do this with group by and the right 
aggregate kernel.
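
A minimal sketch of the IPC staging write from the first point, assuming 
pyarrow and a Linux tmpfs mount at /dev/shm; the path and the toy columns 
are placeholders:

import pyarrow as pa
import pyarrow.feather as feather

# Toy staging batch; real code would hold the rows appended in step 4.
table = pa.table({"date": ["2021-08-12", "2021-08-12"], "value": [1.0, 2.0]})

# Feather is the IPC file format; writing to tmpfs keeps re-reads cheap,
# with the durability caveats mentioned above.
feather.write_feather(table, "/dev/shm/staging-part-0.arrow")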
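
And a rough sketch of the union-and-dedupe idea from the second point, again 
assuming pyarrow.  The directory and column names are placeholders, and 
since there is no drop-duplicates kernel, pandas stands in for that last 
step:

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

staging = ds.dataset("staging_dir", format="feather")  # append-only data
view = ds.dataset("view_dir", format="parquet")        # materialized view

# Minimum date among the newly appended rows.
new_rows = staging.to_table()
min_date = pc.min(new_rows["date"]).as_py()

# Only the slice of the old view that could contain duplicates.
stale = view.to_table(filter=ds.field("date") >= min_date)

# Union the two, then order by date.
combined = pa.concat_tables([stale, new_rows])
indices = pc.sort_indices(combined, sort_keys=[("date", "ascending")])
combined = combined.take(indices)

# Drop duplicates via pandas for now; keep="last" keeps the latest row
# per date, assuming later rows are newer.
deduped = pa.Table.from_pandas(
    combined.to_pandas().drop_duplicates(subset="date", keep="last"),
    preserve_index=False,
)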



> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12358
>                 URL: https://issues.apache.org/jira/browse/ARROW-12358
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>             Fix For: 6.0.0
>
>
> Currently, dataset writing (e.g. with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you write to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. Erroring when there is pre-existing data in 
> the target directory is maybe the safest default, because both appending 
> and overwriting silently can be surprising behaviour depending on your 
> expectations.
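
A hedged sketch of the {{basename_template}} workaround mentioned above, 
assuming pyarrow; the dataset path and the uuid-based naming are 
illustrative choices, not anything the issue prescribes:

import uuid
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3]})  # toy data

# A unique basename per write avoids clobbering an earlier write's
# "part-0" files; the writer fills in "{i}" for each output fragment.
ds.write_dataset(
    table,
    "my_dataset",
    format="parquet",
    basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
)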


