[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400304#comment-17400304
 ] 

Weston Pace commented on ARROW-13644:
-------------------------------------

> You would acquire the semaphore when opening a file and release it when 
> closing the file

When writing a dataset we only close the files at the very end when all data 
has been processed.  There is no way to know ahead of time when we are finished 
working with a file.  The very last batch of a 50GB dataset might write to the 
same file that the first batch did.

The current implementation handles this by leaving all files open for the 
entire scan.  That leads to running out of file handles.  So as a compromise we 
can set a limit on how many files we have open.  When we reach the limit we 
have to close one of the currently open files.  This might mean we create more 
files than strictly necessary.  LRU is one way to choose which file to close.

This compromise won't always work well.  Maybe it will help to consider a case 
where LRU performs poorly.  If a dataset looks something like...

{code:csv}
part_col, val
1,0
2,1
3,2
4,3
5,4
1,5
2,6
3,7
4,8
5,9
1,10
2,11
3,12
4,13
5,14
{code}

If we partition on `part_col` with a batch size of 5 we will get three batches, 
each containing one row for every partition.  If we don't put any limit on how 
many files we can have open then we end up with 5 files:

{code}
part_col=1/part-0.arrow
part_col=2/part-1.arrow
part_col=3/part-2.arrow
part_col=4/part-3.arrow
part_col=5/part-4.arrow
{code}

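We will open 5 files on the first batch and keep them open for the entire 
scan.  If we need to limit how many files we have open (let's say 3) then we 
need to figure something out.  With LRU we'd end up with 15 files, one per row, 
because the partitions arrive in the same repeating order and so the file we 
need next is always one that we have already closed...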

{code}
part_col=1/part-0.arrow
part_col=1/part-5.arrow
part_col=1/part-10.arrow
part_col=2/part-1.arrow
part_col=2/part-6.arrow
...
{code}
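
To make the count concrete, here is a minimal sketch that replays the example through an LRU policy.  It is not the Arrow dataset writer: `count_files_created` and `max_open_files` are hypothetical names, rows are handled one at a time rather than per batch, and it assumes that a closed file is never reopened, so writing to an evicted partition always creates a new file.

```python
from collections import OrderedDict


def count_files_created(partition_keys, max_open_files=None):
    """Count files created when each row is appended to its partition's file.

    Hypothetical model: a closed file is never reopened, so a write to an
    evicted partition creates a brand-new file.
    """
    open_files = OrderedDict()  # partition key -> None, least recently used first
    files_created = 0
    for key in partition_keys:
        if key in open_files:
            open_files.move_to_end(key)  # mark as most recently used
            continue
        if max_open_files is not None and len(open_files) >= max_open_files:
            open_files.popitem(last=False)  # close the least recently used file
        open_files[key] = None
        files_created += 1
    return files_created


# part_col values from the example dataset: 1..5 repeated three times
part_col = [1, 2, 3, 4, 5] * 3

print(count_files_created(part_col))                    # 5
print(count_files_created(part_col, max_open_files=3))  # 15
```

With a limit of 3 every row misses the cache, because the partition about to be written is always the one that was evicted two rows earlier.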

Another way to handle it would be to sort the complete data by the partition 
column(s), but that would introduce a pipeline breaker, since nothing could be 
written until all of the data had been read.  Or we could close the file that 
has the most rows in it, but that would require a priority queue, which isn't 
really simpler.
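
As a rough illustration of the sorting trade-off, the sketch below counts files when only a single file may be open at a time.  The function name is hypothetical and the model again assumes a closed file is never reopened.  Note that `sorted` has to consume every row before it can emit the first one, which is exactly the pipeline breaker.

```python
def files_with_one_open_handle(partition_keys):
    """Count files created when only one file may be open at a time.

    Hypothetical model: switching to a different partition closes the current
    file, and a closed file is never reopened.
    """
    files_created = 0
    current = None  # partition key of the single open file, if any
    for key in partition_keys:
        if key != current:
            files_created += 1  # close the current file and start a new one
            current = key
    return files_created


# part_col values from the example dataset: 1..5 repeated three times
part_col = [1, 2, 3, 4, 5] * 3

print(files_with_one_open_handle(part_col))          # 15
print(files_with_one_open_handle(sorted(part_col)))  # 5
```

Sorted input reaches the minimum of 5 files even with one open handle, at the cost of buffering the whole dataset first.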

> [C++] Create LruCache that works with futures
> ---------------------------------------------
>
>                 Key: ARROW-13644
>                 URL: https://issues.apache.org/jira/browse/ARROW-13644
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> The dataset writer needs an LRU cache to keep track of open files so that it 
> can respect a "max open files" property (see ARROW-12321).  A synchronous 
> LruCache implementation already exists but on eviction from the cache we need 
> to wait until all pending writes have completed before we evict the item and 
> open a new file.  This ticket is to create an AsyncLruCache which will allow 
> the creation of items and the eviction of items to be asynchronous.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
