[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400315#comment-17400315
 ] 

Antoine Pitrou commented on ARROW-13644:


Ah, I see. My mistake.

> [C++] Create LruCache that works with futures
> -
>
> Key: ARROW-13644
> URL: https://issues.apache.org/jira/browse/ARROW-13644
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> The dataset writer needs an LRU cache to keep track of open files so that it 
> can respect a "max open files" property (see ARROW-12321).  A synchronous 
> LruCache implementation already exists but on eviction from the cache we need 
> to wait until all pending writes have completed before we evict the item and 
> open a new file.  This ticket is to create an AsyncLruCache which will allow 
> the creation of items and the eviction of items to be asynchronous.
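
The behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Arrow implementation: it uses Python's asyncio in place of Arrow futures, and `AsyncLruCache`, `open_file`, and `close_file` are hypothetical names. The point it shows is that both the eviction and the creation step are awaited, so an evicted file can finish its pending writes before the next file is opened.

```python
import asyncio
from collections import OrderedDict


class AsyncLruCache:
    """Hypothetical sketch: an LRU cache whose create and evict steps are awaited."""

    def __init__(self, capacity, create, evict):
        self._capacity = capacity
        self._create = create          # async callable: key -> value
        self._evict = evict            # async callable: (key, value) -> None
        self._items = OrderedDict()    # least recently used entry first
        self._lock = asyncio.Lock()    # serialize lookups so the limit holds

    async def get(self, key):
        async with self._lock:
            if key in self._items:
                self._items.move_to_end(key)   # mark as most recently used
                return self._items[key]
            if len(self._items) >= self._capacity:
                old_key, old_value = self._items.popitem(last=False)
                # Wait until the evicted item is fully closed before creating
                # a new one, so the limit is never exceeded.
                await self._evict(old_key, old_value)
            value = await self._create(key)
            self._items[key] = value
            return value


async def main():
    events = []

    async def open_file(name):
        events.append(f"open {name}")
        return name                    # stand-in for a real file handle

    async def close_file(name, handle):
        await asyncio.sleep(0)         # stand-in for waiting on pending writes
        events.append(f"close {name}")

    cache = AsyncLruCache(2, open_file, close_file)
    for name in ["X", "Y", "X", "Z"]:
        await cache.get(name)
    return events


events = asyncio.run(main())
print(events)  # ['open X', 'open Y', 'close Y', 'open Z']
```

With a limit of two, the least recently used file (Y) is closed before Z is opened, while X stays open because it was used more recently.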



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400312#comment-17400312
 ] 

Weston Pace commented on ARROW-13644:

I do not append.  I close the file and start a new file.  That 15-file "worst 
case" would be 15 files, each with 1 row.



[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400307#comment-17400307
 ] 

Antoine Pitrou commented on ARROW-13644:


This assumes that your file format supports appending efficiently. This does 
not seem to be easily resolvable in the general case.



[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400304#comment-17400304
 ] 

Weston Pace commented on ARROW-13644:

> You would acquire the semaphore when opening a file and release it when 
> closing the file

When writing a dataset we only close the files at the very end when all data 
has been processed.  There is no way to know ahead of time when we are finished 
working with a file.  The very last batch of a 50GB dataset might write to the 
same file that the first batch did.

The current implementation handles this by leaving all files open for the 
entire scan.  That leads to running out of file handles.  So as a compromise we 
can set a limit on how many files we have open.  When we reach the limit we 
have to close one of the currently open files.  This might mean we create more 
files than strictly necessary.  LRU is one way to choose which file to close.

This compromise won't always work well.  Maybe it will help to consider a case 
where LRU performs poorly.  If a dataset looks something like...

{code:csv}
part_col, val
1,0
2,1
3,2
4,3
5,4
1,5
2,6
3,7
4,8
5,9
1,10
2,11
3,12
4,13
5,14
{code}

If we partition on `part_col` with a batch size of 5, we get three batches.  If 
we place no limit on how many files can be open, we end up with 5 files:

{code}
part_col=1/part-0.arrow
part_col=2/part-1.arrow
part_col=3/part-2.arrow
part_col=4/part-3.arrow
part_col=5/part-4.arrow
{code}

We will open 5 files on the first batch and keep them open for the entire read.  
If we need to limit how many files we have open (let's say to 3), then we have 
to close files along the way.  With LRU, each row arrives just after its file 
has been evicted, so we'd end up with 15 files...

{code}
part_col=1/part-0.arrow
part_col=1/part-5.arrow
part_col=1/part-10.arrow
part_col=2/part-1.arrow
part_col=2/part-6.arrow
...
{code}
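
The file counts above can be checked with a small simulation. This is a sketch under assumptions, not the dataset writer itself: `simulate` is a hypothetical helper, rows are handled one at a time in the order they appear, and each newly opened file takes the next `part-N` number.

```python
from collections import OrderedDict


def simulate(part_values, max_open):
    """Return the paths of all files created while writing the given rows."""
    open_files = OrderedDict()   # partition value -> path, least recent first
    created = []
    for part in part_values:
        if part in open_files:
            open_files.move_to_end(part)      # mark as most recently used
            continue
        if max_open is not None and len(open_files) >= max_open:
            open_files.popitem(last=False)    # close the least recently used file
        path = f"part_col={part}/part-{len(created)}.arrow"
        open_files[part] = path
        created.append(path)
    return created


part_col = [1, 2, 3, 4, 5] * 3   # the part_col column from the CSV above

unlimited = simulate(part_col, max_open=None)
limited = simulate(part_col, max_open=3)

print(len(unlimited))        # 5
print(len(limited))          # 15
print(sorted(limited)[:3])
# ['part_col=1/part-0.arrow', 'part_col=1/part-10.arrow', 'part_col=1/part-5.arrow']
```

Because the five partition values repeat in a cycle and only three files may be open, every row finds that its file has already been evicted, which is the classic worst case for LRU.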

Another way to handle it would be to sort the complete data by the partition 
column(s), but that would introduce a pipeline breaker.  Or we could close the 
file that has the most rows in it, but that would require a priority queue, 
which isn't really simpler.



[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400283#comment-17400283
 ] 

Antoine Pitrou commented on ARROW-13644:


Of course, you would still need an async semaphore (one where Acquire() / 
Lock() returns a Future). But that's much simpler than an async LRU cache.
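
For illustration, a semaphore whose acquire step returns a future might look roughly like the sketch below. It is an assumption-laden stand-in, not an Arrow API: `AsyncSemaphore` is a hypothetical name, and Python's `concurrent.futures.Future` plays the role of an Arrow `Future`.

```python
import threading
from collections import deque
from concurrent.futures import Future


class AsyncSemaphore:
    """Hypothetical sketch: acquire() never blocks, it returns a Future."""

    def __init__(self, permits):
        self._permits = permits
        self._waiters = deque()        # futures waiting for a permit, in order
        self._lock = threading.Lock()

    def acquire(self):
        fut = Future()
        with self._lock:
            if self._permits > 0:
                self._permits -= 1
                granted = True
            else:
                self._waiters.append(fut)
                granted = False
        if granted:
            fut.set_result(None)       # permit available immediately
        return fut

    def release(self):
        with self._lock:
            if not self._waiters:
                self._permits += 1
                return
            fut = self._waiters.popleft()
        fut.set_result(None)           # hand the permit to the oldest waiter


sem = AsyncSemaphore(2)
a, b, c = sem.acquire(), sem.acquire(), sem.acquire()
print(a.done(), b.done(), c.done())    # True True False
sem.release()
print(c.done())                        # True
```

The third future completes only once a permit is released, so the caller can chain work onto it instead of blocking a thread.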



[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400281#comment-17400281
 ] 

Antoine Pitrou commented on ARROW-13644:


You would acquire the semaphore when opening a file and release it when closing 
the file. I'm not sure I understand your use case?



[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400247#comment-17400247
 ] 

Weston Pace commented on ARROW-13644:

I guess I'm not sure I follow.  Let's assume the user only wants two files open 
(obviously it should never be this low).  A batch comes in that is partitioned 
on files X and Y.  Then a batch comes in that is partitioned on files X and Z.  
I want to close file Y before I open file Z.  With a semaphore I have something 
like...


{code:python}
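def queue_write(f, batch):
  # actual write
  release(1)  # permit returned once the write finishes; f stays open

for f, batch in files:
  acquire(1)  # waits for a free permit before queuing the write
  queue_write(f, batch)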
{code}

It seems to me that the semaphore would just block indefinitely.  File Y will 
not close itself, even when it finishes writing (files are held open in case 
more data comes for that file).
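
The blocking concern can be reproduced with a small sketch in which a permit is taken when a file is opened and returned only when it is closed. The names `try_open`, `close`, and `open_files` are hypothetical, and a non-blocking acquire is used so the example reports the stall instead of hanging.

```python
import threading

MAX_OPEN = 2
permits = threading.Semaphore(MAX_OPEN)
open_files = {}   # hypothetical stand-in for real file handles


def try_open(name):
    """Open `name` if a permit is free; return False where acquire() would block."""
    if name in open_files:
        return True
    if not permits.acquire(blocking=False):
        return False              # a blocking acquire() would wait here forever
    open_files[name] = object()
    return True


def close(name):
    del open_files[name]
    permits.release()             # the permit only comes back on close


first = [try_open(f) for f in ("X", "Y")]    # first batch: files X and Y
second = [try_open(f) for f in ("X", "Z")]   # second batch: files X and Z
print(first, second)                         # [True, True] [True, False]

close("Y")                                   # something has to pick a file to close
after_close = try_open("Z")
print(after_close)                           # True
```

Z can only be opened after some other component decides to close Y; the semaphore counts open files but does not choose which one to close.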



[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400239#comment-17400239
 ] 

Antoine Pitrou commented on ARROW-13644:


> A semaphore will allow me to control how many files I'm writing to at once 
> but not how many files I have open

Why not?



[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400238#comment-17400238
 ] 

Weston Pace commented on ARROW-13644:

A semaphore will allow me to control how many files I'm writing to at once but 
not how many files I have open (whether I am writing to them or not).  I need 
to close the old files so that the file handle is returned to the OS.  That 
means closing the writer and starting a new file later if more data comes in 
for that file.



[jira] [Commented] (ARROW-13644) [C++] Create LruCache that works with futures

2021-08-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400233#comment-17400233
 ] 

Antoine Pitrou commented on ARROW-13644:


Why do you need an LRU cache for this? A semaphore-like facility should be 
sufficient.
