[jira] [Comment Edited] (ARROW-18113) Implement a read range process without caching

Jira Tue, 15 Nov 2022 21:13:18 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634647#comment-17634647
 ]


Percy Camilo Triveño Aucahuasi edited comment on ARROW-18113 at 11/16/22 5:12 
AM:
----------------------------------------------------------------------------------

{quote}Sounds reasonable to me, at first glance. Maybe leave behind a using 
CacheOptions = CoalesceOptions for compatibility (can you deprecate a using 
declaration?) 
{quote}
David, I think we can deprecate it (in case we choose to rename and move it).
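
On the question of whether a using declaration can be deprecated: standard C++14 allows the {{[[deprecated]]}} attribute on an alias declaration, so a compatibility alias could look roughly like the sketch below. The namespace and the struct fields are illustrative stand-ins, not Arrow's actual declarations, and Arrow may prefer its own deprecation macro.

```cpp
#include <cstdint>
#include <type_traits>

namespace sketch {
namespace io {

// Hypothetical stand-in for the renamed options struct; the fields and
// their defaults are illustrative and not copied from Arrow.
struct CoalesceOptions {
  int64_t hole_size_limit = 8192;
  int64_t range_size_limit = 32 * 1024 * 1024;
};

// Compatibility alias: code that still spells CacheOptions keeps compiling,
// but the compiler emits a deprecation warning wherever the alias is used.
using CacheOptions [[deprecated("use CoalesceOptions instead")]] = CoalesceOptions;

}  // namespace io
}  // namespace sketch
```

Because the alias names the same type, existing callers need no source changes beyond eventually switching to the new name.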
{quote}So if we add a ReadManyAsync then I think it should not have a cache 
options parameter. Instead that should be a property of the filesystem if it 
needs to be configurable.
{quote}
Weston, yes, initially I thought that _CoalesceOptions_ would be part of 
_arrow::io::IOContext_ (as an attribute) and _ReadManyAsync_ could use/pass the 
_CoalesceOptions_ to the filesystem. But it makes sense to let the filesystem 
handle all of that, so in that case:
 # we may still choose to rename _arrow::io::CacheOptions_ to 
_arrow::io::CoalesceOptions_ and move it into _interfaces.h_, so 
each filesystem's ctor will require _arrow::io::CoalesceOptions_.
 # or we can just include _caching.h_ in every filesystem declaration without 
renaming _arrow::io::CacheOptions_ (so each filesystem's ctor will 
require _arrow::io::CacheOptions_).

Let me know which one sounds better to you, thanks.
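
To make the shape of both options concrete, here is a minimal sketch of a filesystem that receives the options in its ctor and exposes a read-many method with no options parameter. All names here (InMemoryFileSystem, ReadMany, ReadRange, the struct fields) are hypothetical and synchronous for brevity; the two options above differ only in what the options type is called and which header declares it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the options type (CacheOptions or CoalesceOptions).
struct CoalesceOptions {
  int64_t hole_size_limit = 8192;               // largest gap worth reading through
  int64_t range_size_limit = 32 * 1024 * 1024;  // largest merged read
};

struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Toy filesystem backed by one in-memory "file".
class InMemoryFileSystem {
 public:
  // Coalescing is configured once, when the filesystem is constructed.
  InMemoryFileSystem(std::string data, CoalesceOptions options)
      : data_(std::move(data)), options_(options) {}

  const CoalesceOptions& coalesce_options() const { return options_; }

  // No options parameter: a real implementation would be asynchronous and
  // would consult options_ to merge ranges; this one serves each range as-is.
  std::vector<std::string> ReadMany(const std::vector<ReadRange>& ranges) const {
    std::vector<std::string> out;
    out.reserve(ranges.size());
    for (const ReadRange& r : ranges) {
      out.push_back(data_.substr(static_cast<std::size_t>(r.offset),
                                 static_cast<std::size_t>(r.length)));
    }
    return out;
  }

 private:
  std::string data_;
  CoalesceOptions options_;
};
```

Callers only pass ranges, so how (or whether) a given filesystem coalesces stays an internal detail of that filesystem.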


was (Author: aucahuasi):
??Sounds reasonable to me, at first glance. Maybe leave behind a {{using 
CacheOptions = CoalesceOptions}} for compatibility (can you deprecate a 
{{using}} declaration?)??
 
David, I think we can deprecate it (in case we choose to rename&move it).

 

??So if we add a _ReadManyAsync_ then I think it should not have a cache 
options parameter. Instead that should be a property of the filesystem if it 
needs to be configurable.??

 

Weston, yes, initially I thought that _CoalesceOptions_ would be part of 
_arrow::io::IOContext_ (as an attribute) and _ReadManyAsync_ could use/pass the 
_CoalesceOptions_ to the filesystem. But it makes sense to let the filesystem 
handle all of that, so in that case:
 # we may still choose to rename _arrow::io::CacheOptions_ to 
_arrow::io::CoalesceOptions_ and move it into _interfaces.h_, so 
each filesystem's ctor will require _arrow::io::CoalesceOptions_.
 # or we can just include _caching.h_ in every filesystem declaration without 
renaming _arrow::io::CacheOptions_ (so each filesystem's ctor will 
require _arrow::io::CacheOptions_).

Let me know which one sounds better to you, thanks.

> Implement a read range process without caching
> ----------------------------------------------
>
>                 Key: ARROW-18113
>                 URL: https://issues.apache.org/jira/browse/ARROW-18113
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Percy Camilo Triveño Aucahuasi
>            Assignee: Percy Camilo Triveño Aucahuasi
>            Priority: Major
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  mixes caching with coalescing, which makes it difficult to implement readers 
> that can truly perform concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal of this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing 
> part). Once we have that new capability, we can port the Parquet and 
> IPC readers to this new component and keep improving the reading process 
> (that would be part of another set of follow-up tickets). Similar ideas were 
> mentioned in https://issues.apache.org/jira/browse/ARROW-17599
> A good place to implement this new capability may be inside the file system 
> abstraction (as a dedicated method to read coalesced data), where 
> the abstract file system can provide a default implementation.
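
For readers unfamiliar with the term, the "coalescing part" mentioned in the description can be sketched roughly as follows: sort the requested ranges, then merge neighbours whose gap is small enough, as long as the merged read stays under a size cap. This is an illustration only and not the code in caching.h; the function name, the ReadRange struct, and both limit parameters are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges separated by at most hole_size_limit bytes, provided the
// merged range does not exceed range_size_limit bytes. No caching involved.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_size_limit,
                                      int64_t range_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) { return a.offset < b.offset; });

  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (r.length <= 0) continue;  // nothing to read for empty ranges
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      const int64_t new_end = std::max(last_end, r.offset + r.length);
      const bool close_enough = r.offset - last_end <= hole_size_limit;
      const bool small_enough = new_end - last.offset <= range_size_limit;
      if (close_enough && small_enough) {
        last.length = new_end - last.offset;  // extend the previous range
        continue;
      }
    }
    merged.push_back(r);  // start a new coalesced range
  }
  return merged;
}
```

With a hole limit of 4 bytes, the ranges (0, 10), (12, 8) and (100, 5) would become (0, 20) and (100, 5): the 2-byte gap is read through, the 80-byte gap is not. Each merged range can then be issued as an independent read, which is what would allow concurrent invocations without a shared cache.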



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
