[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2021-09-29 Thread David Li (Jira)


[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17422446#comment-17422446 ]

David Li commented on ARROW-10439:
--

The approach used for Flight would really only work for IPC, unfortunately. It 
optimistically assumes batches are below the limit and hooks into the low-level 
IPC writer implementation so that it gets passed the already-serialized batches 
- that way, it doesn't waste work computing the actual serialized size (which 
is expensive) - and if a batch is over the size limit, it rejects it. The 
caller is then expected to try again. I suppose you could generalize this to 
CSV (by serializing rows to a buffer before writing them out), though that 
would be an expensive/invasive refactor (and I have no clue about Parquet).

I'll note that even the "in-memory size" can be difficult to compute if you 
have slices. The GetRecordBatchSize function actually serializes the batch 
under the hood and counts the bytes written.

> [C++][Dataset] Add max file size as a dataset writing option
> 
>
> Key: ARROW-10439
> URL: https://issues.apache.org/jira/browse/ARROW-10439
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Weston Pace
>Priority: Minor
>  Labels: beginner, dataset, query-engine
> Fix For: 6.0.0
>
>
> This should be specified as a row limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2021-09-29 Thread Weston Pace (Jira)


[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17422441#comment-17422441 ]

Weston Pace commented on ARROW-10439:
-

So the challenge with a bytes limit is that we need to know how many bytes are 
going to be written before the potentially blocking write call.  The way the 
file writers are currently structured, that is not easy.  Options available:

 * Modify the file writers to be truly asynchronous and return "{ bytes_queued: 
int64_t, write_future: Future<> }" (or they could return a Future<> and have a 
method to query how many total bytes have been queued to be written to the 
file).
 * Use the in-memory size of the data (the downside is that this can be quite 
different from the written size when compression is used, which is often the 
case).
 * Enforce a best-effort limit which checks the current file size when 
determining if a new file should be opened.  The problem in this case is that 
we will queue some number of batches more than we should, so the limit will be 
a soft limit that we will typically overshoot by some amount.
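The third option can be sketched roughly as follows. This is purely illustrative; `SoftLimitWriter` and its members are hypothetical names, not part of the Arrow API:

```python
import io

# Hypothetical sketch of a best-effort (soft) byte limit: check the
# bytes written so far *before* each write, and rotate to a new file
# once the limit has been passed. Batches already accepted can still
# push a file past the limit, hence "soft".
class SoftLimitWriter:
    def __init__(self, open_file, max_bytes):
        self._open_file = open_file      # callable: index -> file-like sink
        self._max_bytes = max_bytes
        self._index = 0
        self._bytes_written = 0
        self._sink = open_file(self._index)

    def write(self, payload: bytes):
        if self._bytes_written >= self._max_bytes:
            self._sink.close()
            self._index += 1
            self._sink = self._open_file(self._index)
            self._bytes_written = 0
        self._sink.write(payload)
        self._bytes_written += len(payload)

# Tiny demo with in-memory sinks: five 6-byte writes against a
# 10-byte limit end up spread across three "files".
sinks = {}
writer = SoftLimitWriter(lambda i: sinks.setdefault(i, io.BytesIO()),
                         max_bytes=10)
for _ in range(5):
    writer.write(b"x" * 6)
```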

Does anyone have any other ideas or suggestions, or a preference among the 
available options?  [~lidavidm], what was the approach used for Flight?



[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2021-09-14 Thread Weston Pace (Jira)


[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415285#comment-17415285 ]

Weston Pace commented on ARROW-10439:
-

I'm going to leave this open to track adding a `max_bytes_per_file` limit.



[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2021-08-23 Thread Weston Pace (Jira)


[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403455#comment-17403455 ]

Weston Pace commented on ARROW-10439:
-

https://github.com/apache/arrow/pull/10955 (as part of ARROW-13650) adds a 
`max_rows_per_file` option.  Max bytes is a little trickier (table.nbytes is 
the in-memory size, and I assume one would want the on-disk size), although 
doable (the file writers should be able to keep track of how many bytes 
they've written, but they don't do this today).  I'd prefer to avoid max bytes 
unless someone has a need for it, though.



[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2021-08-18 Thread Alessandro Molina (Jira)


[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17401132#comment-17401132 ]

Alessandro Molina commented on ARROW-10439:
---

It would probably make sense to see if we can come up with an API that allows 
something more flexible. For example, I can see cases where someone might want 
to start a new file every 1M rows instead of every 100MB.



[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2021-08-09 Thread Ruben Laguna (Jira)


[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395851#comment-17395851 ]

Ruben Laguna commented on ARROW-10439:
--

My current workaround is to use ParquetWriter and start a new file once the 
input data exceeds a given number of bytes. That indirectly limits the output 
file size, but the output size still varies with the compression ratio of the 
particular input data in each split.


https://stackoverflow.com/a/68679635/90580
{code:python}
import pyarrow.parquet as pq

# `table` is the pyarrow.Table being written repeatedly in this example
bytes_written = 0
index = 0
writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)

for i in range(30):
    writer.write_table(table)
    bytes_written += table.nbytes
    if bytes_written >= 500 * 1024 * 1024:  # ~500MB of in-memory data: start a new file
        writer.close()
        index += 1
        writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
        bytes_written = 0

writer.close()
{code}




[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2021-07-14 Thread Weston Pace (Jira)


[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380812#comment-17380812 ]

Weston Pace commented on ARROW-10439:
-

[~motibz] I'm sorry that I did not see this earlier.  One possible workaround 
is to split the input into multiple pieces manually and then call 
`write_dataset` multiple times, using a different `basename_template` for each 
write, but it is not a great workaround.



[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2021-05-03 Thread Mordechai Ben Zechariah (Jira)


[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338511#comment-17338511 ]

Mordechai Ben Zechariah commented on ARROW-10439:
-

Hi

[~westonpace] does anybody have a good workaround for this request?

I'm trying to use `write_dataset`, and I want to create multiple files in each 
partition.



[jira] [Commented] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2020-10-30 Thread Ben Kietzman (Jira)


[ https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223874#comment-17223874 ]

Ben Kietzman commented on ARROW-10439:
--

[~jorisvandenbossche]
