[ https://issues.apache.org/jira/browse/ARROW-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs resolved ARROW-7661.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 6305
[https://github.com/apache/arrow/pull/6305]

> [Python] Non-optimal CSV chunking when no newline at end
> --------------------------------------------------------
>
>                 Key: ARROW-7661
>                 URL: https://issues.apache.org/jira/browse/ARROW-7661
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.14.1, 0.15.0, 0.15.1
>            Reporter: Sascha Hofmann
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.16.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We are reading a very simple csv (see below).
> The file is only 245 bytes so way below the default _block_size_ in the
> _ReadOptions_. Thus we would expect the resulting table to have only one
> batch. At least, if I understand correctly that a _block_ refers to the
> number of lines of certain byte size?
> The docs state: _This will determine multi-threading granularity as well as
> the size of individual chunks in the Table._ For me, that means also the size
> of individual batches?
> Previously, we thought by fixing the block_size to the total file size, we
> would ensure that even for files larger than 1MB we get a pa.Table with only
> one batch. This mini file seems to prove us wrong?
> Additionally, if I convert back and forth to pandas we get only one batch.
>
> To reproduce:
> {code:java}
> import os
> from pyarrow import csv as pc
> import pyarrow as pa
> path = "test.csv"
> read_options = pc.ReadOptions(block_size=os.stat(path).st_size)
> df = pc.read_csv(path, read_options=read_options)
> print(len(df.to_batches()))
> # returns 2
> print(pa.Table.from_batches([df.to_batches()[1]]).to_pandas())
> # returns the last line of the file
> pdf = df.to_pandas()
> ndf = pa.Table.from_pandas(pdf)
> print(len(ndf.to_batches()))
> # returns 1{code}
> test.csv:
> {code:java}
> "Name","Month","Change in %"
> "Surrey Quays","Sep 18","1.01"
> "Surrey Quays","Oct 18","0.38"
> "Surrey Quays","Nov 18","0.97"
> "Surrey Quays","Dec 18","1.28"
> "Surrey Quays","Jan 19","2.43"
> "Surrey Quays","Feb 19","2.49"
> "Surrey Quays","Mar 19","0.81"
> {code}
>
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
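
The issue title points at a missing newline at the end of the file as the trigger. The sketch below is not part of the original report and does not call pyarrow. It rebuilds the sample file with the standard library only and checks its size and last byte, assuming LF line endings and ASCII encoding. The helper names `write_sample` and `ends_with_newline` are hypothetical, chosen for illustration.

```python
import os
import tempfile

# Rows of test.csv exactly as quoted in the report.
ROWS = [
    '"Name","Month","Change in %"',
    '"Surrey Quays","Sep 18","1.01"',
    '"Surrey Quays","Oct 18","0.38"',
    '"Surrey Quays","Nov 18","0.97"',
    '"Surrey Quays","Dec 18","1.28"',
    '"Surrey Quays","Jan 19","2.43"',
    '"Surrey Quays","Feb 19","2.49"',
    '"Surrey Quays","Mar 19","0.81"',
]


def write_sample(path, trailing_newline):
    """Write the sample rows with LF separators (hypothetical helper)."""
    data = "\n".join(ROWS)
    if trailing_newline:
        data += "\n"
    # Binary mode avoids platform-specific newline translation.
    with open(path, "wb") as f:
        f.write(data.encode("ascii"))


def ends_with_newline(path):
    """Return True if the last byte of the file is a line feed."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # an empty file has no trailing newline
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b"\n"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")

    write_sample(path, trailing_newline=False)
    size_without = os.stat(path).st_size
    newline_without = ends_with_newline(path)

    write_sample(path, trailing_newline=True)
    size_with = os.stat(path).st_size
    newline_with = ends_with_newline(path)

print(size_without, newline_without)
print(size_with, newline_with)
```

Under these assumptions the variant without a trailing newline comes to 245 bytes, which matches the size given in the report. A check like `ends_with_newline` could be run on an input file before reading it with an affected version to see whether the file falls into the case described in the title.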