[jira] [Comment Edited] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True

2021-04-19 Thread David Li (Jira)


[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325004#comment-17325004 ]

David Li edited comment on ARROW-12428 at 4/19/21, 12:39 PM:

{noformat}
Whole file:
Pandas/S3FS (no pre-buffer, no readahead): 692.2505334559828 seconds
Pandas/S3FS (no pre-buffer, readahead): 99.55904859001748 seconds
Pandas/S3FS (pre-buffer, no readahead): 39.282157234149054 seconds
Pandas/S3FS (pre-buffer, readahead): 41.564441804075614 seconds
PyArrow (no pre-buffer): 242.97687190794386 seconds
PyArrow (pre-buffer): 39.5321765630506 seconds
===
Column selection:
Pandas/S3FS (no pre-buffer, no readahead): 153.64498204295523 seconds
Pandas/S3FS (no pre-buffer, readahead): 82.44589220592752 seconds
Pandas/S3FS (pre-buffer, no readahead): 114.55768134980462 seconds
Pandas/S3FS (pre-buffer, readahead): 133.1232347697951 seconds
PyArrow (no pre-buffer): 54.11452938010916 seconds
PyArrow (pre-buffer): 12.865494727157056 seconds
{noformat}
{code:python}
import time
import pandas as pd
import pyarrow.fs
import pyarrow.parquet as pq

columns = ['vendor_id', 'pickup_latitude', 'pickup_longitude', 'extra']

print("Whole file:")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                     storage_options={
                         'default_block_size': 1,  # 0 is ignored
                         'default_fill_cache': False,
                     }, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, no readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                     pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                     storage_options={
                         'default_block_size': 1,  # 0 is ignored
                         'default_fill_cache': False,
                     }, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                     pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, readahead):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                    pre_buffer=False).to_pandas()
duration = time.monotonic() - start
print("PyArrow (no pre-buffer):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                    pre_buffer=True).to_pandas()
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds")

print("===")
print("Column selection:")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                     storage_options={
                         'default_block_size': 1,  # 0 is ignored
                         'default_fill_cache': False,
                     }, columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, no readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                     columns=columns, pre_buffer=False)
duration = time.monotonic() - start
print("Pandas/S3FS (no pre-buffer, readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                     storage_options={
                         'default_block_size': 1,  # 0 is ignored
                         'default_fill_cache': False,
                     }, columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds")

start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                     columns=columns, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, readahead):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                    columns=columns, pre_buffer=False).to_pandas()
duration = time.monotonic() - start
print("PyArrow (no pre-buffer):", duration, "seconds")

start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet",
                    columns=columns, pre_buffer=True).to_pandas()
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds")
{code}
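
For the pure-PyArrow cases, the filesystem can also be constructed explicitly instead of being inferred from the {{s3://}} URI. A minimal sketch of that variant; the region and anonymous access are assumptions about the public bucket, not taken from the script above:
{code:python}
import pyarrow.fs
import pyarrow.parquet as pq

# Explicit-filesystem equivalent of the URI form used above.
# region/anonymous are assumptions for the public ursa-labs bucket.
s3 = pyarrow.fs.S3FileSystem(region="us-east-2", anonymous=True)
table = pq.read_table("ursa-labs-taxi-data/2012/01/data.parquet",
                      filesystem=s3, pre_buffer=True)
{code}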


[jira] [Comment Edited] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True

2021-04-16 Thread David Li (Jira)


[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324043#comment-17324043 ]

David Li edited comment on ARROW-12428 at 4/16/21, 7:41 PM:


And for local files, to confirm that pre_buffer isn't a net negative:
{noformat}
Pandas: 14.584974920144305 seconds
PyArrow: 6.650648137088865 seconds
PyArrow (pre-buffer): 6.587288308190182 seconds
{noformat}
This is on a system with NVMe storage, so results may vary on spinning rust or
SATA SSDs.

(Updated results to do one untimed read before the measured read, in case the
disk cache is a factor.)
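
The local harness itself isn't shown here; a minimal sketch of what it could look like, assuming a hypothetical local copy of the same file (everything else mirrors the S3 script above):
{code:python}
import time
import pandas as pd
import pyarrow.parquet as pq

path = "/data/taxi/2012/01/data.parquet"  # hypothetical local copy

# One untimed read up front so the disk cache is in the same state
# for every measured run.
pq.read_table(path)

start = time.monotonic()
df = pd.read_parquet(path)
print("Pandas:", time.monotonic() - start, "seconds")

start = time.monotonic()
df = pq.read_pandas(path, pre_buffer=False).to_pandas()
print("PyArrow:", time.monotonic() - start, "seconds")

start = time.monotonic()
df = pq.read_pandas(path, pre_buffer=True).to_pandas()
print("PyArrow (pre-buffer):", time.monotonic() - start, "seconds")
{code}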


was (Author: lidavidm):
And for local files, to confirm that pre_buffer isn't a negative:
{noformat}
Pandas: 14.566267257090658 seconds
PyArrow: 6.649410092970356 seconds
PyArrow (pre-buffer): 6.627140663098544 seconds
{noformat}
This is on a system with NVMe storage, so results may vary on spinning rust or
SATA SSDs.

> [Python] pyarrow.parquet.read_* should use pre_buffer=True
> ----------------------------------------------------------
>
> Key: ARROW-12428
> URL: https://issues.apache.org/jira/browse/ARROW-12428
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: David Li
> Assignee: David Li
> Priority: Major
> Fix For: 5.0.0
>
>
> If the user is synchronously reading a single file, we should try to read it
> as fast as possible. The one sticking point is whether it's beneficial to
> enable this regardless of the filesystem, or whether we should enable it only
> on high-latency filesystems.
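
One hypothetical shape for a filesystem-dependent default (illustrative only; {{default_pre_buffer}} is not an existing pyarrow API):
{code:python}
import pyarrow.fs

def default_pre_buffer(filesystem):
    # Treat object stores such as S3 as high-latency and pre-buffer
    # there by default; leave the local-filesystem default unchanged.
    return isinstance(filesystem, pyarrow.fs.S3FileSystem)
{code}
Given that the local-file numbers above suggest pre_buffer is roughly neutral on fast local storage, enabling it unconditionally may be just as defensible.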


