[jira] [Commented] (ARROW-5318) [Python] pyarrow hdfs reader overrequests

Ivan Dimitrov (JIRA) Tue, 28 May 2019 15:11:40 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850190#comment-16850190
 ]


Ivan Dimitrov commented on ARROW-5318:
--------------------------------------

See https://issues.apache.org/jira/browse/ARROW-5432 for the resolution.

> [Python] pyarrow hdfs reader overrequests  
> -------------------------------------------
>
>                 Key: ARROW-5318
>                 URL: https://issues.apache.org/jira/browse/ARROW-5318
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>            Reporter: Ivan Dimitrov
>            Priority: Blocker
>
> I am using pyarrow's HdfsFilesystem interface. When I request a read of n 
> bytes, anywhere from 0% to 300% more data than requested is often sent over 
> the network. My suspicion is that pyarrow is reading ahead.
> The pyarrow parquet reader doesn't have this behavior, and I am looking for a 
> way to turn off read-ahead for the general HDFS interface.
> I am running Ubuntu 14.04 with Python 2.7. This issue is present in pyarrow 
> 0.10 through 0.13 (the newest released version).
> I have been using Wireshark to track the packets sent over the network.
> I suspect read-ahead because the first read takes much longer than the 
> second read.
>  
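> The regular pyarrow reader, which shows the issue:
> {code:python}
> import pyarrow as pa
>
> # hostname is the HDFS namenode host (its value is not given in this report)
> fs = pa.hdfs.connect(hostname, driver='libhdfs')
> file_path = 'dataset/train/piece0000'
> f = fs.open(file_path)
> f.seek(0)
> n_bytes = 3000000
> # Request exactly n_bytes; more than this is observed on the network
> data = f.read(n_bytes)
> {code}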
>  
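> Parquet code without the same issue:
> {code:python}
> import pyarrow.parquet as pq
>
> # fs is the connection opened in the previous snippet
> parquet_path = 'dataset/train/parquet/part-22e3'
> pf = fs.open(parquet_path)
> pqf = pq.ParquetFile(pf)
> # Read a single column from the first row group
> data = pqf.read_row_group(0, columns=['col_name'])
> {code}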
>  
>  
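
The read-ahead suspicion in the report rests on the first read taking much longer than the second. A minimal sketch of that timing check follows, assuming Python 3 and any file-like object with a `read` method. The helper name `time_consecutive_reads` is hypothetical, and an in-memory `io.BytesIO` stands in for the HDFS file handle so the snippet runs without a cluster.

```python
import io
import time


def time_consecutive_reads(f, n_bytes, n_reads=2):
    """Time back-to-back reads of n_bytes from a file-like object.

    Returns a list of (bytes_returned, elapsed_seconds) tuples, one per read.
    """
    results = []
    for _ in range(n_reads):
        start = time.perf_counter()
        chunk = f.read(n_bytes)
        elapsed = time.perf_counter() - start
        results.append((len(chunk), elapsed))
    return results


# Stand-in for the HDFS handle (assumption): with pyarrow this would be the
# object returned by fs.open(file_path) in the report's first snippet.
f = io.BytesIO(b"\0" * 10000000)
f.seek(0)
timings = time_consecutive_reads(f, 3000000)
for i, (size, seconds) in enumerate(timings, start=1):
    print("read %d: %d bytes in %.6f s" % (i, size, seconds))
```

This measures only the latency seen by the caller. The number of bytes actually sent over the network still has to come from a packet capture, such as the Wireshark trace mentioned in the report.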



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
