[ https://issues.apache.org/jira/browse/ARROW-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850190#comment-16850190 ]
Ivan Dimitrov commented on ARROW-5318:
--------------------------------------

Resolution at https://issues.apache.org/jira/browse/ARROW-5432

> [Python] pyarrow hdfs reader overrequests
> -------------------------------------------
>
>                 Key: ARROW-5318
>                 URL: https://issues.apache.org/jira/browse/ARROW-5318
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>            Reporter: Ivan Dimitrov
>            Priority: Blocker
>
> I am using pyarrow's HdfsFilesystem interface. When I request a read of n
> bytes, I often see 0%-300% more data sent over the network. My suspicion is
> that pyarrow is reading ahead.
> The pyarrow parquet reader doesn't have this behavior, and I am looking for a
> way to turn off read-ahead for the general HDFS interface.
> I am running on Ubuntu 14.04. This issue is present in pyarrow 0.10 - 0.13
> (newest released version). I am on Python 2.7.
> I have been using Wireshark to track the packets passed on the network.
> I suspect read-ahead because the time for the first read is much greater
> than the time for the second read.
>
> The regular pyarrow reader
> {code:python}
> import pyarrow as pa
> fs = pa.hdfs.connect(hostname, driver='libhdfs')
> file_path = 'dataset/train/piece0000'
> f = fs.open(file_path)
> f.seek(0)
> n_bytes = 3000000
> f.read(n_bytes)
> {code}
>
> Parquet code without the same issue
> {code:python}
> import pyarrow.parquet as pq
> parquet_path = 'dataset/train/parquet/part-22e3'
> pf = fs.open(parquet_path)
> pqf = pq.ParquetFile(pf)
> data = pqf.read_row_group(0, columns=['col_name'])
> {code}
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
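
The read-ahead suspicion described in the report can be illustrated without an HDFS cluster. The sketch below is only an analogy built on Python's standard io module (Python 3); it is not pyarrow or libhdfs code and says nothing about how either buffers internally. CountingRaw and measure_fetches are hypothetical names, and the in-memory payload stands in for the remote file. It shows the general pattern the reporter suspects: a buffered reader pulls more bytes from the source than the caller asked for on the first read, and then serves the second read from its buffer.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a remote file: counts bytes actually fetched."""

    def __init__(self, payload):
        super().__init__()
        self._src = io.BytesIO(payload)
        self.fetched = 0  # total bytes pulled from the "remote" source

    def readable(self):
        return True

    def readinto(self, b):
        n = self._src.readinto(b)
        self.fetched += n
        return n


def measure_fetches(n_bytes, buffer_size, payload_size=10000):
    """Return bytes fetched after a 1st buffered read, a 2nd one, and one direct read."""
    payload = b"x" * payload_size

    # Buffered path: the reader may fill its whole buffer on the first call.
    raw = CountingRaw(payload)
    buffered = io.BufferedReader(raw, buffer_size=buffer_size)
    buffered.read(n_bytes)
    after_first = raw.fetched
    buffered.read(n_bytes)
    after_second = raw.fetched

    # Unbuffered path: the raw object is asked only for what the caller wants.
    direct = CountingRaw(payload)
    direct.read(n_bytes)

    return after_first, after_second, direct.fetched


if __name__ == "__main__":
    first, second, direct = measure_fetches(n_bytes=1000, buffer_size=4096)
    print("buffered, after 1st read:", first)
    print("buffered, after 2nd read:", second)
    print("unbuffered, single read: ", direct)
```

Comparing the byte counts from the buffered and unbuffered paths is the same kind of check the reporter made with Wireshark, only at the Python level rather than on the wire.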