[ https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Gruener resolved ARROW-2842.
-----------------------------------
    Resolution: Invalid

I have not been able to reproduce this reliably. It was likely due to an HDFS connection issue and not an issue with pyarrow.

> [Python] Cannot read parquet files with row group size of 1 From HDFS
> ---------------------------------------------------------------------
>
>                 Key: ARROW-2842
>                 URL: https://issues.apache.org/jira/browse/ARROW-2842
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Robert Gruener
>            Priority: Major
>         Attachments: single-row.parquet
>
>
> This might be a bug in parquet-cpp; I need to spend a bit more time tracking
> this down. Basically, given a file with a single row on HDFS, reading it
> with pyarrow yields this error:
> ```
> TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
>     @ Unknown
>     @ Unknown
>     @ Unknown
>     @ Unknown
>     @ Unknown
>     @ Unknown
>     @ Unknown
>     @ Unknown
>     @ Unknown
>     @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
>     @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
>     @ parquet::SerializedFile::ParseMetaData()
>     @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
>     @ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
>     @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
>     @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
> ```
> The following code causes it:
> ```
> import pyarrow
> import pyarrow.parquet as pq
>
> fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')  # fill in namenode information
> file_object = fs.open('single-row.parquet')  # update for hdfs path of file
> pq.read_metadata(file_object)  # this works
> parquet_file = pq.ParquetFile(file_object)
> parquet_file.read_row_group(0)  # throws error
> ```
>
> I am working on writing a unit test for this. Note that I am using libhdfs3.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
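
Since the trace fails inside parquet::SerializedFile::ParseMetaData(), one way to separate a file problem from a connection problem is to check the Parquet footer by hand. The following is only a rough sketch, not part of the original report: the helper name read_parquet_footer_length is hypothetical, it assumes the file object supports seek(), tell() and read(), and it relies only on the Parquet layout in which a file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1. It does not claim that the failing 8-byte read in the trace is this footer read.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
# Smallest possible layout: leading magic (4) + footer length (4) + trailing magic (4).
MIN_PARQUET_SIZE = 12


def read_parquet_footer_length(file_object):
    """Return the metadata length stored in the last 8 bytes of a Parquet file.

    Hypothetical diagnostic helper: raises ValueError if the tail of the
    file does not look like a Parquet footer, EOFError on a short read.
    """
    file_object.seek(0, io.SEEK_END)
    size = file_object.tell()
    if size < MIN_PARQUET_SIZE:
        raise ValueError("file too small to be Parquet: %d bytes" % size)

    # The last 8 bytes are: uint32 metadata length (little-endian) + b"PAR1".
    file_object.seek(size - 8)
    tail = file_object.read(8)
    if len(tail) != 8:
        raise EOFError("short read: got %d of 8 footer bytes" % len(tail))

    metadata_length, magic = struct.unpack("<I4s", tail)
    if magic != PARQUET_MAGIC:
        raise ValueError("bad trailing magic: %r" % magic)
    if metadata_length > size - MIN_PARQUET_SIZE:
        raise ValueError("metadata length %d exceeds file size" % metadata_length)
    return metadata_length
```

Running the helper against a local copy of single-row.parquet and then against the handle returned by fs.open(...) would show whether the same bytes are readable through both paths; if the local read succeeds and the HDFS read fails, the file itself is probably fine.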