Robert Gruener created ARROW-2842:
-------------------------------------

    Summary: [Python] Cannot read parquet files with row group size of 1 From HDFS
    Key: ARROW-2842
    URL: https://issues.apache.org/jira/browse/ARROW-2842
    Project: Apache Arrow
    Issue Type: Bug
    Components: Python
    Reporter: Robert Gruener
    Attachments: single-row.parquet

This might be a bug in parquet-cpp; I need to spend a bit more time tracking it down. Given a file with a single row on HDFS, reading it with pyarrow yields this error:

```
TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from "10.103.182.28:50010": End of the stream
    @ Unknown
    @ Unknown
    @ Unknown
    @ Unknown
    @ Unknown
    @ Unknown
    @ Unknown
    @ Unknown
    @ Unknown
    @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
    @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
    @ parquet::SerializedFile::ParseMetaData()
    @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
    @ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
    @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
    @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
```

The following code causes it:

```
import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect()  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file
pq.read_metadata(file_object)  # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error
```

I am working on writing a unit test for this.
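
Since the trace fails inside `parquet::SerializedFile::ParseMetaData()`, it may help to rule out a malformed footer in the file itself before digging into the HDFS read path. The following is a minimal standard-library sketch, not the parquet-cpp implementation. It relies only on the documented Parquet layout, in which a file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic. The helper names `read_footer_info` and `make_fake_parquet_bytes` are hypothetical, and the fake buffer mimics the layout only (its "metadata" is not valid Thrift).

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
FOOTER_SIZE = 8  # 4-byte little-endian metadata length + 4-byte magic


def read_footer_info(f):
    """Return (file_size, metadata_length) for a seekable binary file object.

    Hypothetical helper: it only checks the trailing 8 bytes and does not
    parse the Thrift-encoded metadata itself.
    """
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    if file_size < len(PARQUET_MAGIC) + FOOTER_SIZE:
        raise ValueError("file too small to be Parquet: %d bytes" % file_size)

    # Read the last 8 bytes: metadata length, then the magic.
    f.seek(file_size - FOOTER_SIZE)
    footer = f.read(FOOTER_SIZE)
    if len(footer) != FOOTER_SIZE:
        raise EOFError("short read: got %d of %d footer bytes"
                       % (len(footer), FOOTER_SIZE))

    metadata_length, magic = struct.unpack("<I4s", footer)
    if magic != PARQUET_MAGIC:
        raise ValueError("bad trailing magic: %r" % magic)

    # The metadata must fit between the leading magic and the footer.
    if metadata_length > file_size - FOOTER_SIZE - len(PARQUET_MAGIC):
        raise ValueError("metadata length %d exceeds file size %d"
                         % (metadata_length, file_size))
    return file_size, metadata_length


def make_fake_parquet_bytes(metadata=b"\x00" * 16, data=b"x"):
    """Build a buffer with the Parquet framing only (illustrative, not readable)."""
    return (PARQUET_MAGIC + data + metadata
            + struct.pack("<I", len(metadata)) + PARQUET_MAGIC)


if __name__ == "__main__":
    buf = make_fake_parquet_bytes()
    print(read_footer_info(io.BytesIO(buf)))
```

To apply this to the attached file, download `single-row.parquet` locally and pass `open('single-row.parquet', 'rb')` to `read_footer_info`. If the footer checks out locally, that would suggest the problem lies in how the file is read from HDFS and not in the file contents, though this check alone cannot confirm it.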