[
https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17659867#comment-17659867
]
Rok Mihevc commented on ARROW-2842:
-----------------------------------
This issue has been migrated to [issue
#19217|https://github.com/apache/arrow/issues/19217] on GitHub. Please see the
[migration documentation|https://github.com/apache/arrow/issues/14542] for
further details.
> [Python] Cannot read parquet files with row group size of 1 from HDFS
> ---------------------------------------------------------------------
>
> Key: ARROW-2842
> URL: https://issues.apache.org/jira/browse/ARROW-2842
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Robbie Gruener
> Priority: Major
> Attachments: single-row.parquet
>
>
> This might be a bug in parquet-cpp; I need to spend a bit more time tracking
> it down. Basically, given a file with a single row on HDFS, reading it with
> pyarrow yields this error:
> ```
> TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the
> stream
> @ Unknown
> @ Unknown
> @ Unknown
> @ Unknown
> @ Unknown
> @ Unknown
> @ Unknown
> @ Unknown
> @ Unknown
> @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
> @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
> @ parquet::SerializedFile::ParseMetaData()
> @
> parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource,
> std::default_delete<parquet::RandomAccessSource> >,
> parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData>
> const&)
> @
> parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource,
> std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties
> const&, std::shared_ptr<parquet::FileMetaData> const&)
> @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile>
> const&, arrow::MemoryPool*, parquet::ReaderProperties const&,
> std::shared_ptr<parquet::FileMetaData> const&,
> std::unique_ptr<parquet::arrow::FileReader,
> std::default_delete<parquet::arrow::FileReader> >*)
> @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*,
> _object*)
> ```
> The following code causes it:
> ```
> import pyarrow
> import pyarrow.parquet as pq
>
> fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in
> namenode information
> file_object = fs.open('single-row.parquet') # update for hdfs path of file
> pq.read_metadata(file_object) # this works
> parquet_file = pq.ParquetFile(file_object)
> parquet_file.read_row_group(0) # throws error
> ```
>
> I am working on writing a unit test for this. Note that I am using libhdfs3.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)