Catherine created ARROW-7957:
--------------------------------

             Summary: ParquetDataset cannot take HadoopFileSystem as filesystem
                 Key: ARROW-7957
                 URL: https://issues.apache.org/jira/browse/ARROW-7957
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.16.0
            Reporter: Catherine

{{from pyarrow.fs import HadoopFileSystem}}
{{import pyarrow.parquet as pq}}

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
{{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
{{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}

fails with this error:

{{    raise IOError('Unrecognized filesystem: \{0}'.format(fs_type))}}
{{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}

When I tried using the deprecated {{HadoopFileSystem}}:

{{import pyarrow}}
{{import pyarrow.parquet as pq}}

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
{{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
{{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
{{pa_schema = dataset.schema.to_arrow_schema()}}
{{pieces = dataset.pieces}}
{{for piece in pieces:}}
{{    print(piece.path)}}

{{piece.path}} loses the {{hdfs://localhost:9000}} prefix.

I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as the filesystem? And {{piece.path}} should keep the prefix?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)