[ https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Catherine updated ARROW-7957:
-----------------------------
    Description:
{{from pyarrow.fs import HadoopFileSystem}}
{{import pyarrow.parquet as pq}}

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
{{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
{{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}

fails with the error:
{{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}

When I tried using the deprecated {{HadoopFileSystem}}:
{{import pyarrow}}
{{import pyarrow.parquet as pq}}

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
{{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
{{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
{{pa_schema = dataset.schema.to_arrow_schema()}}
{{pieces = dataset.pieces}}
{{for piece in pieces:}}
{{    print(piece.path)}}

{{piece.path}} loses the {{hdfs://localhost:9000}} prefix.

I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as filesystem?
And {{piece.path}} should have the prefix?

  was:
{{from pyarrow.fs import HadoopFileSystem}}
{{import pyarrow.parquet as pq}}

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
{{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
{{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}

has error:
{{    raise IOError('Unrecognized filesystem: \{0}'.format(fs_type))}}
{{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}

When I tried using the deprecated {{HadoopFileSystem}}:
{{import pyarrow}}
{{import pyarrow.parquet as pq}}

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
{{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
{{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
{{pa_schema = dataset.schema.to_arrow_schema()}}
{{pieces = dataset.pieces}}
{{for piece in pieces:}}
{{    print(piece.path)}}

{{piece.path}} loses the {{hdfs://localhost:9000}} prefix.

I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as filesystem?
And {{piece.path}} should have the prefix?

> ParquetDataset cannot take HadoopFileSystem as filesystem
> ---------------------------------------------------------
>
>                 Key: ARROW-7957
>                 URL: https://issues.apache.org/jira/browse/ARROW-7957
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>            Reporter: Catherine
>            Priority: Critical

--
This message was sent by Atlassian Jira
(v8.3.4#803005)