[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228551#comment-17228551 ]
Joris Van den Bossche commented on ARROW-10517:
-----------------------------------------------

[~ldacey] can you paste the full error tracebacks you see? Right now it is quite hard to follow which error you get exactly and where it is coming from.

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> ------------------------------------------------------------------------
>
>                 Key: ARROW-10517
>                 URL: https://issues.apache.org/jira/browse/ARROW-10517
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 18.04
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: azureblob, dataset, dataset-parquet-read, dataset-parquet-write, fsspec
>
> If I downgrade adlfs to 0.2.5 and azure-storage-blob to 2.1, and then upgrade fsspec (0.6.2 has errors with the detail kwarg, so I need to upgrade it), the following write fails:
> {code:java}
> pa.dataset.write_dataset(
>     data=table,
>     base_dir="test/test7",
>     basename_template=None,
>     format="parquet",
>     partitioning=DirectoryPartitioning(
>         pa.schema([("year", pa.int64()), ("month", pa.int16()), ("day", pa.int16())])
>     ),
>     schema=table.schema,
>     filesystem=blob_fs,
> ){code}
> {code:java}
> 226 def create_dir(self, path, recursive):
> 227     # mkdir also raises FileNotFoundError when base directory is not found
> --> 228     self.fs.mkdir(path, create_parents=recursive){code}
> It does not look like this version of adlfs has a mkdir method. However, fs.find() returns a dictionary as expected:
> {code:java}
> selected_files = blob_fs.find(
>     "test/test6", maxdepth=None, withdirs=True, detail=True
> ){code}
> Now if I install the latest version of adlfs, it upgrades my blob SDK to 12.5 (unfortunately, I cannot use this in production since Airflow requires 2.1, so this is only for testing purposes):
> {code:java}
> Successfully installed adlfs-0.5.5 azure-storage-blob-12.5.0{code}
> Now fs.find() returns a list, but I am able to use fs.mkdir():
> {code:java}
> ['test/test6/year=2020',
>  'test/test6/year=2020/month=11',
>  'test/test6/year=2020/month=11/day=1',
>  'test/test6/year=2020/month=11/day=1/8ee6c66320ca47908c37f112f0cffd6c.parquet',
>  'test/test6/year=2020/month=11/day=1/ef753f016efc44b7b0f0800c35d084fc.parquet']{code}
> This causes issues later when I try to read a dataset (the code still expects a dictionary):
> {code:java}
> dataset = ds.dataset("test/test5", filesystem=blob_fs, format="parquet"){code}
> {code:java}
> --> 221 for path, info in selected_files.items():
>     222     infos.append(self._create_file_info(path, info))
>     223
> AttributeError: 'list' object has no attribute 'items'{code}
> I am still able to read individual files:
> {code:java}
> dataset = ds.dataset(
>     "test/test4/year=2020/month=11/2020-11.parquet",
>     filesystem=blob_fs,
>     format="parquet",
> ){code}
> And I can read the dataset if I pass in a list of blob names "manually":
> {code:java}
> blobs = wasb.list_blobs(container_name="test", prefix="test4")
> dataset = ds.dataset(
>     source=["test/" + blob.name for blob in blobs],
>     format="parquet",
>     partitioning="hive",
>     filesystem=blob_fs,
> ){code}
> For all of my examples, blob_fs is defined by:
> {code:java}
> blob_fs = fsspec.filesystem(
>     protocol="abfs", account_name=base.login, account_key=base.password
> ){code}
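As a side note for isolating the problem: the two snippets quoted in the tracebacks above look like they come from pyarrow's FSSpecHandler (pyarrow/fs.py), which expects the fsspec filesystem to provide mkdir() for dataset writes and find(..., detail=True) to return a {path: info} mapping for dataset reads. A minimal sketch for checking those two expectations directly against adlfs, with placeholder credentials and assuming adlfs is installed:

{code:java}
import fsspec

# Placeholder credentials; the "abfs" protocol assumes adlfs is installed.
fs = fsspec.filesystem(
    "abfs", account_name="<account>", account_key="<key>"
)

# pyarrow's FSSpecHandler.create_dir calls fs.mkdir(path, create_parents=...),
# so dataset writes fail if the fsspec implementation has no usable mkdir.
print(hasattr(fs, "mkdir"))

# FSSpecHandler.get_file_info_selector calls
# fs.find(base_dir, maxdepth=None, withdirs=True, detail=True) and then
# iterates over .items(), so it needs a dict back; a list reproduces the
# AttributeError quoted above.
listing = fs.find("test/test6", maxdepth=None, withdirs=True, detail=True)
print(type(listing))
{code}

Posting the output of those two checks together with the full tracebacks should make it clearer whether the failures are on the adlfs side or in the pyarrow handler.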