[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228551#comment-17228551 ]
Joris Van den Bossche commented on ARROW-10517:
-----------------------------------------------

[~ldacey] can you paste the full error tracebacks you see? Right now it is quite hard to follow which error you get exactly and where it is coming from.

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> ------------------------------------------------------------------------
>
>                 Key: ARROW-10517
>                 URL: https://issues.apache.org/jira/browse/ARROW-10517
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 18.04
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: azureblob, dataset, dataset-parquet-read, dataset-parquet-write, fsspec
>
> If I downgrade adlfs to 0.2.5 and azure-storage-blob to 2.1, and then upgrade fsspec (0.6.2 has errors with the detail kwarg, so I need to upgrade it), the following write fails:
> {code:java}
> pa.dataset.write_dataset(
>     data=table,
>     base_dir="test/test7",
>     basename_template=None,
>     format="parquet",
>     partitioning=DirectoryPartitioning(
>         pa.schema([("year", pa.int64()), ("month", pa.int16()), ("day", pa.int16())])
>     ),
>     schema=table.schema,
>     filesystem=blob_fs,
> ){code}
> {code:java}
> 226 def create_dir(self, path, recursive):
> 227     # mkdir also raises FileNotFoundError when base directory is not found
> --> 228     self.fs.mkdir(path, create_parents=recursive){code}
> It does not look like this version of adlfs has a mkdir method. However, fs.find() returns a dictionary as expected:
> {code:java}
> selected_files = blob_fs.find(
>     "test/test6", maxdepth=None, withdirs=True, detail=True
> ){code}
> Now if I install the latest version of adlfs, it upgrades my blob SDK to 12.5 (unfortunately, I cannot use this in production since Airflow requires 2.1, so this is only for testing purposes):
> {code:java}
> Successfully installed adlfs-0.5.5 azure-storage-blob-12.5.0{code}
> Now fs.find() returns a list, but I am able to use fs.mkdir():
> {code:java}
> ['test/test6/year=2020',
>  'test/test6/year=2020/month=11',
>  'test/test6/year=2020/month=11/day=1',
>  'test/test6/year=2020/month=11/day=1/8ee6c66320ca47908c37f112f0cffd6c.parquet',
>  'test/test6/year=2020/month=11/day=1/ef753f016efc44b7b0f0800c35d084fc.parquet']{code}
> This causes issues later when I try to read a dataset (the code still expects a dictionary):
> {code:java}
> dataset = ds.dataset("test/test5", filesystem=blob_fs, format="parquet"){code}
> {code:java}
> --> 221 for path, info in selected_files.items():
>     222     infos.append(self._create_file_info(path, info))
>     223
> AttributeError: 'list' object has no attribute 'items'{code}
> I am still able to read individual files:
> {code:java}
> dataset = ds.dataset(
>     "test/test4/year=2020/month=11/2020-11.parquet",
>     filesystem=blob_fs,
>     format="parquet",
> ){code}
> And I can read the dataset if I pass in a list of blob names "manually":
> {code:java}
> blobs = wasb.list_blobs(container_name="test", prefix="test4")
> dataset = ds.dataset(
>     source=["test/" + blob.name for blob in blobs],
>     format="parquet",
>     partitioning="hive",
>     filesystem=blob_fs,
> ){code}
> For all of my examples, blob_fs is defined by:
> {code:java}
> blob_fs = fsspec.filesystem(
>     protocol="abfs", account_name=base.login, account_key=base.password
> ){code}
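As a side note for isolating the problem: the two snippets quoted in the tracebacks above look like they come from pyarrow's FSSpecHandler (pyarrow/fs.py), which expects the fsspec filesystem to provide mkdir() for dataset writes and find(..., detail=True) to return a {path: info} mapping for dataset reads. A minimal sketch for checking those two expectations directly against adlfs, with placeholder credentials and assuming adlfs is installed:

{code:java}
import fsspec

# Placeholder credentials; the "abfs" protocol assumes adlfs is installed.
fs = fsspec.filesystem(
    "abfs", account_name="<account>", account_key="<key>"
)

# pyarrow's FSSpecHandler.create_dir calls fs.mkdir(path, create_parents=...),
# so dataset writes fail if the fsspec implementation has no usable mkdir.
print(hasattr(fs, "mkdir"))

# FSSpecHandler.get_file_info_selector calls
# fs.find(base_dir, maxdepth=None, withdirs=True, detail=True) and then
# iterates over .items(), so it needs a dict back; a list reproduces the
# AttributeError quoted above.
listing = fs.find("test/test6", maxdepth=None, withdirs=True, detail=True)
print(type(listing))
{code}

Posting the output of those two checks together with the full tracebacks should make it clearer whether the failures are on the adlfs side or in the pyarrow handler.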