[ 
https://issues.apache.org/jira/browse/ARROW-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510062#comment-17510062
 ] 

Callista Rogers commented on ARROW-15910:
-----------------------------------------

Oh that's interesting. The first time I run fs.info, I get:

{'kind': 'storage#object',
 'id': 'MyBucket/path/name_of_parquet.parquet//1646930508287024',
 'selfLink': 'https://www.googleapis.com/storage/v1/b/MyBucket%2Fpath/name_of_parquet.parquet%2F',
 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/MyBucket/o/path%2Fname_of_parquet.parquet%2F?generation=1646930508287024&alt=media',
 'name': 'MyBucket/path/name_of_parquet.parquet/',
 'bucket': 'MyBucket',
 'generation': '1646930508287024',
 'metageneration': '1',
 'contentType': 'application/octet-stream',
 'storageClass': 'STANDARD',
 'size': 0,
 'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==',
 'crc32c': 'AAAAAA==',
 'etag': 'CLCYqp/+u/YCEAE=',
 'timeCreated': '2022-03-10T16:41:48.428Z',
 'updated': '2022-03-10T16:41:48.428Z',
 'timeStorageClassUpdated': '2022-03-10T16:41:48.428Z',
 'type': 'file'}


From {{pa_fs.get_file_info(file_path)}} the first time, I get:
<FileInfo for 'myBucket/features/MyParquet.parquet/': type=FileType.File, size=0>
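
Both results describe the trailing-slash path as a zero-byte file rather than a directory. A minimal sketch of how such an entry could be flagged from an fsspec-style info dict is below. The helper name {{looks_like_placeholder}} is hypothetical (it is not part of gcsfs or pyarrow), and whether this kind of object is what triggers the error in this issue is an assumption, not something confirmed here.

```python
def looks_like_placeholder(info: dict) -> bool:
    """Return True if the info dict describes a zero-byte object whose
    name ends in '/'. Hypothetical helper, not part of gcsfs or pyarrow."""
    return (
        info.get("type") == "file"
        and info.get("size") == 0
        and str(info.get("name", "")).endswith("/")
    )


# Fields copied from the fs.info output above; the other keys are omitted.
info = {
    "name": "MyBucket/path/name_of_parquet.parquet/",
    "size": 0,
    "type": "file",
}
print(looks_like_placeholder(info))
```

The check only reads the {{name}}, {{size}} and {{type}} keys, so it can be run on the dict above without any access to the bucket.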

> [Python] pyarrow.parquet.read_table either returns FileNotFound or 
> ArrowInvalid
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-15910
>                 URL: https://issues.apache.org/jira/browse/ARROW-15910
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.1, 7.0.0
>         Environment: GCP JupyterLab notebooks
>            Reporter: Callista Rogers
>            Priority: Major
>
> Running the code below results in {{"GetFileInfo() yielded path 
> 'myBucket/features/MyParquet.parquet/year=2022/part-0019.snappy.parquet' 
> which is outside base dir 'gs://myBucket/features/MyParquet.parquet/' "}}
> {code}
> import gcsfs
> import pyarrow.parquet as pq
>
> file_path = "gs://myBucket/features/MyParquet.parquet/"
> fs = gcsfs.GCSFileSystem()
> # Fails with the 'outside base dir' error quoted above
> table = pq.read_table(file_path, filesystem=fs)
> {code}
> Removing the gs:// from file_path results in a {{FileNotFoundError}}. Any 
> variation of / or // at the beginning of the path gives me the 'outside base 
> dir' error.
> I also ran the code below and got valid results with both file_path patterns, 
> so I know it finds the path just fine.
> {code}
> from pyarrow.fs import FileSelector, PyFileSystem, FSSpecHandler
> filesys = PyFileSystem(FSSpecHandler(fs))
> selector = FileSelector(file_path, recursive=True)
> filesys.get_file_info(selector)
> {code}


