Re: [I] [Python] unexpected URL encoded path (white spaces) when uploading to S3 [arrow]

via GitHub Wed, 10 Apr 2024 05:35:54 -0700


mitchelladam commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-2047425289


   This happens with GCS as well as S3.
   We just encountered it when updating from pyarrow 10.0.1 to 14.0.2, but it
is present in all versions from 11.0.0 onwards.
   It occurs with both the GCSFS library and `pyarrow.fs.GcsFileSystem`.
   Example code:
   
   ```python
   import gcsfs
   import pyarrow as pa
   import pyarrow.fs as pafs
   import pyarrow.dataset as ds
   import datetime

   #%%
   fs = gcsfs.GCSFileSystem()
   #%%
   # Three timezone-aware timestamps, one, two and three days in the past.
   now = datetime.datetime.now(tz=datetime.timezone.utc)
   data = {
       "some_timestamp": [
           now - datetime.timedelta(days=1),
           now - datetime.timedelta(days=2),
           now - datetime.timedelta(days=3),
       ],
       "value1": ["hello", "world", "foo"],
       "value2": [123, 456, 789],
   }
   schema = pa.schema([
       pa.field("some_timestamp", pa.timestamp("ms")),
       pa.field("value1", pa.string()),
       pa.field("value2", pa.int64()),
   ])
   #%%
   result_pya_table = pa.Table.from_pydict(data, schema=schema)
   #%%
   # fs = pafs.GcsFileSystem()
   # Let pyarrow generate the hive partition directories from some_timestamp.
   ds.write_dataset(
       data=result_pya_table,
       base_dir="adam_ryota_data/pyarrowfstest/2023.12.2.post1-10.0.1/",
       format="parquet",
       partitioning=["some_timestamp"],
       partitioning_flavor="hive",
       existing_data_behavior="overwrite_or_ignore",
       basename_template="data-{i}.parquet",
       filesystem=fs,
   )
   ```
   
   10.0.1 results in:
   
![image](https://github.com/apache/arrow/assets/17411882/6b4e2eb0-0166-47a9-bef7-7cc447e3927d)
   
   11.0.0 or higher results in:
   
![image](https://github.com/apache/arrow/assets/17411882/2ec27759-167d-4571-9a51-e1dce6972bda)
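
For readers without access to the screenshots, the difference can be illustrated with the standard library alone. The sketch below assumes the 11.0.0+ directory names are ordinary percent-encoded forms of the partition value, as the issue title suggests. `encode_partition_value` and `decode_partition_value` are hypothetical helpers, not pyarrow APIs, and pyarrow's own encoder may treat some characters differently.

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    """Percent-encode a partition value (hypothetical helper, stdlib only)."""
    # safe="" also escapes "/", which cannot appear inside a single path segment.
    return quote(value, safe="")


def decode_partition_value(segment: str) -> str:
    """Reverse the percent-encoding to recover the original value."""
    return unquote(segment)


raw = "2024-04-07 11:13:27.169"
encoded = encode_partition_value(raw)

print(f"some_timestamp={raw}")      # directory name with the raw value
print(f"some_timestamp={encoded}")  # the same name after percent-encoding
print(decode_partition_value(encoded) == raw)
```

Under this assumption the space becomes `%20` and each colon becomes `%3A`. Decoding the directory name in this way is one possible stopgap for downstream tools that read the object paths directly.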
   
   
   Note that the overall URI is not what gets encoded: `base_dir` is written
unchanged, and only the partition directories generated inside the dataset are
affected.
   When the hive partition is instead written by hand as part of the path:
   ```python
   # Same table as above, but without the some_timestamp column.
   data = {
       # "some_timestamp": [datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=1),
       #                    datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=2),
       #                    datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=3)],
       "value1": ["hello", "world", "foo"],
       "value2": [123, 456, 789],
   }
   schema = pa.schema([
       # pa.field("some_timestamp", pa.timestamp("ms")),
       pa.field("value1", pa.string()),
       pa.field("value2", pa.int64()),
   ])
   #%%
   result_pya_table = pa.Table.from_pydict(data, schema=schema)
   #%%
   # fs = pafs.GcsFileSystem()
   # some_timestamp=2024-04-07 11:13:27.169
   # The partition directory is spelled out by hand inside base_dir,
   # so pyarrow does not generate any partition path itself.
   ds.write_dataset(
       data=result_pya_table,
       base_dir="adam_ryota_data/manualhive/2023.12.2.post1-10.0.1/some_timestamp=2024-04-07 11:13:27.169/",
       format="parquet",
       # partitioning=["some_timestamp"],
       # partitioning_flavor="hive",
       existing_data_behavior="overwrite_or_ignore",
       basename_template="data-{i}.parquet",
       filesystem=fs,
   )
   ```
   
   Even in 11.0.0+, the data is then written as expected.
   
![image](https://github.com/apache/arrow/assets/17411882/ecba744b-3808-42d6-893d-39609f9ef180)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
