mitchelladam commented on issue #34905:
URL: https://github.com/apache/arrow/issues/34905#issuecomment-2047425289
This is the case for GCS as well as S3.
We just encountered this when updating from pyarrow 10.0.1 to 14.0.2, but it is
present in all versions from 11.0.0 onwards.
It occurs with both the gcsfs library and pyarrow.fs.GcsFileSystem.
Example code:
`import gcsfs
import pyarrow as pa
import pyarrow.fs as pafs
import pyarrow.dataset as ds
import datetime
#%%
fs = gcsfs.GCSFileSystem()
#%%
data = {
    "some_timestamp": [
        datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=1),
        datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=2),
        datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=3),
    ],
    "value1": ["hello", "world", "foo"],
    "value2": [123, 456, 789],
}
schema = pa.schema([
    pa.field("some_timestamp", pa.timestamp("ms")),
    pa.field("value1", pa.string()),
    pa.field("value2", pa.int64()),
])
#%%
result_pya_table = pa.Table.from_pydict(data, schema=schema)
#%%
# fs = pafs.GcsFileSystem()
ds.write_dataset(
    data=result_pya_table,
    base_dir=f"adam_ryota_data/pyarrowfstest/2023.12.2.post1-10.0.1/",
    format='parquet',
    partitioning=["some_timestamp"],
    partitioning_flavor='hive',
    existing_data_behavior='overwrite_or_ignore',
    basename_template="data-{i}.parquet",
    filesystem=fs,
)`
With 10.0.1, this results in:

With 11.0.0 or higher, it results in:

Note that the overall URI is not being encoded. Only the paths written
inside the dataset are affected by this.
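
For readers unfamiliar with what an encoded partition directory looks like, here is a minimal sketch using only Python's standard library. It is an illustration, not Arrow's own encoder: the helper name `encode_partition_segment` is hypothetical, and the assumption is that the encoding in question is ordinary percent-encoding of the partition value.

```python
from urllib.parse import quote, unquote


def encode_partition_segment(key: str, value: str) -> str:
    """Build a hive-style `key=value` segment with the value percent-encoded.

    Illustration only: this uses urllib, not Arrow's internal encoder.
    """
    # safe="" so that every reserved character (space, ':') is escaped.
    return f"{key}={quote(value, safe='')}"


raw_value = "2024-04-07 11:13:27.169"
segment = encode_partition_segment("some_timestamp", raw_value)
print(segment)

# Decoding the value part recovers the original timestamp string.
decoded = unquote(segment.split("=", 1)[1])
print(decoded == raw_value)
```

Under that assumption, the space and the colons in the timestamp are the characters that change, which is why timestamp partition keys make the difference so visible.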
When the hive partition is instead written manually as part of the base path:
`data = {
    # "some_timestamp": [
    #     datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=1),
    #     datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=2),
    #     datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=3),
    # ],
    "value1": ["hello", "world", "foo"],
    "value2": [123, 456, 789],
}
schema = pa.schema([
    # pa.field("some_timestamp", pa.timestamp("ms")),
    pa.field("value1", pa.string()),
    pa.field("value2", pa.int64()),
])
#%%
result_pya_table = pa.Table.from_pydict(data, schema=schema)
#%%
# fs = pafs.GcsFileSystem()
# The partition directory is written by hand here:
# some_timestamp=2024-04-07 11:13:27.169
ds.write_dataset(
    data=result_pya_table,
    base_dir=f"adam_ryota_data/manualhive/2023.12.2.post1-10.0.1/some_timestamp=2024-04-07 11:13:27.169/",
    format='parquet',
    # partitioning=["some_timestamp"],
    # partitioning_flavor='hive',
    existing_data_behavior='overwrite_or_ignore',
    basename_template="data-{i}.parquet",
    filesystem=fs,
)`
the data is written as expected, even in 11.0.0+.
