There isn’t really a need to read the Parquet files into full Arrow tables.
You can just use pyarrow.parquet.ParquetFile to read the row groups from the
smaller files and write them to a new file.
________________________________
From: Karthik Deivasigamani via user <[email protected]>
Sent: Wednesday, October 9, 2024 6:39:59 AM
To: [email protected] <[email protected]>
Subject: PyArrow <-> Pandas Timestamp Conversion Error
Hi,
I have a simple use case: merging data from multiple Parquet files into a
single file. Usually I'm dealing with 50 files of size 100k and trying to form
a single Parquet file. The code looks something like this:
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

dfs = []
full_schema = None
for s3_url in s3_urls:
    table = ds.dataset(s3_url, format="parquet").to_table()
    dfs.append(table.to_pandas(safe=False))
    # we keep merging any new columns that appear in the parquet file
    full_schema = merge_schema(full_schema, table.schema)
df = pd.concat(dfs)
df.drop_duplicates(inplace=True, subset=["id"])  # drop any duplicates
pa.Table.from_pandas(df, nthreads=1, schema=full_schema)
All the above code does is read the files from S3, convert each one to a
Table, record its schema, convert the Table to a DataFrame, and then
concatenate the DataFrames.
The problem I notice is that one of the columns is a timestamp field, and while
converting back from the pandas DataFrame to an Arrow Table I encounter the
following error:
"Could not convert Timestamp('2023-02-12 18:19:25+0000', tz='UTC') with type
Timestamp: tried to convert to int64", 'Conversion failed for column
datetime_end_time_199 with type object'
From my understanding of Parquet, Timestamp is a logical data type while the
underlying primitive is still int64. In this case, why is the column being cast
to an object? What am I missing here?
Any help is really appreciated. Thanks
~
Karthik