Pandas by default treats timestamps as Python objects. to_pandas() has the option below.
date_as_object : bool, default True
    Cast dates to objects. If False, convert to datetime64 dtype with the equivalent time unit (if supported). Note: in pandas version < 2.0, only datetime64[ns] conversion is supported.

But using pyarrow.parquet for this task is pretty instantaneous with zero memory overhead, since you're just copying chunks from one file and appending them to the end of another:
https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing

________________________________
From: Lee, David (ITE) <[email protected]>
Sent: Thursday, October 24, 2024 7:55:59 AM
To: [email protected] <[email protected]>; Karthik Deivasigamani <[email protected]>
Subject: Re: PyArrow <-> Pandas Timestamp Conversion Error

There isn't really a need to read parquet into Arrow tables. You can just use pyarrow to read row groups from the smaller files and write them to a new file using pyarrow.parquet.ParquetFile.

________________________________
From: Karthik Deivasigamani via user <[email protected]>
Sent: Wednesday, October 9, 2024 6:39:59 AM
To: [email protected] <[email protected]>
Subject: PyArrow <-> Pandas Timestamp Conversion Error

Hi,
I have a simple use case of merging data from multiple parquet files into a single file. Usually I'm dealing with 50 files of about 100k rows each and trying to form a single parquet file. The code looks something like this:

dfs = []
full_schema = None
for s3_url in s3_urls:
    table = ds.dataset(s3_url, format="parquet").to_table()
    dfs.append(table.to_pandas(safe=False))
    full_schema = merge_schema(full_schema, table.schema)  ## we keep merging any new columns that appear in the parquet file
df = pd.concat(dfs)
df.drop_duplicates(inplace=True, subset=["id"])  ## drop any duplicates
pa.Table.from_pandas(df, nthreads=1, schema=table.schema)

All the above code does is read files from S3, convert them to a Table, get the schema, convert the table to a dataframe, and then concat the dataframes. The problem I notice is that one of the columns is a timestamp field, and while converting back from the pandas dataframe to an Arrow Table I encounter the following error:

"Could not convert Timestamp('2023-02-12 18:19:25+0000', tz='UTC') with type Timestamp: tried to convert to int64", 'Conversion failed for column datetime_end_time_199 with type object'

From my understanding of parquet, Timestamp is a logical datatype while the underlying primitive is still int64. In this case why is the column being cast to an object? What am I missing here? Any help is really appreciated.

Thanks
~ Karthik
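A minimal sketch of the row-group copy approach suggested above, assuming every input file already shares a compatible schema; the file names are placeholders, and the schema-merging step from the original question is not handled here:

import pyarrow.parquet as pq

input_paths = ["part-0.parquet", "part-1.parquet"]  # placeholder paths
output_path = "merged.parquet"

writer = None
for path in input_paths:
    pf = pq.ParquetFile(path)
    # Copy each row group into the output file; nothing is converted to pandas,
    # so only one row group is materialized in memory at a time.
    for i in range(pf.num_row_groups):
        row_group = pf.read_row_group(i)  # returns a pyarrow.Table
        if writer is None:
            writer = pq.ParquetWriter(output_path, row_group.schema)
        writer.write_table(row_group)
if writer is not None:
    writer.close()

Because the data never round-trips through pandas, timestamp columns keep their Arrow/Parquet types and the object-dtype conversion error never arises.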
