Pandas by default treats timestamps as Python objects. to_pandas() has the option below.
date_as_object : bool, default True
    Cast dates to objects. If False, convert to datetime64 dtype with the equivalent time unit (if supported). Note: in pandas version < 2.0, only datetime64[ns] conversion is supported.

But using pyarrow.parquet for this task is pretty instantaneous with zero memory overhead, since you're just copying chunks from one file and appending them to the end of another:
https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing

________________________________
From: Lee, David (ITE) <[email protected]>
Sent: Thursday, October 24, 2024 7:55:59 AM
To: [email protected] <[email protected]>; Karthik Deivasigamani <[email protected]>
Subject: Re: PyArrow <-> Pandas Timestamp Conversion Error

There isn't really a need to read parquet into Arrow tables. You can just use pyarrow to read row groups from the smaller files and write them to a new file using pyarrow.parquet.ParquetFile.

________________________________
From: Karthik Deivasigamani via user <[email protected]>
Sent: Wednesday, October 9, 2024 6:39:59 AM
To: [email protected] <[email protected]>
Subject: PyArrow <-> Pandas Timestamp Conversion Error

Hi,
I have a simple use case of merging data from multiple parquet files into a single file. Usually I'm dealing with 50 files of about 100k rows each and trying to form a single parquet file. The code looks something like this:

dfs = []
full_schema = None
for s3_url in s3_urls:
    table = ds.dataset(s3_url, format="parquet").to_table()
    dfs.append(table.to_pandas(safe=False))
    full_schema = merge_schema(full_schema, table.schema)  ## we keep merging any new columns that appear in the parquet file
df = pd.concat(dfs)
df.drop_duplicates(inplace=True, subset=["id"])  ## drop any duplicates
pa.Table.from_pandas(df, nthreads=1, schema=table.schema)

All the above code does is read files from S3, convert them to a Table, get the schema, convert the table to a dataframe, and then concat the dataframes. The problem I notice is that one of the columns is a timestamp field, and while converting back from the pandas dataframe to an Arrow Table I encounter the following error:

"Could not convert Timestamp('2023-02-12 18:19:25+0000', tz='UTC') with type Timestamp: tried to convert to int64", 'Conversion failed for column datetime_end_time_199 with type object'

From my understanding of parquet, Timestamp is a logical datatype while the underlying primitive is still int64. In this case why is the column being cast to an object? What am I missing here? Any help is really appreciated.

Thanks
~ Karthik
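A minimal sketch of the row-group copy approach suggested above, assuming every input file already shares a compatible schema; the file names are placeholders, and the schema-merging step from the original question is not handled here:

import pyarrow.parquet as pq

input_paths = ["part-0.parquet", "part-1.parquet"]  # placeholder paths
output_path = "merged.parquet"

writer = None
for path in input_paths:
    pf = pq.ParquetFile(path)
    # Copy each row group into the output file; nothing is converted to pandas,
    # so only one row group is materialized in memory at a time.
    for i in range(pf.num_row_groups):
        row_group = pf.read_row_group(i)  # returns a pyarrow.Table
        if writer is None:
            writer = pq.ParquetWriter(output_path, row_group.schema)
        writer.write_table(row_group)
if writer is not None:
    writer.close()

Because the data never round-trips through pandas, timestamp columns keep their Arrow/Parquet types and the object-dtype conversion error never arises.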
