Hi,
   I have a simple use case: merging data from multiple parquet files into
a single file. Usually I'm dealing with about 50 files of size 100k each,
which I combine into a single parquet file. The code looks something like this:

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

dfs = []
full_schema = None
for s3_url in s3_urls:
  table = ds.dataset(s3_url, format="parquet").to_table()
  dfs.append(table.to_pandas(safe=False))
  ## we keep merging in any new columns that appear in each parquet file
  full_schema = merge_schema(full_schema, table.schema)
df = pd.concat(dfs)
df.drop_duplicates(inplace=True, subset=["id"])  ## drop any duplicates
pa.Table.from_pandas(df, nthreads=1, schema=full_schema)
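
For completeness, merge_schema is our own helper and isn't shown above.
Roughly, it unions the field lists of the schemas seen so far, along the
lines of this sketch built on pyarrow's built-in unify_schemas (this is
just an approximation, not the exact implementation):

import pyarrow as pa

def merge_schema(full_schema, new_schema):
    # Rough sketch of our helper: start from the first file's schema,
    # then union in any new fields that appear in later files
    if full_schema is None:
        return new_schema
    return pa.unify_schemas([full_schema, new_schema])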

All the above code does is read the files from S3, convert each one to an
Arrow Table, merge its schema into the combined schema, convert the table
to a DataFrame, and then concatenate the DataFrames.
The problem I notice is that one of the columns is a timestamp field, and
while converting back from the pandas DataFrame to an Arrow Table I
encounter the following error:

"Could not convert Timestamp('2023-02-12 18:19:25+0000', tz='UTC') with
type Timestamp: tried to convert to int64", 'Conversion failed for column
datetime_end_time_199 with type object'
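
I can reproduce an error of the same shape in isolation when the pandas
column is dtype=object and the target type in the schema is int64. To be
clear, this is just my guess at a minimal repro; the int64 field here is
hypothetical and may not be what my merged schema actually contains:

import pandas as pd
import pyarrow as pa

# Force the column to dtype=object holding a tz-aware Timestamp,
# which is how the column seems to look after my concat step
df = pd.DataFrame(
    {"ts": pd.Series([pd.Timestamp("2023-02-12 18:19:25", tz="UTC")],
                     dtype=object)}
)
# Hypothetical target schema declaring the column as int64; this raises
# the same "tried to convert to int64" ArrowInvalid error
pa.Table.from_pandas(df, schema=pa.schema([pa.field("ts", pa.int64())]))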

From my understanding of parquet, Timestamp is a logical type while the
underlying primitive is still int64. In that case, why is the column being
cast to object? What am I missing here?
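
One possible factor: not every file contains every column (that's why we
merge schemas), so pd.concat fills the missing rows with NaN. If I
understand pandas' upcasting right, combining a tz-aware datetime column
with an all-NaN float column can produce dtype=object:

import pandas as pd

a = pd.DataFrame({"id": [1],
                  "ts": [pd.Timestamp("2023-02-12 18:19:25", tz="UTC")]})
b = pd.DataFrame({"id": [2]})  # a file where the "ts" column is absent
merged = pd.concat([a, b])
# Depending on the pandas version, the combined column can come out as
# dtype=object rather than datetime64[ns, UTC]
print(merged["ts"].dtype)
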
Any help is really appreciated. Thanks

~
Karthik
