[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089011#comment-17089011 ]
Jacek Pliszka commented on ARROW-8545:
--------------------------------------

OK, I checked and this is my version:
{code:java}
# Continues from the snippet in the issue (reuses df, t2 and time).
import pyarrow as pa
import pyarrow.parquet as pq

pat = pa.Table.from_pandas(df)
t3 = time()
print(t3 - t2)
# set_column returns a new table, so the result has to be assigned.
pat = pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3)))
t4 = time()
print(t4 - t3)
pq.write_table(pat, '/tmp/testabd.pq')
t5 = time()
print(t5 - t4)
{code}
And we are getting here:

A) 0.3s for conversion from pandas to arrow Table
B) cast to decimal fails:
pyarrow.lib.ArrowNotImplementedError: No cast implemented from double to decimal(38, 3)
C) 2.8s for writing table to parquet file - is it fast enough for you?

B and C are separate topics and should have separate issues.

In B decimal128 should be easier if this is enough for you.

> Allow fast writing of Decimal column to parquet
> -----------------------------------------------
>
>                 Key: ARROW-8545
>                 URL: https://issues.apache.org/jira/browse/ARROW-8545
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.17.0
>            Reporter: Fons de Leeuw
>            Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only
> possibility is to use the `decimal.Decimal` standard-library type. This is
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite
> impressive given that [fastparquet does not write decimals|#data-types] at
> all. However, the writing is *very* slow: in the code snippet below it is
> slower by a factor of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can
> be made faster, but I am not familiar enough with pandas internals to know if
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice if a warning were shown that object-typed columns are being
> converted, which is very slow. That would at least make this behavior more
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it
> would be nice if a workaround is possible. For example, pass an int and then
> shift the dot "x" places to the left. (It is already possible to pass an int
> column and specify "decimal" dtype in the Arrow schema during
> `pa.Table.from_pandas()` but then it simply becomes a decimal without
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with
> latitude/longitude. I can't use float as comparisons need to be exact, and
> the BigQuery "clustering" feature needs either an integer or a decimal but
> not a float. In the meantime, I have to do a workaround where I use only ints
> (the original number multiplied by 1000.)
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 17.673s
> {code}
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
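
For reference, the integer workaround described in the issue ("pass an int and then shift the dot 'x' places to the left") can be sketched with the standard library alone. The helper names `to_scaled_int` and `from_scaled_int` and the fixed scale of 3 are assumptions made for illustration; they are not part of Arrow or pandas. The sketch only shows the exact round trip between scaled integers and `decimal.Decimal`, not how Arrow would consume such a column.

```python
from decimal import Decimal

# Assumed scale: 3 digits after the point, matching the
# "original number multiplied by 1000" workaround in the issue.
SCALE = 3


def to_scaled_int(value: str, scale: int = SCALE) -> int:
    """Encode a decimal string as an int by shifting the point `scale` places right."""
    shifted = Decimal(value).scaleb(scale)
    # Refuse values that carry more fractional digits than the scale can hold.
    if shifted != shifted.to_integral_value():
        raise ValueError(f"{value!r} has more than {scale} fractional digits")
    return int(shifted)


def from_scaled_int(value: int, scale: int = SCALE) -> Decimal:
    """Decode a scaled int by shifting the point `scale` places left."""
    return Decimal(value).scaleb(-scale)


if __name__ == "__main__":
    encoded = to_scaled_int("52.370")
    print(encoded, from_scaled_int(encoded))
```

Because both directions work on integers and `Decimal` exponents, no float rounding is involved, which is the property the exact-comparison use case depends on.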