[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-8545:
-----------------------------------
    Summary: [Python] Allow fast writing of Decimal column to parquet  (was: Allow fast writing of Decimal column to parquet)

> [Python] Allow fast writing of Decimal column to parquet
> ---------------------------------------------------------
>
>                 Key: ARROW-8545
>                 URL: https://issues.apache.org/jira/browse/ARROW-8545
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.17.0
>            Reporter: Fons de Leeuw
>            Priority: Minor
>
> Currently, when one wants to use a decimal datatype in pandas, the only possibility is the `decimal.Decimal` standard-library type. The column then has dtype "object" in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite impressive given that [fastparquet does not write decimals|#data-types] at all. However, the writing is *very* slow: in the code snippet below, by about a factor of 4.
> *Improvements*
> The best outcome would of course be to make the conversion of a decimal column faster, but I am not familiar enough with the pandas internals to know whether that is possible. (The same behavior also applies to `.to_pickle` etc.)
> It would also be nice if a warning were shown when object-typed columns are being converted, since that conversion is very slow. That would at least make the behavior more explicit.
> If fast conversion of a `decimal.Decimal` object column is not possible, a workaround would be welcome. For example, pass an int column and then shift the dot "x" places to the left. (It is already possible to pass an int column and specify a decimal dtype in the Arrow schema during `pa.Table.from_pandas()`, but then the value simply becomes a decimal with no fractional digits.) It might also be possible to encode the value as a 128-bit byte string in the pandas column and have Arrow interpret it directly.
> *Use case*
> I need to save large dataframes (~10 GB) of geospatial data with latitude/longitude columns. I cannot use floats, as comparisons need to be exact, and the BigQuery "clustering" feature accepts an integer or a decimal but not a float. In the meantime, I use a workaround that stores only ints (the original numbers multiplied by 1000); a sketch of this is included after the snippet below.
> *Snippet*
> {code:python}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
>
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
>
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
>
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
>
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
>
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 17.673s
> {code}
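>
> For reference, here is a minimal sketch of the scaled-integer workaround mentioned under *Use case*; the column name, file path, and scale of 3 decimal places are illustrative choices rather than anything prescribed by Arrow:
> {code:python}
> import numpy as np
> import pandas as pd
>
> SCALE = 10 ** 3  # three decimal places, matching the rounding in the snippet above
>
> df = pd.DataFrame({"lat": np.random.rand(int(1E7)) * 100})
>
> # Write: store the coordinate as a scaled int64 instead of decimal.Decimal,
> # so pyarrow uses its fast numeric path rather than converting Python objects one by one.
> df["lat"] = (df["lat"] * SCALE).round().astype("int64")
> df.to_parquet("/tmp/test_lat_int.pq", engine="pyarrow")
>
> # Read: shift the dot back. Comparisons on the stored int64 stay exact;
> # only this final division reintroduces floating point.
> restored = pd.read_parquet("/tmp/test_lat_int.pq")
> restored["lat"] = restored["lat"] / SCALE
> {code}
> This keeps comparisons exact and still works with BigQuery clustering (which accepts integers), but the stored values are scaled, so every consumer of the file has to know the scale; a fast decimal conversion in Arrow would remove that burden.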