[jira] [Commented] (ARROW-8545) Allow fast writing of Decimal column to parquet

Jacek Pliszka (Jira) Tue, 21 Apr 2020 12:45:08 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089011#comment-17089011
 ]


Jacek Pliszka commented on ARROW-8545:
--------------------------------------

OK, I checked and This is my version:

 
{code:java}
pat = pa.Table.from_pandas(df)
t3 = time()print(t3-t2)
pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3)))
t4 = time()
print(t4 - t3)
pq.write_table(pat, '/tmp/testabd.pq')
t5 = time()
print(t5 - t4)
{code}
And we are getting here

A) 0.3s for conversion from pandas to arrow Table

B) cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast 
implemented from double to decimal(38, 3)

C) 2.8s for writing table to parquet file - is it fast enough for you

 

B and C are separate topics and should have separate issues. In B decimal128 
should be easier if this is enough for you

 

 

> Allow fast writing of Decimal column to parquet
> -----------------------------------------------
>
>                 Key: ARROW-8545
>                 URL: https://issues.apache.org/jira/browse/ARROW-8545
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.17.0
>            Reporter: Fons de Leeuw
>            Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only 
> possibility is to use the `decimal.Decimal` standard-libary type. This is 
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite 
> impressive given that [fastparquet does not write decimals|#data-types]] at 
> all. However, the writing is *very* slow, in the code snippet below a factor 
> of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can 
> be made faster, but I am not familiar enough with pandas internals to know if 
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice, if a warning is shown that object-typed columns are being 
> converted which is very slow. That would at least make this behavior more 
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it 
> would be nice if a workaround is possible. For example, pass an int and then 
> shift the dot "x" places to the left. (It is already possible to pass an int 
> column and specify "decimal" dtype in the Arrow schema during 
> `pa.Table.from_pandas()` but then it simply becomes a decimal without 
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte 
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with 
> latitude/longitude. I can't use float as comparisons need to be exact, and 
> the BigQuery "clustering" feature needs either an integer or a decimal but 
> not a float. In the meantime, I have to do a workaround where I use only ints 
> (the original number multiplied by 1000.)
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal 
> column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 
> 17.673s{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-8545) Allow fast writing of Decimal column to parquet

Reply via email to