Hi Wes,

That was a bit unclear, sorry about that. With "an array", I'm referring
to a plain C++ array, i.e. an array of float, uint32_t, ...
This means that I do not use the arrow::Array-based write API; I call the
TypedColumnWriter::WriteBatch() function directly and never have any Arrow
arrays. Are there any advantages to building arrow::Arrays and using the
Arrow-based write API instead of calling WriteBatch() directly?
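
For concreteness, here is a minimal sketch of my write path (file name,
column name and values are just placeholders; the column is REQUIRED, so
no definition/repetition levels are needed):

#include <arrow/io/file.h>
#include <parquet/api/writer.h>

#include <memory>
#include <vector>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // Single required FLOAT column.
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make(
      "value", parquet::Repetition::REQUIRED, parquet::Type::FLOAT));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  std::shared_ptr<arrow::io::FileOutputStream> out;
  PARQUET_THROW_NOT_OK(
      arrow::io::FileOutputStream::Open("floats.parquet", &out));

  auto file_writer = parquet::ParquetFileWriter::Open(out, schema);
  auto* rg_writer = file_writer->AppendRowGroup();
  auto* float_writer =
      static_cast<parquet::FloatWriter*>(rg_writer->NextColumn());

  // The "plain C++ array": raw float values, no arrow::Array anywhere.
  std::vector<float> values = {1.0f, 2.0f, 3.0f};
  float_writer->WriteBatch(values.size(), /*def_levels=*/nullptr,
                           /*rep_levels=*/nullptr, values.data());
  file_writer->Close();
  return 0;
}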

Thanks,
Roman

-----Original Message-----
From: Wes McKinney <wesmck...@gmail.com> 
Sent: Tuesday, October 29, 2019 15:59
To: dev <dev@arrow.apache.org>
Subject: Re: State of decimal support in Arrow (from/to Parquet Decimal
LogicalType)

On Tue, Oct 29, 2019 at 3:11 AM <roman.karlstet...@gmail.com> wrote:
>
> Hi Wes,
>
> Thanks for the response. There's one thing that is still a little
> unclear to me: I had a look at the code for the function
> WriteArrowSerialize<FLBAType, arrow::Decimal128Type> in the reference you
> provided. I don't have Arrow data in the first place, but as I understand
> it, I need an array of FixedLenByteArray objects which then point to the
> actual decimal values in the big_endian_values buffer. Is this the only
> way to write decimal types, or is it also possible to provide an array of
> values directly to WriteBatch()?
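>
> To make that concrete, here is the rough sketch I have in mind (column
> name, precision/scale, byte width and the two example values are just
> my guesses; file and row-group setup as in the float example above):
>
> // Schema node, roughly (a 16-byte FLBA carrying a decimal):
> //   PrimitiveNode::Make("price", Repetition::REQUIRED,
> //                       Type::FIXED_LEN_BYTE_ARRAY,
> //                       ConvertedType::DECIMAL,
> //                       /*length=*/16, /*precision=*/38, /*scale=*/4);
> #include <parquet/api/writer.h>
> #include <cstdint>
> #include <vector>
>
> void WriteDecimals(parquet::RowGroupWriter* rg_writer) {
>   constexpr int kByteWidth = 16;  // decimal128 = 16-byte big-endian value
>
>   // Unscaled values 1 and 2 as big-endian two's complement, stored
>   // contiguously in a caller-owned buffer.
>   std::vector<uint8_t> big_endian_values(2 * kByteWidth, 0);
>   big_endian_values[kByteWidth - 1] = 1;
>   big_endian_values[2 * kByteWidth - 1] = 2;
>
>   // Each FixedLenByteArray only points into the buffer; it owns nothing.
>   std::vector<parquet::FixedLenByteArray> flba_values = {
>       parquet::FixedLenByteArray(&big_endian_values[0]),
>       parquet::FixedLenByteArray(&big_endian_values[kByteWidth])};
>
>   // REQUIRED column, so no definition/repetition levels.
>   auto* flba_writer = static_cast<parquet::FixedLenByteArrayWriter*>(
>       rg_writer->NextColumn());
>   flba_writer->WriteBatch(flba_values.size(), nullptr, nullptr,
>                           flba_values.data());
> }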
>

Could you clarify what you mean by "an array"? If you use the
arrow::Array-based write API, it will invoke this serializer
specialization:

https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497ca585ae1c/cpp/src/parquet/column_writer.cc#L1569

That's what gets called (if I'm not mistaken; I worked on this code
recently) when writing an arrow::Decimal128Array. If you set a breakpoint
there with gdb, you can see the call stack.
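
In case it helps, here is a minimal sketch of that arrow::Array-based
path (file name, column name and precision/scale are made up):

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <parquet/exception.h>

int main() {
  auto type = arrow::decimal(/*precision=*/38, /*scale=*/4);

  // Unscaled integer 12345 with scale 4, i.e. the decimal value 1.2345.
  arrow::Decimal128Builder builder(type);
  PARQUET_THROW_NOT_OK(builder.Append(arrow::Decimal128(12345)));
  std::shared_ptr<arrow::Array> array;
  PARQUET_THROW_NOT_OK(builder.Finish(&array));

  auto table = arrow::Table::Make(
      arrow::schema({arrow::field("price", type)}), {array});

  std::shared_ptr<arrow::io::FileOutputStream> out;
  PARQUET_THROW_NOT_OK(
      arrow::io::FileOutputStream::Open("decimals.parquet", &out));

  // WriteTable routes the Decimal128Array into the serializer linked
  // above; e.g. run under gdb and `break column_writer.cc:1569`.
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), out, /*chunk_size=*/1024));
  return 0;
}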

> Regarding the issues: I also found
> https://issues.apache.org/jira/browse/ARROW-6990, but I'm not sure
> whether it is related to the issues you created.
>
> Thanks,
> Roman
>
> -----Original Message-----
> From: Wes McKinney <wesmck...@gmail.com>
> Sent: Monday, October 28, 2019 21:11
> To: dev <dev@arrow.apache.org>
> Subject: Re: State of decimal support in Arrow (from/to Parquet
> Decimal LogicalType)
>
> hi Roman,
>
> On Mon, Oct 28, 2019 at 5:56 AM <roman.karlstet...@gmail.com> wrote:
> >
> > Hi everyone,
> >
> >
> >
> > I have a question about the state of decimal support in Arrow when 
> > reading from/writing to Parquet.
> >
> > * Is writing decimals to Parquet supposed to work? Are there any
> >   examples of how to do this in C++?
>
> Yes, it's supported; the details are here:
>
> https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497ca585ae1c/cpp/src/parquet/column_writer.cc#L1511
>
> > * When reading decimals from a Parquet file with pyarrow and converting
> >   the resulting table to a pandas DataFrame, the datatype of the cells
> >   is "object". As a consequence, performance when doing analysis on
> >   this table is suboptimal. Can I somehow get the decimals from the
> >   Parquet file directly into floats/doubles in a pandas DataFrame?
>
> Some work will be required. The cleanest way would be to cast
> decimal128 columns to float32/float64 prior to converting to pandas.
>
> I didn't see an issue for this right away so I opened
>
> https://issues.apache.org/jira/browse/ARROW-7010
>
> I also opened
>
> https://issues.apache.org/jira/browse/ARROW-7011
>
> about going the other way. This would be a useful thing to contribute to the 
> project.
>
> Thanks
> Wes
>
> >
> >
> > Thanks in advance,
> >
> > Roman
