Hi Wes,

the data is indeed not originating from Arrow, so I was looking for how to call 
the low-level WriteBatch API. I've figured it out now; it's actually 
straightforward with the Arrow API, I just got a little confused by the spec at 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#DECIMAL

So, for future reference: I multiply each value in the floating-point array by 
pow(10, scale) and pass the resulting array (in my case, int32_t) directly to 
WriteBatch().
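
In code, the gist is roughly the following (a simplified, untested sketch;
"writer" stands for a parquet::Int32Writer* pointing at a flat, required INT32
column annotated as DECIMAL(precision, scale), and the helper name is my own):

#include <cmath>
#include <cstdint>
#include <vector>

#include <parquet/api/writer.h>

// Scale doubles into the unscaled int32 representation that a
// DECIMAL(precision, scale) INT32 column stores, then hand them to
// WriteBatch().
void WriteDecimalBatch(parquet::Int32Writer* writer,
                       const std::vector<double>& values, int scale) {
  const double factor = std::pow(10.0, scale);
  std::vector<int32_t> unscaled(values.size());
  for (size_t i = 0; i < values.size(); ++i) {
    // Round instead of truncating to avoid binary floating-point artifacts.
    unscaled[i] = static_cast<int32_t>(std::llround(values[i] * factor));
  }
  // def/rep levels can be null because the column is flat and required.
  writer->WriteBatch(static_cast<int64_t>(unscaled.size()), nullptr, nullptr,
                     unscaled.data());
}

(For larger precisions the same idea should apply with int64_t and the INT64
physical type.)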

One thing I can imagine that would make the API a little easier to use: 
provide a function that directly takes an array of floats or doubles and does 
the conversion internally. But it's not really needed, so it's probably not 
worth adding.

Thanks for your help, and sorry for the trouble,
Roman


-----Original Message-----
From: Wes McKinney <wesmck...@gmail.com> 
Sent: Tuesday, October 29, 2019 16:19
To: dev <dev@arrow.apache.org>
Subject: Re: State of decimal support in Arrow (from/to Parquet Decimal 
LogicalType)

It depends on the origin of your data.

If your data does not originate from Arrow, then it may be better to produce an 
array of FixedLenByteArray and pass that to the low-level WriteBatch API. If 
you would like some other API, please feel free to propose something.
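
Untested sketch of what I mean (it assumes each value is already encoded as
big-endian two's complement with the column's declared byte width):

#include <cstdint>
#include <vector>

#include <parquet/api/writer.h>

// Wrap pre-encoded big-endian decimal bytes in FixedLenByteArray structs and
// pass them to the low-level writer. `bytes` must hold num_values * byte_width
// bytes and stay alive until WriteBatch returns, since FLBA does not copy.
void WriteDecimalFLBA(parquet::FixedLenByteArrayWriter* writer,
                      const uint8_t* bytes, int64_t num_values,
                      int byte_width) {
  std::vector<parquet::FixedLenByteArray> flba(num_values);
  for (int64_t i = 0; i < num_values; ++i) {
    // FLBA stores only a pointer; the width comes from the column's
    // declared type_length.
    flba[i].ptr = bytes + i * byte_width;
  }
  writer->WriteBatch(num_values, nullptr, nullptr, flba.data());
}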

On Tue, Oct 29, 2019 at 10:13 AM <roman.karlstet...@gmail.com> wrote:
>
> Hi Wes,
>
> that was a bit unclear, sorry about that. By "an array", I'm referring to a 
> plain C++ array, i.e. an array of float, uint32_t, ...
> This means that I do not use the arrow::Array-based write API; I use the 
> TypedColumnWriter::WriteBatch() function directly and do not have any Arrow 
> arrays. Are there any advantages to using arrow::Arrays instead of calling 
> WriteBatch() directly?
>
> Thanks,
> Roman
>
> -----Original Message-----
> From: Wes McKinney <wesmck...@gmail.com>
> Sent: Tuesday, October 29, 2019 15:59
> To: dev <dev@arrow.apache.org>
> Subject: Re: State of decimal support in Arrow (from/to Parquet 
> Decimal LogicalType)
>
> On Tue, Oct 29, 2019 at 3:11 AM <roman.karlstet...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > thanks for the response. There's one thing that is still a little unclear 
> > to me:
> > I had a look at the code of the function WriteArrowSerialize<FLBAType, 
> > arrow::Decimal128Type> in the reference you provided. I don't have Arrow 
> > data in the first place, but as I understand it, I need an array of 
> > FixedLenByteArray objects which then point to the actual decimal values in 
> > the big_endian_values buffer. Is this the only way to write decimal types, 
> > or is it also possible to provide an array of values directly to 
> > WriteBatch()?
> >
>
> Could you clarify what you mean by "an array"? If you use the 
> arrow::Array-based write API then it will invoke this serializer 
> specialization
>
> https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497ca585ae1c/cpp/src/parquet/column_writer.cc#L1569
>
> That's what we're calling (if I'm not mistaken, since I just worked on 
> this code recently) when writing an arrow::Decimal128Array. If you set a 
> breakpoint there with gdb, you can see the call stack.
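>
> For example, with a debug build:
>
>   (gdb) break column_writer.cc:1569
>   (gdb) run
>   (gdb) backtrace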
>
> > Regarding the issues, I also found 
> > https://issues.apache.org/jira/browse/ARROW-6990, but I'm not sure whether 
> > it is related to the issues you created.
> >
> > Thanks,
> > Roman
> >
> > -----Original Message-----
> > From: Wes McKinney <wesmck...@gmail.com>
> > Sent: Monday, October 28, 2019 21:11
> > To: dev <dev@arrow.apache.org>
> > Subject: Re: State of decimal support in Arrow (from/to Parquet 
> > Decimal LogicalType)
> >
> > hi Roman,
> >
> > On Mon, Oct 28, 2019 at 5:56 AM <roman.karlstet...@gmail.com> wrote:
> > >
> > > Hi everyone,
> > >
> > > I have a question about the state of decimal support in Arrow when 
> > > reading from/writing to Parquet.
> > >
> > > *       Is writing decimals to Parquet supposed to work? Are there any
> > > examples of how to do this in C++?
> >
> > Yes, it's supported; the details are here
> >
> > https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497ca585ae1c/cpp/src/parquet/column_writer.cc#L1511
> >
> > > *       When reading decimals from a Parquet file with pyarrow and
> > > converting the resulting table to a pandas DataFrame, the datatype of 
> > > the cells is "object". As a consequence, performance when doing 
> > > analysis on this table is suboptimal. Can I somehow get the decimals 
> > > from the Parquet file directly into floats/doubles in a pandas 
> > > DataFrame?
> >
> > Some work will be required. The cleanest way would be to cast
> > decimal128 columns to float32/float64 prior to converting to pandas.
> >
> > I didn't see an issue for this right away, so I opened
> >
> > https://issues.apache.org/jira/browse/ARROW-7010
> >
> > I also opened
> >
> > https://issues.apache.org/jira/browse/ARROW-7011
> >
> > about going the other way. This would be a useful thing to contribute to 
> > the project.
> >
> > Thanks
> > Wes
> >
> > >
> > >
> > > Thanks in advance,
> > >
> > > Roman