[ 
https://issues.apache.org/jira/browse/ARROW-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209706#comment-16209706
 ] 

Phillip Cloud commented on ARROW-1588:
--------------------------------------

Here's the state of the in memory decimal format on the C++ side of things.

Right now, arrow's in memory decimal format is actually a mixture of both big 
and little endian (when the platform is little endian, when the platform is big 
endian it's just big endian).

The {{Decimal128}} class that holds the upper and lower 64 bits of a 2's 
complement 128 bit integer is laid out like this (ignoring methods)

{code}
class Decimal128 {
    int64_t upper;
    uint64_t lower;
};
{code}

When we convert this to a C array of bytes (with the {{ToBytes()}}) method we 
do the following (paraphrasing):

{code}
const uint64_t big_bytes[] = {upper, lower};
const uint8_t* raw_bytes = reinterpret_cast<const uint8_t*>(big_bytes);
{code}

What this means is that groups of 8 bytes are in big-endian order, and within 
each group of 8 bytes they are platform native.

This was an oversight on my part when writing {{ToBytes()}}, that I will 
rectify. My apologies for not mentioning this earlier.

The good news is that:

# This won't require much effort to fix.
# The work I've already done to make parquet read/write work will only require 
a one or two line change to make this work regardless of whether we choose big 
or little endian.

We still need to make a choice about which byte order to use.

Something to note:

When operating on arrow {{DecimalArray}} bytes, we have to convert each group 
of 16 bytes to {{Decimal128}} before doing any operations like addition, 
multiplication, etc. To keep performance snappy we need to spend as little time 
as possible converting bytes to {{int64_t}}/{{uint64_t}}. If bytes are passed 
to {{Decimal128}} as little endian then we elide a byteswap operation and 
simply convert each 8 byte chunk to the respective types of the upper and lower 
bits. If they're passed as big endian (as in the case of parquet) then we have 
to do some work to convert them to little endian. For this reason, I think we 
should choose little endian byte order. Of course, systems that are big endian 
take a small hit reading arrow decimal arrays.

Is there someone from the Java side that can clarify what the current in-memory 
layout for the Java equivalent of C++'s {{DecimalArray}}? Are bytes in little 
endian order?

cc [~wesmckinn] [~jnadeau]

> [C++/Format] Harden Decimal Format
> ----------------------------------
>
>                 Key: ARROW-1588
>                 URL: https://issues.apache.org/jira/browse/ARROW-1588
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Format
>    Affects Versions: 0.7.0
>            Reporter: Phillip Cloud
>            Assignee: Phillip Cloud
>             Fix For: 0.8.0
>
>
> We should finalize and harden the decimal format. The remaining issues are 
> officially writing down the choice of making every decimal value 16 bytes and 
> byte order.
> For byte order we'll need to run some benchmarks to compare little endian vs 
> big endian. I plan to work on this over the next week or two.
> [~jacq...@dremio.com] [~wesmckinn] If there are any additional items you'd 
> like to see addressed here please chime in. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to