[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-08-02 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110990#comment-16110990
 ] 

Phillip Cloud commented on ARROW-786:
-

I'm going to try to get to this 

> [Format] In-memory format for 128-bit Decimals, handling of sign bit
> 
>
> Key: ARROW-786
> URL: https://issues.apache.org/jira/browse/ARROW-786
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
> Fix For: 0.7.0
>
>
> cc [~cpcloud]
> We found in ARROW-655 that we needed to add an extra bit for signedness for 
> decimals stored as 128-bit values to be able to use the Boost multiprecision 
> libraries. This makes Decimal128 not fit completely neatly as a 16-byte fixed 
> size binary value, and more of a {{struct fixed_size_binary(16)>}}. What is the current format in the Java 
> implementation? We will need to document the memory layout for decimals that 
> maximizes compatibility across languages and eventually implement integration 
> tests for IPC. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-08-01 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110206#comment-16110206
 ] 

Wes McKinney commented on ARROW-786:


It doesn't look like this will be resolved in 0.6.0; moving to the next release.



[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-07-26 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102037#comment-16102037
 ] 

Wes McKinney commented on ARROW-786:


It seems like we might end up having to build a from-scratch implementation of 
Java's BigDecimal in C++. It might be worth it, but it's also a lot of work. 
The JDK source code is not ASF-friendly, so we would have to start from scratch, 
working from a mathematical reference.



[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-07-26 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101976#comment-16101976
 ] 

Phillip Cloud commented on ARROW-786:
-

That should be possible.

However, because boost uses 128 magnitude bits plus a sign bit, going from 
arrow-cpp to Java won't be possible in every case: the boost representation's 
values range from {{-(2 ** 128 - 1)..2 ** 128 - 1}}, whereas the Java 
implementation's values range from {{-2 ** 127..2 ** 127 - 1}}.

The more I think about this and reread the boost multiprecision docs, the more 
I think we should just implement our own very small wrapper around native types.

Boost multiprecision has some optimizations that arrow doesn't care about, such 
as the following, which increase implementation complexity at best and hurt 
performance at worst:

{code}
When used at fixed precision, the size of this type is always one machine word 
larger than you would expect for an N-bit integer: the extra word stores both 
the sign, and how many machine words in the integer are actually in use.
{code}

That, plus the complexity of having two signed integer representations, is 
enough to make me want to try jettisoning boost multiprecision.
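
The range mismatch described above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of arrow-cpp or Boost; the sketch assumes a sign flag plus a 128-bit magnitude on one side and the 16-byte big-endian two's-complement layout mentioned elsewhere in this thread on the other.

```python
from typing import Tuple

MAG_BITS = 128  # assumed magnitude width of the sign-magnitude form


def sign_magnitude_to_twos_complement(negative: bool, magnitude: int) -> bytes:
    """Pack (sign flag, magnitude) into 16 big-endian two's-complement bytes.

    Hypothetical helper: raises OverflowError when the value lies outside
    [-2**127, 2**127 - 1] and so has no 16-byte signed encoding.
    """
    if not 0 <= magnitude < (1 << MAG_BITS):
        raise ValueError("magnitude does not fit in 128 bits")
    value = -magnitude if negative else magnitude
    # int.to_bytes performs the range check for the signed 16-byte form.
    return value.to_bytes(16, byteorder="big", signed=True)


def twos_complement_to_sign_magnitude(raw: bytes) -> Tuple[bool, int]:
    """Unpack 16 big-endian two's-complement bytes into (sign flag, magnitude)."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    value = int.from_bytes(raw, byteorder="big", signed=True)
    return value < 0, abs(value)
```

Every 16-byte two's-complement value converts to sign-magnitude, but the reverse direction fails for magnitudes of 2 ** 127 and above (except -2 ** 127 itself), and a sign-magnitude negative zero collapses to plain zero.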



[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-07-26 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101901#comment-16101901
 ] 

Wes McKinney commented on ARROW-786:


[~cpcloud] is it possible to do bit twiddling to convert between the 16-byte 
Java/Parquet-compatible representation and the Boost::Multiprecision 
representation? 



[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-07-26 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101887#comment-16101887
 ] 

Phillip Cloud commented on ARROW-786:
-

[~jnadeau] {{__int128_t}} doesn't work on Windows (with vc++), and when I 
originally wrote the decimal code for arrow-cpp, it was buggy with clang: the 
symbols required to link in the libc++ code necessary to use that type were not 
exported by clang. See here: https://bugs.llvm.org//show_bug.cgi?id=26156.

We ultimately went with the boost multiprecision representation (which is sign 
magnitude) because we wanted to reuse existing libraries and get cross-platform 
support out of the box.

One possible alternative (depending on whether the clang issues have been 
resolved) is to write our own pared-down version of something like boost 
multiprecision that uses {{__int128_t}} on UNIXes and two {{int64_t}}s on 
Windows. It wouldn't need to have any operations at the moment, just the 
ability to print itself as a decimal number and to convert decimal strings to 
the underlying type. Even those may be able to be functions rather than methods 
on the class.
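
As a rough illustration of the pared-down wrapper idea, here is a Python sketch of the two pieces mentioned above: splitting a 128-bit value into two 64-bit words, and converting to and from decimal strings. All names are hypothetical. Treating the high word as signed and the low word as unsigned is an assumption, as is the explicit {{scale}} argument (the number of digits after the decimal point).

```python
from typing import Tuple

MASK64 = (1 << 64) - 1


def split_words(value: int) -> Tuple[int, int]:
    """Split a signed 128-bit integer into (signed high word, unsigned low word)."""
    if not -(1 << 127) <= value < (1 << 127):
        raise OverflowError("value does not fit in a signed 128-bit integer")
    unsigned = value & ((1 << 128) - 1)  # two's-complement bit pattern
    low = unsigned & MASK64
    high = unsigned >> 64
    if high >= 1 << 63:  # reinterpret the top word as signed
        high -= 1 << 64
    return high, low


def join_words(high: int, low: int) -> int:
    """Inverse of split_words."""
    return (high << 64) + low


def to_decimal_string(unscaled: int, scale: int) -> str:
    """Render an unscaled integer with `scale` digits after the decimal point."""
    sign = "-" if unscaled < 0 else ""
    digits = str(abs(unscaled))
    if scale == 0:
        return sign + digits
    digits = digits.rjust(scale + 1, "0")  # guarantee a leading integer digit
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"


def from_decimal_string(text: str) -> Tuple[int, int]:
    """Parse a string such as '-123.45' into (unscaled value, scale)."""
    sign = -1 if text.startswith("-") else 1
    body = text[1:] if text and text[0] in "+-" else text
    whole, _, frac = body.partition(".")
    if not (whole + frac).isdigit():
        raise ValueError(f"not a decimal number: {text!r}")
    return sign * int(whole + frac), len(frac)
```

For example, under these assumptions `from_decimal_string("-123.45")` yields the unscaled value -12345 with scale 2, and `split_words` of that value gives the two words that a Windows build could store.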



[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-07-26 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101842#comment-16101842
 ] 

Jacques Nadeau commented on ARROW-786:
--

It would also be good to confirm the memory representation of a <16 x i128> 
vector using LLVM on x86-64.



[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-07-26 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101836#comment-16101836
 ] 

Jacques Nadeau commented on ARROW-786:
--

The current format of the Java implementation uses an embedded sign bit. 
GCC, Clang, and Intel support __int128, which I believe is represented with the 
sign bit embedded on x86-64 machines (?). I remember talking to [~nongli] 
about this years ago and (if I recall correctly) we chose the Parquet 
representation based on his experiments with GCC or Clang/LLVM. (Unfortunately, 
I'm unable to find the thread.)

The current Java implementation supports a 16-byte wide, sign-bit embedded 
two's-complement big-endian representation that is the same as the Parquet 
description here: 

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L81
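
To make the layout concrete, the following Python sketch encodes an unscaled decimal value as 16 big-endian two's-complement bytes, which is how the representation above is understood here. The function names are hypothetical and this is not the Java or Parquet implementation.

```python
def encode_decimal128(unscaled: int) -> bytes:
    """Encode an unscaled value as 16 big-endian two's-complement bytes."""
    # Raises OverflowError outside [-2**127, 2**127 - 1].
    return unscaled.to_bytes(16, byteorder="big", signed=True)


def decode_decimal128(raw: bytes) -> int:
    """Decode 16 big-endian two's-complement bytes into an unscaled value."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    return int.from_bytes(raw, byteorder="big", signed=True)


def sign_bit(raw: bytes) -> int:
    """Return the embedded sign bit: the top bit of the first (most significant) byte."""
    return raw[0] >> 7


if __name__ == "__main__":
    for value in (1, -1, -256):
        raw = encode_decimal128(value)
        print(value, raw.hex(), "sign bit:", sign_bit(raw))
```

Because the sign lives in the most significant bit of the first byte, no separate sign field is needed and the value occupies exactly 16 bytes.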
