[jira] [Created] (ARROW-6402) [C++] Suppress sign-compare warning with g++ 9.2.1

2019-08-30 Thread Sutou Kouhei (Jira)
Sutou Kouhei created ARROW-6402:
---

 Summary: [C++] Suppress sign-compare warning with g++ 9.2.1
 Key: ARROW-6402
 URL: https://issues.apache.org/jira/browse/ARROW-6402
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei


{noformat}
../src/arrow/array/builder_union.cc: In constructor 
'arrow::BasicUnionBuilder::BasicUnionBuilder(arrow::MemoryPool*, 
arrow::UnionMode::type, const std::vector<std::shared_ptr<arrow::ArrayBuilder>>&, 
const std::shared_ptr<arrow::DataType>&)':
../src/arrow/util/logging.h:86:55: error: comparison of integer expressions 
of different signedness: 'std::vector<std::shared_ptr<arrow::ArrayBuilder>>::size_type' 
{aka 'long unsigned int'} and 'signed char' [-Werror=sign-compare]
   86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
      |                                                ~~~~~~~^~~~~~~~
../src/arrow/util/macros.h:43:52: note: in definition of macro 
'ARROW_PREDICT_TRUE'
   43 | #define ARROW_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
      |                                                    ^
../src/arrow/util/logging.h:86:36: note: in expansion of macro 'ARROW_CHECK'
   86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
      |                                    ^~~~~~~~~~~
../src/arrow/util/logging.h:135:19: note: in expansion of macro 
'ARROW_CHECK_LT'
  135 | #define DCHECK_LT ARROW_CHECK_LT
      |                   ^~~~~~~~~~~~~~
../src/arrow/array/builder_union.cc:79:3: note: in expansion of macro 
'DCHECK_LT'
   79 |   DCHECK_LT(type_id_to_children_.size(), std::numeric_limits<int8_t>::max());
      |   ^~~~~~~~~
{noformat}
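
For reference, a minimal standalone sketch of the kind of change that silences the 
warning (casting the signed limit so both operands of the comparison are unsigned). 
This is only an illustration; the actual fix in builder_union.cc may differ, and the 
element type below is a placeholder:

{code:cpp}
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

// Placeholder for type_id_to_children_ in BasicUnionBuilder; the element type
// here is illustrative only.
static std::vector<std::shared_ptr<int>> type_id_to_children_;

void CheckChildCount() {
  // size() is unsigned, numeric_limits<int8_t>::max() is a signed char, so the
  // direct comparison triggers -Wsign-compare with g++ 9.2.1.  Casting the
  // limit to std::size_t makes both operands unsigned.
  assert(type_id_to_children_.size() <
         static_cast<std::size_t>(std::numeric_limits<int8_t>::max()));
}
{code}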





[jira] [Created] (ARROW-6401) [Java] Implement dictionary-encoded subfields for Struct type

2019-08-30 Thread Ji Liu (Jira)
Ji Liu created ARROW-6401:
-

 Summary: [Java] Implement dictionary-encoded subfields for Struct 
type
 Key: ARROW-6401
 URL: https://issues.apache.org/jira/browse/ARROW-6401
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Implement dictionary-encoded subfields for Struct type.

Each child vector will have its own dictionary; the dictionary vector itself is of 
Struct type and holds all of the child dictionaries.





Re: [ANNOUNCE] New Arrow committer: David M Li

2019-08-30 Thread Micah Kornfield
Congrats David, well deserved.

On Fri, Aug 30, 2019 at 2:02 PM Bryan Cutler  wrote:

> Congrats David!
>
> On Fri, Aug 30, 2019 at 10:19 AM Antoine Pitrou 
> wrote:
>
> >
> > Congratulations David and welcome to the team  :-)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 30/08/2019 à 18:21, Wes McKinney a écrit :
> > > On behalf of the Arrow PMC I'm happy to announce that David has
> > > accepted an invitation to become an Arrow committer!
> > >
> > > Welcome, and thank you for your contributions!
> > >
> >
>


Re: Trouble building on Mac OS Mojave

2019-08-30 Thread Chris Teoh
That being said, is there an easier way to do this, such as a Docker container I
could use for the build?

On Sat, 31 Aug 2019 at 12:44, Chris Teoh  wrote:

> Hey there,
>
> Brand new to Arrow here.
>
> Trying to build it following the instructions and I get errors with the
> ORC module building cpp
>
> In file included from
> /Users/test/GitHub/arrow/cpp/build/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc:44:
>
> /Users/test/GitHub/arrow/cpp/build/orc_ep-prefix/src/orc_ep-build/c++/src/orc_proto.pb.cc:960:145:
> error: possible misuse of comma operator here [-Werror,-Wcomma]
> static bool dynamic_init_dummy_orc_5fproto_2eproto = (
>  
> ::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto),
> true);
>
>   ^
>
> /Users/test/GitHub/arrow/cpp/build/orc_ep-prefix/src/orc_ep-build/c++/src/orc_proto.pb.cc:960:57:
> note: cast expression to void to silence warning
> static bool dynamic_init_dummy_orc_5fproto_2eproto = (
>  
> ::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto),
> true);
>
> I can disable the ORC module and that part builds fine, build command is:-
>
> pushd arrow/cpp/build
>
>
>
> cmake -DPYTHON_EXECUTABLE=$VIRTUAL_ENV/bin/python
> -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>
>   -DCMAKE_INSTALL_LIBDIR=lib \
>
>   -DARROW_FLIGHT=ON \
>
>   -DARROW_GANDIVA=ON \
>
>   -DARROW_ORC=OFF \
>
>   -DARROW_PARQUET=ON \
>
>   -DARROW_PYTHON=ON \
>
>   -DARROW_PLASMA=ON \
>
>   -DARROW_BUILD_TESTS=ON \
>
>   ..
>
> make -j4
>
> make install
>
> popd
>
> then I try to build the python module:-
>
> pushd arrow/python
> export PYARROW_WITH_FLIGHT=1
> export PYARROW_WITH_GANDIVA=1
> export PYARROW_WITH_ORC=0
> export PYARROW_WITH_PARQUET=1
> python setup.py build_ext --inplace
> popd
>
> and get:-
> running build_ext
> creating build
> creating build/temp.macosx-10.14-intel-2.7
> -- Running cmake for pyarrow
> cmake -DPYTHON_EXECUTABLE=/Users/test/GitHub/pyarrow/bin/python
>  -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_PARQUET=on
> -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_BUILD_GANDIVA=on
> -DCMAKE_BUILD_TYPE=release /Users/test/GitHub/arrow/python
> -- The C compiler identification is AppleClang 10.0.1.10010046
> -- The CXX compiler identification is AppleClang 10.0.1.10010046
> -- Check for working C compiler:
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
> -- Check for working C compiler:
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
> -- works
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Check for working CXX compiler:
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
> -- Check for working CXX compiler:
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
> -- works
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Compiler command: env LANG=C
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
> -v
> -- Compiler version: Apple LLVM version 10.0.1 (clang-1001.0.46.4)
> Target: x86_64-apple-darwin18.6.0
> Thread model: posix
> InstalledDir:
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
>
> -- Compiler id: Clang
> Selected compiler clang 4.1.0svn
> -- Performing Test CXX_SUPPORTS_SSE4_2
> -- Performing Test CXX_SUPPORTS_SSE4_2 - Success
> -- Performing Test CXX_SUPPORTS_ALTIVEC
> -- Performing Test CXX_SUPPORTS_ALTIVEC - Success
> -- Performing Test CXX_SUPPORTS_ARMCRC
> -- Performing Test CXX_SUPPORTS_ARMCRC - Failed
> -- Performing Test CXX_SUPPORTS_ARMV8_CRC_CRYPTO
> -- Performing Test CXX_SUPPORTS_ARMV8_CRC_CRYPTO - Failed
> -- Arrow build warning level: PRODUCTION
> Configured for RELEASE build (set with cmake
> -DCMAKE_BUILD_TYPE={release,debug,...})
> -- Build Type: RELEASE
> -- Build output directory:
> /Users/test/GitHub/arrow/python/build/temp.macosx-10.14-intel-2.7/release
> -- Found PythonInterp: /Users/test/GitHub/pyarrow/bin/python (found
> version "2.7.10")
> -- Found PythonLibs:
> /System/Library/Frameworks/Python.framework/Versions/2.7/lib/libpython2.7.dylib
> -- Found NumPy: version "1.16.5"
> /Users/test/GitHub/pyarrow/lib/python2.7/site-packages/numpy/core/include
> -- Found PkgConfig: /usr/local/bin/pkg-config (found version "0.29.2")
> -- Found the Arrow core library: /Users/test/GitHub/dist/lib/libarrow.dylib
> -- Found the Arrow Python library:
> /Users/test/GitHub/dist/lib/libarrow_python.dylib
> -- Added shared library dependency arrow_shared:
> /Users/test/GitHub/dist/lib/libarrow.dylib
> -- Added shared library dependency arrow_python_shared:
> 

Trouble building on Mac OS Mojave

2019-08-30 Thread Chris Teoh
Hey there,

Brand new to Arrow here.

I'm trying to build it following the instructions, and I get errors from the ORC
module when building the C++ library:

In file included from
/Users/test/GitHub/arrow/cpp/build/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc:44:
/Users/test/GitHub/arrow/cpp/build/orc_ep-prefix/src/orc_ep-build/c++/src/orc_proto.pb.cc:960:145:
error: possible misuse of comma operator here [-Werror,-Wcomma]
static bool dynamic_init_dummy_orc_5fproto_2eproto = (
 
::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto),
true);

^
/Users/test/GitHub/arrow/cpp/build/orc_ep-prefix/src/orc_ep-build/c++/src/orc_proto.pb.cc:960:57:
note: cast expression to void to silence warning
static bool dynamic_init_dummy_orc_5fproto_2eproto = (
 
::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto),
true);
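
For context, here is a minimal stand-alone sketch of the pattern clang's -Wcomma
flags and the void-cast it suggests; this is only an illustration, not the
generated protobuf code:

// A stand-in for a registration call whose return value is intentionally unused.
int RegisterDescriptors() { return 1; }

// clang -Wcomma flags the next line: the result of RegisterDescriptors() is
// silently discarded by the comma operator inside an initializer.
// static bool init_dummy = (RegisterDescriptors(), true);

// Casting the discarded operand to void states the intent explicitly and
// silences the warning, as the compiler note suggests.
static bool init_dummy = ((void)RegisterDescriptors(), true);

int main() { return init_dummy ? 0 : 1; }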

If I disable the ORC module, that part builds fine. The build command is:

pushd arrow/cpp/build



cmake -DPYTHON_EXECUTABLE=$VIRTUAL_ENV/bin/python
-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \

  -DCMAKE_INSTALL_LIBDIR=lib \

  -DARROW_FLIGHT=ON \

  -DARROW_GANDIVA=ON \

  -DARROW_ORC=OFF \

  -DARROW_PARQUET=ON \

  -DARROW_PYTHON=ON \

  -DARROW_PLASMA=ON \

  -DARROW_BUILD_TESTS=ON \

  ..

make -j4

make install

popd

Then I try to build the Python module:

pushd arrow/python
export PYARROW_WITH_FLIGHT=1
export PYARROW_WITH_GANDIVA=1
export PYARROW_WITH_ORC=0
export PYARROW_WITH_PARQUET=1
python setup.py build_ext --inplace
popd

and get:-
running build_ext
creating build
creating build/temp.macosx-10.14-intel-2.7
-- Running cmake for pyarrow
cmake -DPYTHON_EXECUTABLE=/Users/test/GitHub/pyarrow/bin/python
 -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_PARQUET=on
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_BUILD_GANDIVA=on
-DCMAKE_BUILD_TYPE=release /Users/test/GitHub/arrow/python
-- The C compiler identification is AppleClang 10.0.1.10010046
-- The CXX compiler identification is AppleClang 10.0.1.10010046
-- Check for working C compiler:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Compiler command: env LANG=C
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-v
-- Compiler version: Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.6.0
Thread model: posix
InstalledDir:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

-- Compiler id: Clang
Selected compiler clang 4.1.0svn
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Success
-- Performing Test CXX_SUPPORTS_ALTIVEC
-- Performing Test CXX_SUPPORTS_ALTIVEC - Success
-- Performing Test CXX_SUPPORTS_ARMCRC
-- Performing Test CXX_SUPPORTS_ARMCRC - Failed
-- Performing Test CXX_SUPPORTS_ARMV8_CRC_CRYPTO
-- Performing Test CXX_SUPPORTS_ARMV8_CRC_CRYPTO - Failed
-- Arrow build warning level: PRODUCTION
Configured for RELEASE build (set with cmake
-DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Build output directory:
/Users/test/GitHub/arrow/python/build/temp.macosx-10.14-intel-2.7/release
-- Found PythonInterp: /Users/test/GitHub/pyarrow/bin/python (found version
"2.7.10")
-- Found PythonLibs:
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/libpython2.7.dylib
-- Found NumPy: version "1.16.5"
/Users/test/GitHub/pyarrow/lib/python2.7/site-packages/numpy/core/include
-- Found PkgConfig: /usr/local/bin/pkg-config (found version "0.29.2")
-- Found the Arrow core library: /Users/test/GitHub/dist/lib/libarrow.dylib
-- Found the Arrow Python library:
/Users/test/GitHub/dist/lib/libarrow_python.dylib
-- Added shared library dependency arrow_shared:
/Users/test/GitHub/dist/lib/libarrow.dylib
-- Added shared library dependency arrow_python_shared:
/Users/test/GitHub/dist/lib/libarrow_python.dylib
-- Checking for module 'parquet'
--   No package 'parquet' found
--  Could not find the parquet library. Looked in  system search paths.
CMake Error at CMakeLists.txt:417 (message):
  Unable to locate Parquet libraries


-- Configuring incomplete, errors occurred!

My "dist" folder is as follows:-
dist
dist/bin
dist/bin/plasma-store-server
dist/include

[jira] [Created] (ARROW-6400) Arrow Java Library Build Error

2019-08-30 Thread Tanveer (Jira)
Tanveer created ARROW-6400:
--

 Summary: Arrow Java Library Build Error
 Key: ARROW-6400
 URL: https://issues.apache.org/jira/browse/ARROW-6400
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 0.14.1
Reporter: Tanveer
 Attachments: Screenshot from 2019-08-30 23-16-25.png, Screenshot from 
2019-08-30 23-44-34.png

The Arrow Java library fails to build on both the 'master' and 'maint-0.14.x' 
branches.

Please see the attachments.





Re: [ANNOUNCE] New Arrow committer: David M Li

2019-08-30 Thread Bryan Cutler
Congrats David!

On Fri, Aug 30, 2019 at 10:19 AM Antoine Pitrou  wrote:

>
> Congratulations David and welcome to the team  :-)
>
> Regards
>
> Antoine.
>
>
> Le 30/08/2019 à 18:21, Wes McKinney a écrit :
> > On behalf of the Arrow PMC I'm happy to announce that David has
> > accepted an invitation to become an Arrow committer!
> >
> > Welcome, and thank you for your contributions!
> >
>


[jira] [Created] (ARROW-6399) [C++] More extensive attributes usage could improve debugging

2019-08-30 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6399:


 Summary: [C++] More extensive attributes usage could improve 
debugging
 Key: ARROW-6399
 URL: https://issues.apache.org/jira/browse/ARROW-6399
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Benjamin Kietzman


Wrapping raw or smart pointer parameters and other declarations with 
{{gsl::not_null}} will assert that they are not null. The check is dropped in 
release builds.

Status is tagged with {{ARROW_MUST_USE_RESULT}}, which emits a warning when a 
Status might be ignored (when compiling with clang); Result<> should probably 
be tagged with this too.
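
As a rough illustration of both points (assuming the Guidelines Support Library is 
available for {{gsl::not_null}}, and using a standard attribute with an effect 
similar to ARROW_MUST_USE_RESULT under clang; none of this is the actual Arrow code):

{code:cpp}
#include <gsl/gsl>  // gsl::not_null; assumes the GSL headers are on the include path

struct Status {};

// Wrapping a pointer parameter in gsl::not_null asserts (in debug builds) that
// the caller never passes nullptr; the check is dropped in release builds.
int Deref(gsl::not_null<const int*> value) { return *value.get(); }

// A [[nodiscard]]-style attribute makes the compiler warn when the returned
// Status is silently dropped, much like ARROW_MUST_USE_RESULT does for Status.
[[nodiscard]] Status DoWork() { return Status{}; }

int main() {
  int x = 42;
  int y = Deref(&x);  // fine: the argument is non-null
  (void)DoWork();     // an explicit discard avoids the nodiscard warning
  return y == 42 ? 0 : 1;
}
{code}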





Re: [ANNOUNCE] New Arrow committer: David M Li

2019-08-30 Thread Antoine Pitrou


Congratulations David and welcome to the team  :-)

Regards

Antoine.


Le 30/08/2019 à 18:21, Wes McKinney a écrit :
> On behalf of the Arrow PMC I'm happy to announce that David has
> accepted an invitation to become an Arrow committer!
> 
> Welcome, and thank you for your contributions!
> 


[ANNOUNCE] New Arrow committer: David M Li

2019-08-30 Thread Wes McKinney
On behalf of the Arrow PMC I'm happy to announce that David has
accepted an invitation to become an Arrow committer!

Welcome, and thank you for your contributions!


[jira] [Created] (ARROW-6398) [C++] consolidate ScanOptions and ScanContext

2019-08-30 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6398:


 Summary: [C++] consolidate ScanOptions and ScanContext
 Key: ARROW-6398
 URL: https://issues.apache.org/jira/browse/ARROW-6398
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


Currently ScanOptions has two distinct responsibilities: it contains the data 
selector (and eventually projection schema) for the current scan and it serves 
as the base class for format specific scan options.

In addition, we have ScanContext which holds the memory pool for the current 
scan.

I think these classes should be rearranged as follows: ScanOptions will be 
removed and FileScanOptions will be the abstract base class for format specific 
scan options. ScanContext will be a concrete struct and contain the data 
selector, projection schema, a vector of FileScanOptions, and any other shared 
scan state.
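
To make the proposal easier to picture, here is a very rough sketch of the 
rearranged classes; every type below is a placeholder rather than the actual 
dataset API:

{code:cpp}
#include <memory>
#include <string>
#include <vector>

// Placeholders standing in for the real Arrow types.
struct Schema {};
struct DataSelector {};
class MemoryPool {};

// Abstract base class for format-specific scan options (the role FileScanOptions
// would take over once ScanOptions is removed).
class FileScanOptions {
 public:
  virtual ~FileScanOptions() = default;
  virtual std::string file_type() const = 0;
};

// Concrete, shared per-scan state: the data selector, projection schema,
// format-specific options, and any other shared state such as the memory pool
// that ScanContext holds today.
struct ScanContext {
  std::shared_ptr<DataSelector> selector;
  std::shared_ptr<Schema> projection;
  std::vector<std::shared_ptr<FileScanOptions>> format_options;
  MemoryPool* pool = nullptr;
};
{code}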





[jira] [Created] (ARROW-6397) [C++][CI] Fix S3 minio failure

2019-08-30 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6397:
-

 Summary: [C++][CI] Fix S3 minio failure
 Key: ARROW-6397
 URL: https://issues.apache.org/jira/browse/ARROW-6397
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Continuous Integration
Reporter: Francois Saint-Jacques


See 
[https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef]





Re: [DISCUSS] Ternary logic

2019-08-30 Thread Francois Saint-Jacques
I created the ticket https://issues.apache.org/jira/browse/ARROW-6396;
I think we can offer both.

François


On Thu, Aug 29, 2019 at 5:10 PM Ben Kietzman  wrote:
>
> Indeed it's not about sanitizing nulls; it's about how nulls should
> interact with boolean (and other) expressions.
>
> For purposes of discussion, I'm naming the current approach of propagating
> null "NaN logic" (since all expressions involving NaN evaluate to NaN).
>
> To give some context for this discussion, I'm currently working on support
> for filter expressions (ARROW-6243).
>
> As an example of when this would come into play, let there be a dataset
> spanning several files. The older files have an IPV4 column while the newer
> files have IPV6 as well.
> With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields
> null for all of the older files since they lack an IPV6 column (regardless
> of their IPV4 column) which
> seems undesirable.
>
> Could you explain what you mean by "safest"?
> Under NaN logic, the Kleene result can be recovered with
> (coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false))
> Under Kleene logic, the NaN result can be recovered with (case IPV4 is null
> or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end)
> I don't think we're losing information either way.
>
> I'm not attached to either system but I'd like to understand and document
> the rationale behind our choice.
>
> On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou  wrote:
>
> >
> > IIUC it's not about sanitizing to false.  Ben explained it in more
> > detail in private to me, perhaps he want to copy that explanation here ;-)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 29/08/2019 à 19:05, Wes McKinney a écrit :
> > > hi Ben,
> > >
> > > My instinct is that always propagating null (at least by default) is
> > > the safest choice. Applications can choose to sanitize null to false
> > > if that's what they want semantically.
> > >
> > > - Wes
> > >
> > > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman 
> > wrote:
> > >>
> > >> To my knowledge, there isn't explicit documentation on how null slots
> > in an
> > >> array should be interpreted. SQL uses Kleene logic, wherein a null is
> > >> explicitly an unknown rather than a special value. This yields for
> > example
> > >> `(null AND false) -> false`, since `(x AND false) -> false` for all
> > >> possible values of x. This is also the behavior of Gandiva's boolean
> > >> expressions.
> > >>
> > >> By contrast the boolean kernels implement something closer to the
> > behavior
> > >> of NaN: `(null AND false) -> null`. I think this is simply an error in
> > the
> > >> boolean kernels but in any case I think explicit documentation should be
> > >> added to prevent future confusion.
> > >>
> > >> https://issues.apache.org/jira/browse/ARROW-6386
> >
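
To make the difference concrete, here is a small self-contained sketch with
std::optional<bool> standing in for a nullable boolean (illustration only, not
Arrow's kernel code):

#include <iostream>
#include <optional>

using NullableBool = std::optional<bool>;  // std::nullopt plays the role of null

// Kleene (SQL) AND: null means "unknown", so (null AND false) -> false because
// the result does not depend on the unknown operand.
NullableBool KleeneAnd(NullableBool a, NullableBool b) {
  if (a == false || b == false) return false;
  if (a.has_value() && b.has_value()) return *a && *b;
  return std::nullopt;
}

// "NaN-style" AND: any null operand propagates to a null result.
NullableBool PropagatingAnd(NullableBool a, NullableBool b) {
  if (!a.has_value() || !b.has_value()) return std::nullopt;
  return *a && *b;
}

int main() {
  auto show = [](NullableBool v) { return v.has_value() ? (*v ? "true" : "false") : "null"; };
  std::cout << "Kleene:      null AND false = " << show(KleeneAnd(std::nullopt, false)) << "\n";
  std::cout << "Propagating: null AND false = " << show(PropagatingAnd(std::nullopt, false)) << "\n";
  return 0;
}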


[jira] [Created] (ARROW-6396) [C++] Add CompareOptions to Compare kernels

2019-08-30 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6396:
-

 Summary: [C++] Add CompareOptions to Compare kernels
 Key: ARROW-6396
 URL: https://issues.apache.org/jira/browse/ARROW-6396
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Francois Saint-Jacques


This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE }.
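
A minimal sketch of what the options could look like (the names follow this ticket, 
but the real signature may differ):

{code:cpp}
// Sketch only: an options struct carrying the proposed enum.
enum class ResolveNull {
  KLEENE_LOGIC,    // treat null as "unknown" (SQL semantics)
  NULL_PROPAGATE   // any null input yields a null output
};

struct CompareOptions {
  ResolveNull null_handling = ResolveNull::NULL_PROPAGATE;
};

// A compare kernel could then accept the options, e.g. (hypothetical signature):
// Status Compare(const Array& lhs, const Array& rhs,
//                CompareOptions options, Datum* out);
{code}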





[jira] [Created] (ARROW-6395) [pyarrow] Bug when using bool arrays with stride greater than 1

2019-08-30 Thread Philip Felton (Jira)
Philip Felton created ARROW-6395:


 Summary: [pyarrow] Bug when using bool arrays with stride greater 
than 1
 Key: ARROW-6395
 URL: https://issues.apache.org/jira/browse/ARROW-6395
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Philip Felton


Here's code to reproduce it:

{code:python}
>>> import numpy as np
>>> import pyarrow as pa
>>> pa.__version__
'0.14.0'
>>> xs = np.array([True, False, False, True, True, False, True, True, True, 
>>> False, False, False, False, False, True, False, True, True, True, True, 
>>> True])
>>> xs_sliced = xs[0::2]
>>> xs_sliced
array([ True, False, True, True, True, False, False, True, True,
 True, True])
>>> pa_xs = pa.array(xs_sliced, pa.bool_())
>>> pa_xs

[
 true,
 false,
 false,
 false,
 false,
 false,
 false,
 false,
 false,
 false,
 false
]{code}
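
For background, a tiny standalone C++ sketch of why the stride matters when reading 
the NumPy buffer; this is purely illustrative and is not pyarrow's conversion code:

{code:cpp}
#include <cstdint>
#include <cstdio>

// A NumPy bool array stores one byte per element.  A slice such as xs[0::2]
// shares the original buffer but has a byte stride of 2, so reading it as if
// it were contiguous picks up the wrong elements.
void ReadStrided(const uint8_t* data, int64_t length, int64_t stride) {
  for (int64_t i = 0; i < length; ++i) {
    bool value = data[i * stride] != 0;  // stride-aware: correct
    // bool value = data[i] != 0;        // contiguous assumption: wrong for stride > 1
    std::printf("%s ", value ? "true" : "false");
  }
  std::printf("\n");
}

int main() {
  const uint8_t base[] = {1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0};
  ReadStrided(base, 6, 2);  // visits elements 0, 2, 4, 6, 8, 10 of the base buffer
  return 0;
}
{code}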






[jira] [Created] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector

2019-08-30 Thread Liya Fan (Jira)
Liya Fan created ARROW-6394:
---

 Summary: [Java] Support conversions between delta vector and 
partial sum vector
 Key: ARROW-6394
 URL: https://issues.apache.org/jira/browse/ARROW-6394
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


What is a delta vector/partial sum vector?

Given an integer vector a with length n, its partial sum vector is another 
integer vector b with length n + 1, with values defined as:

b(0) = initial sum
b(i) = a(0) + a(1) + ... + a(i - 1) i = 1, 2, ..., n

Given an integer vector a with length n + 1, its delta vector is another integer 
vector b with length n, with values defined as:

b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1
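
For example (worked directly from the definitions above), with initial sum 0 and 
a = [3, 1, 4], the partial sum vector is b = [0, 3, 4, 8]; applying the delta 
conversion to b recovers [3, 1, 4].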

In this issue, we provide utilities to convert between the delta vector and the 
partial sum vector. It is interesting to note that the two operations correspond 
to discrete integration and differentiation.

These conversions have wide applications. For example,

1. The run-length vector proposed by Micah is based on the partial sum vector, 
while the deduplication functionality is based on the delta vector. This issue 
provides conversions between them.

2. The current VarCharVector/VarBinaryVector implementations are based on the 
partial sum vector. We can transform them to delta vectors before IPC to reduce 
network traffic.

3. Converting to deltas can be considered a form of data compression. The 
operation can be applied more than once to further reduce the data volume.


