[jira] [Created] (ARROW-7037) [C++] Compile error on the combination of protobuf >= 3.9 and clang

2019-10-30 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7037:
---

 Summary: [C++] Compile error on the combination of protobuf >= 
3.9 and clang
 Key: ARROW-7037
 URL: https://issues.apache.org/jira/browse/ARROW-7037
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


I encountered the following compile error with the combination of protobuf 3.10.0 
and clang (Xcode 11).

{noformat}
[13/26] Building CXX object 
c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o
FAILED: c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o
/Applications/Xcode_11.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
   -Ic++/include 
-I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include
 
-I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src
 -Ic++/src -isystem c++/libs/thirdparty/zlib_ep-install/include -isystem 
c++/libs/thirdparty/lz4_ep-install/include -Qunused-arguments 
-fcolor-diagnostics -ggdb -O0 -g -fPIC  -Wno-zero-as-null-pointer-constant 
-Wno-inconsistent-missing-destructor-override -Wno-error=undef -std=c++11 
-Weverything -Wno-c++98-compat -Wno-missing-prototypes 
-Wno-c++98-compat-pedantic -Wno-padded -Wno-covered-switch-default 
-Wno-missing-noreturn -Wno-unknown-pragmas 
-Wno-gnu-zero-variadic-macro-arguments -Wconversion -Wno-c++2a-compat -Werror 
-std=c++11 -Weverything -Wno-c++98-compat -Wno-missing-prototypes 
-Wno-c++98-compat-pedantic -Wno-padded -Wno-covered-switch-default 
-Wno-missing-noreturn -Wno-unknown-pragmas 
-Wno-gnu-zero-variadic-macro-arguments -Wconversion -Wno-c++2a-compat -Werror 
-O0 -g -MD -MT c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o -MF 
c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o.d -o 
c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o -c 
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc
In file included from 
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc:44:
c++/src/orc_proto.pb.cc:959:145: error: possible misuse of comma operator here 
[-Werror,-Wcomma]
static bool dynamic_init_dummy_orc_5fproto_2eproto = (  
::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto),
 true);

^
c++/src/orc_proto.pb.cc:959:57: note: cast expression to void to silence warning
static bool dynamic_init_dummy_orc_5fproto_2eproto = (  
::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto),
 true);

^~~~
static_cast<void>(  
 )
1 error generated.
{noformat}

This may be due to a protobuf bug filed as 
https://github.com/protocolbuffers/protobuf/issues/6619.
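
For reference, the pattern clang flags and the fix its note suggests reduce to 
roughly the following (a minimal sketch; {{Register()}} is an illustrative 
stand-in, not protobuf's actual {{AddDescriptors()}}):

{code}
// Sketch of the -Wcomma pattern; Register() is a stand-in, not protobuf code.
bool Register() { return true; }

// clang -Wcomma warns here: the comma operator discards Register()'s result
static bool init_flagged = (Register(), true);

// the suggested fix: cast the discarded expression to void
static bool init_silenced = (static_cast<void>(Register()), true);

int main() { return init_flagged && init_silenced ? 0 : 1; }
{code}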



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


questions about Gandiva

2019-10-30 Thread Yibo Cai

Hi,

Arrow cpp integrates Gandiva to provide low level operations on arrow buffers. 
[1][2]
I have some questions, any help is appreciated:
- Arrow cpp already has compute kernels[3]; do they duplicate what Gandiva 
provides? I see a Jira talking about it.[4]
- Is Gandiva only for arrow cpp? What about other languages (go, rust, ...)?
- Gandiva leverages SIMD for vectorized operations[1], but I didn't see any 
related code. Am I missing something?

[1] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
[2] https://github.com/apache/arrow/tree/master/cpp/src/gandiva
[3] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
[4] https://issues.apache.org/jira/browse/ARROW-7017

Thanks,
Yibo


[jira] [Created] (ARROW-7036) [C++] Version up ORC to avoid compile errors

2019-10-30 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7036:
---

 Summary: [C++] Version up ORC to avoid compile errors
 Key: ARROW-7036
 URL: https://issues.apache.org/jira/browse/ARROW-7036
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


I encountered compile errors due to {{-Wshadow-field}}, like the ones below:

{noformat}
[1/4] Building CXX object c++/src/CMakeFiles/orc.dir/Vector.cc.o
FAILED: c++/src/CMakeFiles/orc.dir/Vector.cc.o
/Applications/Xcode_11.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
   -Ic++/include 
-I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include
 -I/Users/mrkn/src/github.com/apa
che/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src -Ic++/src -isystem 
c++/libs/thirdparty/zlib_ep-install/include -isystem 
c++/libs/thirdparty/lz4_ep-install/include -Qunused-arguments 
-fcolor-diagnostics -ggdb -O0 -g -fPIC  -Wno-z
ero-as-null-pointer-constant -Wno-inconsistent-missing-destructor-override 
-Wno-error=undef -std=c++11 -Weverything -Wno-c++98-compat 
-Wno-missing-prototypes -Wno-c++98-compat-pedantic -Wno-padded 
-Wno-covered-switch-default -Wno-missing-n
oreturn -Wno-unknown-pragmas -Wno-gnu-zero-variadic-macro-arguments 
-Wconversion -Werror -std=c++11 -Weverything -Wno-c++98-compat 
-Wno-missing-prototypes -Wno-c++98-compat-pedantic -Wno-padded 
-Wno-covered-switch-default -Wno-missing-nore
turn -Wno-unknown-pragmas -Wno-gnu-zero-variadic-macro-arguments -Wconversion 
-Werror -O0 -g -MD -MT c++/src/CMakeFiles/orc.dir/Vector.cc.o -MF 
c++/src/CMakeFiles/orc.dir/Vector.cc.o.d -o 
c++/src/CMakeFiles/orc.dir/Vector.cc.o -c /Users/mr
kn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:59:45:
 error: parameter 'capacity' shadows member inherited from type 
'ColumnVectorBatch' [-Werror,-Wshadow-field]
  LongVectorBatch::LongVectorBatch(uint64_t capacity, MemoryPool& pool
^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14:
 note: declared here
uint64_t capacity;
 ^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:87:49:
 error: parameter 'capacity' shadows member inherited from type 
'ColumnVectorBatch' [-Werror,-Wshadow-field]
  DoubleVectorBatch::DoubleVectorBatch(uint64_t capacity, MemoryPool& pool
^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14:
 note: declared here
uint64_t capacity;
 ^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:115:49:
 error: parameter 'capacity' shadows member inherited from type 
'ColumnVectorBatch' [-Werror,-Wshadow-field]
  StringVectorBatch::StringVectorBatch(uint64_t capacity, MemoryPool& pool
^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14:
 note: declared here
uint64_t capacity;
 ^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:407:55:
 error: parameter 'capacity' shadows member inherited from type 
'ColumnVectorBatch' [-Werror,-Wshadow-field]
  TimestampVectorBatch::TimestampVectorBatch(uint64_t capacity,
  ^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14:
 note: declared here
uint64_t capacity;
 ^
4 errors generated.
{noformat}

Upgrading ORC to 1.5.7 will fix these errors.

I used Xcode 11.1 on macOS Mojave.
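
For reference, the pattern behind {{-Wshadow-field}} reduces to roughly the 
following (a minimal sketch, not ORC's actual classes); one common fix is 
simply renaming the constructor parameter:

{code}
#include <cstdint>

struct ColumnVectorBatch {
  uint64_t capacity;
};

struct LongVectorBatch : ColumnVectorBatch {
  // clang's -Wshadow-field: a parameter named 'capacity' would shadow the
  // member inherited from ColumnVectorBatch, so the parameter is renamed.
  explicit LongVectorBatch(uint64_t cap) { capacity = cap; }
};

int main() { return LongVectorBatch(1024).capacity == 1024 ? 0 : 1; }
{code}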



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: some questions, please help

2019-10-30 Thread Yibo Cai

Thanks Wes, Micah, your comments are very helpful.

Yibo

On 10/30/19 10:45 PM, Wes McKinney wrote:

On Wed, Oct 30, 2019 at 9:32 AM Micah Kornfield  wrote:





- I see some SIMD optimizations in arrow go binding, such as vectored

sum. [2]

But arrow cpp lib doesn't leverage SIMD. [3]
Why not optimize it in cpp lib so all languages can benefit?

You're welcome to contribute such optimizations to the C++ library



Note that even though C++ doesn't use explicit SIMD intrinsics, often times
the compiler will generate SIMD code because it can auto-vectorize the
code.


Note it will likely be important to have explicit dynamic/runtime SIMD
dispatching on certain hot paths as we build binaries that need to be
able to run on both newer and older CPUs
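
As a rough sketch of what such dispatching could look like (illustrative
only, not Arrow code): compile the same kernel twice, once with AVX2
enabled, and pick a function pointer once at startup based on what the
CPU actually supports.

#include <cstddef>

double SumScalar(const double* data, std::size_t n) {
  double s = 0;
  for (std::size_t i = 0; i < n; ++i) s += data[i];
  return s;
}

#if defined(__GNUC__)
// same loop, compiled for AVX2 so the compiler can auto-vectorize it
__attribute__((target("avx2")))
double SumAvx2(const double* data, std::size_t n) {
  double s = 0;
  for (std::size_t i = 0; i < n; ++i) s += data[i];
  return s;
}
#endif

using SumFn = double (*)(const double*, std::size_t);

SumFn ResolveSum() {
#if defined(__GNUC__)
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumScalar;  // safe fallback for older CPUs
}

int main() {
  const double data[4] = {1, 2, 3, 4};
  return ResolveSum()(data, 4) == 10.0 ? 0 : 1;
}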


On Wed, Oct 30, 2019 at 7:25 AM Wes McKinney  wrote:


hi Yibo

On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai  wrote:


Hi,

I'm new to Arrow and would like to ask for help with some questions. Any

comment is welcome.


- About source code tree, my understanding is that "cpp" is the core arrow

libraries, "c_glib, go, python, ..." are language bindings to ease
integrating arrow into apps developed by that language. Is that correct?

No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust

* C/GLib, MATLAB, Python, R bind to C++
* Ruby binds to GLib


- Arrow implements many data types and aggregation functions(sum, mean,

...). [1]

IMO, more functions and types should be supported, like min/max,

vector/tensor operations, big number, etc. I'm not sure if this is in
arrow's scope, or the apps using arrow should deal with it themselves.

Our objective at least in the C++ library is to have a generally
useful "standard library" that handles common application concerns.
Whether or not something is thought to be in scope may vary on a case
by case basis -- if you can't find a JIRA issue for something in
particular, please go ahead and open one.


- I see some SIMD optimizations in arrow go binding, such as vectored

sum. [2]

But arrow cpp lib doesn't leverage SIMD. [3]
Why not optimize it in cpp lib so all languages can benefit?


You're welcome to contribute such optimizations to the C++ library


- Wes


[1]

https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels

[2]

https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s

[3]

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111


Yibo




Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-30-0

2019-10-30 Thread Neal Richardson
https://issues.apache.org/jira/browse/ARROW-7034 (pending +1/merge)
will rid us of these meddlesome failures.

Neal

On Wed, Oct 30, 2019 at 11:25 AM Wes McKinney  wrote:
>
> The failed tasks here are a nuisance. If they can't be fixed, should
> they be removed from the nightlies?
>
> On Wed, Oct 30, 2019 at 7:26 AM Crossbow  wrote:
> >
> >
> > Arrow Build Report for Job nightly-2019-10-30-0
> >
> > All tasks: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0
> >
> > Failed Tasks:
> > - docker-clang-format:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-clang-format
> > - docker-r-sanitizer:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-r-sanitizer
> >
> > Succeeded Tasks:
> > - centos-6:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-6
> > - centos-7:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-7
> > - centos-8:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-8
> > - conda-linux-gcc-py27:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py27
> > - conda-linux-gcc-py36:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py36
> > - conda-linux-gcc-py37:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py37
> > - conda-osx-clang-py27:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py27
> > - conda-osx-clang-py36:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py36
> > - conda-osx-clang-py37:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py37
> > - conda-win-vs2015-py36:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-win-vs2015-py36
> > - conda-win-vs2015-py37:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-win-vs2015-py37
> > - debian-buster:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-debian-buster
> > - debian-stretch:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-debian-stretch
> > - docker-c_glib:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-c_glib
> > - docker-cpp-cmake32:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-cmake32
> > - docker-cpp-release:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-release
> > - docker-cpp-static-only:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-static-only
> > - docker-cpp:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp
> > - docker-dask-integration:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-dask-integration
> > - docker-docs:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-docs
> > - docker-go:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-go
> > - docker-hdfs-integration:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-hdfs-integration
> > - docker-iwyu:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-iwyu
> > - docker-java:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-java
> > - docker-js:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-js
> > - docker-lint:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-lint
> > - docker-pandas-master:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-pandas-master
> > - docker-python-2.7-nopandas:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-2.7-nopandas
> > - docker-python-2.7:
> >   URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-2.7
> > - docker-python-3.6-nopandas:
> >   URL: 
> > 

[jira] [Created] (ARROW-7035) [R] Default arguments are unclear in write_parquet docs

2019-10-30 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-7035:
-

 Summary: [R] Default arguments are unclear in write_parquet docs
 Key: ARROW-7035
 URL: https://issues.apache.org/jira/browse/ARROW-7035
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 0.15.0
 Environment: Ubuntu with libparquet-dev 0.15.0-1, R 3.6.1, and arrow 
0.15.0.
Reporter: Karl Dunkle Werner
 Fix For: 0.15.1


Thank you so much for adding support for reading and writing parquet files in 
R! I have a few questions about the user interface and optional arguments, but 
I want to highlight how great it is to have this useful filetype to pass data 
back and forth.

The defaults for the optional arguments in {{arrow::write_parquet}} aren't 
always clear. Here were my questions after reading the help docs from 
{{write_parquet}}:
 * What's the default {{version}}? Should a user prefer "2.0" for new projects?
 * What are acceptable values for {{compression}}? (Answer: {{uncompressed}}, 
{{snappy}}, {{gzip}}, {{brotli}}, {{zstd}}, or {{lz4}}.)
 * What's the default for {{use_dictionary}}? Seems to be {{TRUE}}, at least 
some of the time.
 * What's the default for {{write_statistics}}? Should a user prefer {{TRUE}}?
 * Can I assume {{allow_truncated_timestamps}} is {{FALSE}} by default?

As someone who works in both R and Python, I was a little surprised that 
pyarrow uses snappy compression by default while R's default is uncompressed. My 
preference would be to have the same default arguments in both, but that might 
be a fringe use case.

While I was digging into this, I was surprised that {{ParquetReaderProperties}} 
is exported and documented, but {{ParquetWriterProperties}} isn't. Is that 
intentional?

Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7034) [CI][Crossbow] Skip known nightly failures

2019-10-30 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7034:
--

 Summary: [CI][Crossbow] Skip known nightly failures
 Key: ARROW-7034
 URL: https://issues.apache.org/jira/browse/ARROW-7034
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Neal Richardson


The failures are ticketed. There's no point in running them if we know they're 
failing. The patches that fix the builds can add them back to the nightly list.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Result vs Status

2019-10-30 Thread Wes McKinney
Returning to this discussion.

Here is my position on the matter, since this was brought up on the
sync call today:

* For internal / non-public and pseudo-non-public APIs that have
return/out values
  - Use Result or Status at discretion of the developer, but Result
is preferable

* For new public APIs with return/out values
  - Prefer Result unless a Status-based API seems definitely less
awkward in real world use. I have to say that I'm skeptical about the
relative usability of std::tuple outputs and don't think we should
force the use of Result for technical purity reasons

* For existing Status APIs with return values
  - Incrementally add Result APIs and deprecate Status-based APIs.
Maintain deprecated Status APIs for ~2 major releases
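
To make the trade-off concrete, here is a sketch of the two styles for a
hypothetical factory function (not an actual Arrow API):

#include <arrow/result.h>
#include <arrow/status.h>
#include <memory>

class Thing {};

// Status style: the error is the return value, the result comes back
// through an out-parameter
arrow::Status MakeThing(std::shared_ptr<Thing>* out) {
  *out = std::make_shared<Thing>();
  return arrow::Status::OK();
}

// Result style: the error and the result travel together in one object
arrow::Result<std::shared_ptr<Thing>> MakeThing() {
  return std::make_shared<Thing>();
}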

On Thu, Oct 24, 2019 at 5:16 PM Omer F. Ozarslan  wrote:
>
> Hi Micah,
>
> You're right. It's quite possible that clang-query counted the same function
> separately for each include in each file. (I was iterating each file
> separately, but providing all of them at once didn't change the result
> either.)
>
> It's cool and wrong, so not very useful apparently. :-)
>
> Best,
> Omer
>
> On Thu, Oct 24, 2019 at 4:51 PM Micah Kornfield  wrote:
> >
> > Hi Omer,
> > I think this is really cool.  It is quite possible it was underestimated (I 
> > agree about line lengths), but I think the clang query is double counting 
> > somehow.
> >
> > For instance:
> >
> > "grep -r Status *" only returns ~9000 results in total for me.
> >
> > Similarly using grep for "FinishTyped" returns 18 results for me.  
> > Searching through the log that you linked seems to return 450 (for "Status 
> > FinishTyped").
> >
> > It is quite possible, I'm doing something naive with grep.
> >
> > Thanks,
> > Micah
> >
> > On Thu, Oct 24, 2019 at 2:41 PM Omer F. Ozarslan  wrote:
> >>
> >> Forgot to mention that most of those lines are longer than the line width
> >> while out is usually (always?) the last parameter, so that's probably why
> >> grep underestimates their number.
> >>
> >> On Thu, Oct 24, 2019 at 4:33 PM Omer F. Ozarslan  wrote:
> >> >
> >> > Hi,
> >> >
> >> > I don't have much experience with customized clang-tidy plugins, but
> >> > this might be a good use case for such a plugin from what I read here
> >> > and there (frankly this was a good excuse for me to have a look at
> >> > clang tooling as well). I wanted to ensure it isn't obviously overkill
> >> > before this suggestion: Running a clang query which lists functions
> >> > returning `arrow::Status` and taking a pointer parameter named `out`
> >> > showed that there are 13947 such functions in `cpp/src/**/*.h`. [1]
> >> >
> >> > I checked logs and it seemed legitimate to me, but please check it in
> >> > case I missed something. If that's the case, it might be tedious to do
> >> > this work manually.
> >> >
> >> > [1]: https://gist.github.com/ozars/ecbb1b8acd4a57ba4721c1965f83f342
> >> > (Note that the log file is shown as truncated by github after ~30k
> >> > lines)
> >> >
> >> > Best,
> >> > Omer
> >> >
> >> >
> >> >
> >> > On Wed, Oct 23, 2019 at 9:23 PM Micah Kornfield  
> >> > wrote:
> >> > >
> >> > > OK, it sounds like people want Result (at least in some 
> >> > > circumstances).
> >> > > Any thoughts on migrating old APIs and what to do for new APIs going
> >> > > forward?
> >> > >
> >> > > A very rough approximation [1] yields the following counts by module:
> >> > >
> >> > >  853 arrow
> >> > >
> >> > >   17 gandiva
> >> > >
> >> > >   25 parquet
> >> > >
> >> > >   50 plasma
> >> > >
> >> > >
> >> > >
> >> > > [1] grep -r Status cpp/src/* |grep ".h:" | grep "\\*" |grep -v Accept 
> >> > > |sed
> >> > > s/:.*// | cut -f3 -d/ |sort
> >> > >
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Micah
> >> > >
> >> > >
> >> > >
> >> > > On Sat, Oct 19, 2019 at 7:50 PM Francois Saint-Jacques <
> >> > > fsaintjacq...@gmail.com> wrote:
> >> > >
> >> > > > As mentioned, Result is an improvement for functions which return a
> >> > > > single value, e.g. Make/Factory-like. My vote goes to Result for such
> >> > > > case. For multiple return types, we have std::tuple like Antoine
> >> > > > proposed.
> >> > > >
> >> > > > François
> >> > > >
> >> > > > On Fri, Oct 18, 2019 at 9:19 PM Antoine Pitrou  
> >> > > > wrote:
> >> > > > >
> >> > > > >
> >> > > > > Le 18/10/2019 à 20:58, Wes McKinney a écrit :
> >> > > > > > I'm definitely uncomfortable with the idea of deprecating Status.
> >> > > > > >
> >> > > > > > We have a few kinds of functions that can fail:
> >> > > > > >
> >> > > > > > 1. Functions with no "out" arguments
> >> > > > > > 2. Functions with one out argument
> >> > > > > > 3. Functions with multiple out arguments
> >> > > > > >
> >> > > > > > IMHO functions in category 2 are the best candidates for 
> >> > > > > > utilizing
> >> > > > > > Status. In some cases, Case 3 may be more usable Result-based, 
> >> > > > > > but it
> >> > > > > > can also create more work (or confusion) on the part of the 
> >> > > > > > developer,
> >> > > > 

Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding

2019-10-30 Thread Wes McKinney
I wrote in on the original DISCUSS thread. I believe Antoine is
unavailable this week, but hopefully we can drive the discussion to a
consensus point next week so we can vote

On Sat, Oct 26, 2019 at 12:01 AM Micah Kornfield  wrote:
>
> I think at least the wording was confusing because you raised questions on 
> the PR and Antoine commented here.
>
> I agree with your analysis that it probably would not be hard to support.
> But I don't feel too strongly either way on this particular point, aside from
> wanting to come to a resolution. If I had to choose, I'd prefer allowing Delta
> dictionaries in files.
>
> On Friday, October 25, 2019, Wes McKinney  wrote:
>>
>> Can we discuss the delta dictionary issue a bit more? I admit I don't
>> share the same concerns.
>>
>> From the perspective of a file and stream producer, the code paths
>> should be nearly identical. The differences with the file format are:
>>
>> * Magic numbers to detect that it is the "file format"
>> * Accumulated metadata at the footer
>>
>> If a file has any dictionaries at all, then they must all be
>> reconstructed before reading a record batch. So let's say we have a
>> file like
>>
>> DICTIONARY ID=0, isDelta=FALSE
>> BATCH 0
>> BATCH 1
>> BATCH 2
>> DICTIONARY ID=0, isDelta=TRUE
>> BATCH 3
>> DICTIONARY ID=0, isDelta=TRUE
>> BATCH 4
>>
>> I do not see any harm in this -- the only downside is that you won't
>> know what "state" the dictionary was in for the first 3 batches.
>> Viewing dictionary encoding strictly as a data representation method,
>> the batches 0-2 and 3 represent the same data even if their in-memory
>> dictionaries are larger than they were than the moment in which they
>> were written
>>
>> Note that the code path for "processing" the dictionaries as a first
>> step will use the same code as the stream path. It should not be a
>> great deal of work to write test cases for this
>>
>> On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield  
>> wrote:
>> >
>> > Hi Antoine,
>> > There is a defined order for dictionaries in metadata.  What isn't well
>> > defined is relative ordering between record batches and Delta dictionaries.
>> >
>> >  However, this point seems confusing. I can't think of a real-world use
>> > case where it would be valuable enough to include, so I will remove Delta
>> > dictionaries.
>> >
>> > So let's cancel this vote and I'll start a new one after the update.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Thursday, October 24, 2019, Antoine Pitrou  wrote:
>> >
>> > >
>> > > Le 24/10/2019 à 04:39, Micah Kornfield a écrit :
>> > > >
>> > > > 3.  Clarifies that the file format can only contain 1 "NON-delta"
>> > > > dictionary batch and multiple "delta" dictionary batches.
>> > >
>> > > This is a bit weird.  If the file format can carry delta dictionaries,
>> > > it means order is significant, so it may as well contain dictionary
>> > > redefinitions.
>> > >
>> > > If the file format is meant to be truly readable in random order, then
>> > > it should also forbid delta dictionaries.
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> > >


Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

2019-10-30 Thread Wes McKinney
Returning to this discussion, as there seems to be a lack of consensus in the vote thread.

Copying Micah's proposals in the VOTE thread here, I wanted to state
my opinions so we can discuss further and see where there is potential
disagreement

1.  It is not required that all dictionary batches occur at the beginning
of the IPC stream format (if the first record batch has an all-null
dictionary encoded column, the null column's dictionary might not be sent
until later in the stream).

This seems preferable to requiring a placeholder empty dictionary
batch. This does mean more to test but the integration tests will
force the issue

2.  A second dictionary batch for the same ID that is not a "delta batch"
in an IPC stream indicates the dictionary should be replaced.

Agree.

3.  Clarifies that the file format can only contain 1 "NON-delta"
dictionary batch and multiple "delta" dictionary batches.

Agree -- it is also worth stating explicitly that dictionary
replacements are not allowed in the file format.

In the file format, all the dictionaries must be "loaded" up front.
The code path for loading the dictionaries ideally should use nearly
the same code as the stream-reader code that sees follow-up dictionary
batches interspersed in the stream. The only downside is that it will
not be possible to exactly preserve the dictionary "state" as of each
record batch being written.

So if we had a file containing

DICTIONARY ID=0
RECORD BATCH
RECORD BATCH
DICTIONARY DELTA ID=0
RECORD BATCH
RECORD BATCH

Then after processing/loading the dictionaries, the first two record
batches will have a dictionary that is "larger" (on account of the
delta) than when they were written. Since dictionaries are
fundamentally about data representation, they still represent the same
data so I think this is acceptable.
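
The loading rule described here is simple enough to sketch with plain
vectors standing in for dictionary batches (illustrative only, not the
actual IPC reader):

#include <string>
#include <vector>

using Dictionary = std::vector<std::string>;

// isDelta=false sets the dictionary; isDelta=true appends to it
void ApplyDictionaryBatch(Dictionary* dict, const Dictionary& batch,
                          bool is_delta) {
  if (!is_delta) {
    *dict = batch;
  } else {
    dict->insert(dict->end(), batch.begin(), batch.end());
  }
}

int main() {
  Dictionary dict;
  ApplyDictionaryBatch(&dict, {"a", "b"}, /*is_delta=*/false);
  ApplyDictionaryBatch(&dict, {"c"}, /*is_delta=*/true);
  // dict is now {"a", "b", "c"}: indices written before the delta still
  // resolve to the same values, which is why loading all deltas up front
  // is safe for the file format.
  return dict.size() == 3 ? 0 : 1;
}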

4.  Add an enum to dictionary metadata for possible future changes in what
format dictionary batches can be sent (the most likely would be an array
Map).  An enum is needed as a placeholder to allow for forward
compatibility past the 1.0.0 release.

I'm least sure about this but I do not think it is harmful to have a
forward-compatible "escape hatch" for future evolutions in dictionary
encoding.

On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield  wrote:
>
> I'll plan on starting a vote in the next day or two if there are no further
> objections/comments.
>
> On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield 
> wrote:
>
> > I think the only point asked on the PR that I think is worth discussing is
> > assumptions about dictionaries at the beginning of streams.
> >
> > There are two options:
> > 1.  Based on the current wording, it does not seem that all dictionaries
> > need to be at the beginning of the stream if they aren't made use of in the
> > first record batch (i.e. a dictionary encoded column is all null in the
> > first record batch).
> > 2.  We require a dictionary batch for each dictionary at the beginning of
> > the stream (and require implementations to send an empty batch if they
> > don't have the dictionary available).
> >
> > The current proposal in the PR is option #1.
> >
> > Thanks,
> > Micah
> >
> > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield 
> > wrote:
> >
> >> I've opened a pull request [1] to clarify some recent conversations about
> >> semantics/edge cases for dictionary encoding [2][3] around interleaved
> >> batches and when isDelta=False.
> >>
> >> Specifically, it proposes isDelta=False indicates dictionary
> >> replacement.  For the file format, only one isDelta=False batch is allowed
> >> per file and isDelta=true batches are applied in the order supplied in the
> >> file footer.
> >>
> >> In addition, I've added a new enum to DictionaryEncoding to preserve
> >> future compatibility in case we want to expand dictionary encoding to be an
> >> explicit mapping from "ID" to "VALUE" as discussed in [4].
> >>
> >> Once people have had a chance to review and come to a consensus, I will
> >> call a formal vote to approve and commit the change.
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1] https://github.com/apache/arrow/pull/5585
> >> [2]
> >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
> >> [3]
> >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
> >> [4]
> >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
> >>
> >>


[jira] [Created] (ARROW-7033) Error in ./configure step for jemalloc when building on OSX 10.14.6

2019-10-30 Thread Christian Hudon (Jira)
Christian Hudon created ARROW-7033:
--

 Summary: Error in ./configure step for jemalloc when building on 
OSX 10.14.6
 Key: ARROW-7033
 URL: https://issues.apache.org/jira/browse/ARROW-7033
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Christian Hudon


Hello. I'm trying to build the C++ part of Apache Arrow (as a first step to 
possible contributions). I'm following the C++ Development instructions, but 
running into an error early. I also looked at ARROW-4935, but the cause there 
seems different, so I'm opening a new bug report.

I'm on macOS 10.14.6. I have the Xcode CLI tools installed (via xcode-select), 
and installed the other dependencies with Homebrew, giving it the cpp/Brewfile. 
I want to be able to run the tests, so I'm configuring a debug build with:

  cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON ..

from an out-of-source build, in a cpp/debug directory. Then, running make, I 
get very quickly the following error:

$ make
[ 0%] Performing configure step for 'jemalloc_ep'
CMake Error at /Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-configure-DEBUG.cmake:49 (message):
  Command failed: 1
  './configure' 'AR=/Library/Developer/CommandLineTools/usr/bin/ar' 'CC=/Library/Developer/CommandLineTools/usr/bin/cc' '--prefix=/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep/dist/' '--with-jemalloc-prefix=je_arrow_' '--with-private-namespace=je_arrow_private_' '--without-export' '--disable-cxx' '--disable-libdl' '--disable-initial-exec-tls'
  See also /Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-configure-*.log
make[2]: *** [jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-configure] Error 1
make[1]: *** [CMakeFiles/jemalloc_ep.dir/all] Error 2
make: *** [all] Error 2

Looking into the log file as suggested, I see:

configure: error: in 
`/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep':
configure: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details 

... which seems a bit suspicious. Running the ./configure invocation manually, 
I get the same error:

$ './configure' 'AR=/Library/Developer/CommandLineTools/usr/bin/ar' 'CC=/Library/Developer/CommandLineTools/usr/bin/cc' '--prefix=/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep/dist/' '--with-jemalloc-prefix=je_arrow_' '--with-private-namespace=je_arrow_private_' '--without-export' '--disable-cxx' '--disable-libdl' '--disable-initial-exec-tls'
checking for xsltproc... /usr/bin/xsltproc
checking for gcc... /Library/Developer/CommandLineTools/usr/bin/cc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... configure: error: in `/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep':
configure: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details

Digging into config.log, I see:

configure:3213: checking whether we are cross compiling
*configure:3221: /Library/Developer/CommandLineTools/usr/bin/cc -o conftest 
conftest.c >&5*
*conftest.c:9:10: fatal error: 'stdio.h' file not found*
#include <stdio.h>
 ^
1 error generated.
configure:3225: $? = 1
configure:3232: ./conftest
./configure: line 3234: ./conftest: No such file or directory
configure:3236: $? = 127
configure:3243: error: in 
`/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep':
configure:3245: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.

(Relevant bit in bold.) Well, that would make more sense, at least. I create a 
close-enough conftest.c by hand:

#include <stdio.h>

int main(void) { return 0; }

and try to compile it with the same command-line invocation:

$ /Library/Developer/CommandLineTools/usr/bin/cc -o conftest conftest.c

I get that same error:

conftest.c:1:10: fatal error: 'stdio.h' file not found
#include <stdio.h>
 ^
1 error generated.

However, I also have a cc in /usr/bin. If I try that one instead, things work:

$ /usr/bin/cc -o conftest conftest.c
$ ls -l conftest
-rwxr-xr-x 1 chrish staff 4,2K oct 30 16:03 conftest*
$ ./conftest

(No error compiling or running conftest.c.)

The two executables seem to be the same compiler (or at least the exact same 
version):

$ /usr/bin/cc --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

$ /Library/Developer/CommandLineTools/usr/bin/cc --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)

Re: [VOTE] Release Apache Arrow 0.15.1 - RC0

2019-10-30 Thread Wes McKinney
+1 (binding)

* Verified source on Ubuntu 18.04 (using 0.15.1 RC verification script)
* Verified wheels on Linux, macOS, and Windows using
"verify-release-candidate.sh wheels ..." and
verify-release-candidate-wheels.bat
* Verified Linux binaries

Thanks for fixing the macOS wheel!

On Wed, Oct 30, 2019 at 11:24 AM Krisztián Szűcs
 wrote:
>
> Hi,
>
> I've uploaded the correct wheel for CPython 3.7 on macOS, and also
> tested it locally; it works properly. I created a JIRA [1] to test the
> wheels in the release verification script similarly to how we test the
> linux packages. This should catch both the uploading issues and
> the linking errors causing most of the troubles with wheels.
>
> Thanks, Krisztian
>
> [1]: https://issues.apache.org/jira/browse/ARROW-7032
>
> On Tue, Oct 29, 2019 at 6:40 PM Krisztián Szűcs
>  wrote:
> >
> > I have the same binary locally, so something must have happened
> > silently during the downloading process, without exiting with an error.
> > The proper wheel is available under the GitHub release for that
> > particular crossbow task here [1].
> > I'll download, sign and upload it to Bintray tomorrow evening (CET).
> >
> > [1]: 
> > https://github.com/ursa-labs/crossbow/releases/tag/build-722-travis-wheel-osx-cp37m
> >
> > On Mon, Oct 28, 2019 at 11:00 PM Wes McKinney  wrote:
> > >
> > > I started looking at some of the Python wheels and found that the
> > > macOS Python 3.7 wheel is corrupted. Note that it's only 101KB while
> > > the other macOS wheels are ~35MB.
> > >
> > > Eyeballing the file list at
> > >
> > > https://bintray.com/apache/arrow/python-rc/0.15.1-rc0#files/python-rc/0.15.1-rc0
> > >
> > > it seems this is the only wheel with this issue, but this suggests
> > > that we should prioritize some kind of wheel integrity check using
> > > Crossbow jobs. An issue for this is
> > >
> > > https://issues.apache.org/jira/browse/ARROW-2880
> > >
> > > I'm going to check out some other wheels to see if they are OK, but
> > > maybe just this one wheel can be regenerated?
> > >
> > > On Sun, Oct 27, 2019 at 4:31 PM Sutou Kouhei  wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > I ran the followings on Debian GNU/Linux sid:
> > > >
> > > >   * TEST_CSHARP=0 \
> > > >   JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
> > > >   CUDA_TOOLKIT_ROOT=/usr \
> > > > dev/release/verify-release-candidate.sh source 0.15.1 0
> > > >   * dev/release/verify-release-candidate.sh binaries 0.15.1 0
> > > >
> > > > with:
> > > >
> > > >   * gcc (Debian 9.2.1-8) 9.2.1 20190909
> > > >   * openjdk version "1.8.0_232-ea"
> > > >   * Node.JS v12.1.0
> > > >   * go version go1.12.10 linux/amd64
> > > >   * nvidia-cuda-dev 10.1.105-3+b1
> > > >
> > > > Notes:
> > > >
> > > >   * C# sourcelink failed as usual.
> > > >
> > > >   * We can't use dev/release/verify-release-candidate.sh on
> > > > master to verify source because it depends on the latest
> > > > archery. We need to use
> > > > dev/release/verify-release-candidate.sh in 0.15.1.
> > > >
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > > > In 
> > > >   "[VOTE] Release Apache Arrow 0.15.1 - RC0" on Fri, 25 Oct 2019 
> > > > 20:43:07 +0200,
> > > >   Krisztián Szűcs  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I would like to propose the following release candidate (RC0) of 
> > > > > Apache
> > > > > Arrow version 0.15.1. This is a patch release consisting of 36 
> > > > > resolved
> > > > > JIRA issues[1].
> > > > >
> > > > > This release candidate is based on commit:
> > > > > b789226ccb2124285792107c758bb3b40b3d082a [2]
> > > > >
> > > > > The source release rc0 is hosted at [3].
> > > > > The binary artifacts are hosted at [4][5][6][7].
> > > > > The changelog is located at [8].
> > > > >
> > > > > Please download, verify checksums and signatures, run the unit tests,
> > > > > and vote on the release. See [9] for how to validate a release 
> > > > > candidate.
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 Release this as Apache Arrow 0.15.1
> > > > > [ ] +0
> > > > > [ ] -1 Do not release this as Apache Arrow 0.15.1 because...
> > > > >
> > > > > [1]: 
> > > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.15.1
> > > > > [2]: 
> > > > > https://github.com/apache/arrow/tree/b789226ccb2124285792107c758bb3b40b3d082a
> > > > > [3]: 
> > > > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.15.1-rc0
> > > > > [4]: https://bintray.com/apache/arrow/centos-rc/0.15.1-rc0
> > > > > [5]: https://bintray.com/apache/arrow/debian-rc/0.15.1-rc0
> > > > > [6]: https://bintray.com/apache/arrow/python-rc/0.15.1-rc0
> > > > > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.15.1-rc0
> > > > > [8]: 
> > > > > https://github.com/apache/arrow/blob/b789226ccb2124285792107c758bb3b40b3d082a/CHANGELOG.md
> > > > > [9]: 
> > > > > 

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-30-0

2019-10-30 Thread Wes McKinney
The failed tasks here are a nuisance. If they can't be fixed, should
they be removed from the nightlies?

On Wed, Oct 30, 2019 at 7:26 AM Crossbow  wrote:
>
>
> Arrow Build Report for Job nightly-2019-10-30-0
>
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0
>
> Failed Tasks:
> - docker-clang-format:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-clang-format
> - docker-r-sanitizer:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-r-sanitizer
>
> Succeeded Tasks:
> - centos-6:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-6
> - centos-7:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-7
> - centos-8:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-8
> - conda-linux-gcc-py27:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py27
> - conda-linux-gcc-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py36
> - conda-linux-gcc-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py37
> - conda-osx-clang-py27:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py27
> - conda-osx-clang-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py37
> - conda-win-vs2015-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-win-vs2015-py36
> - conda-win-vs2015-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-win-vs2015-py37
> - debian-buster:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-debian-buster
> - debian-stretch:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-debian-stretch
> - docker-c_glib:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-c_glib
> - docker-cpp-cmake32:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-cmake32
> - docker-cpp-release:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-release
> - docker-cpp-static-only:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-static-only
> - docker-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp
> - docker-dask-integration:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-dask-integration
> - docker-docs:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-docs
> - docker-go:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-go
> - docker-hdfs-integration:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-hdfs-integration
> - docker-iwyu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-iwyu
> - docker-java:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-java
> - docker-js:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-js
> - docker-lint:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-lint
> - docker-pandas-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-pandas-master
> - docker-python-2.7-nopandas:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-2.7-nopandas
> - docker-python-2.7:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-2.7
> - docker-python-3.6-nopandas:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-3.6-nopandas
> - docker-python-3.6:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-3.6
> - docker-python-3.7:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-3.7
> - 

Re: Arrow sync call October 30 at 12:00 US/Eastern, 16:00 UTC

2019-10-30 Thread Neal Richardson
Attendees:
* Uwe Korn
* Micah Kornfield
* Praveen Kumar
* Wes McKinney
* Rok Mihevc
* Neal Richardson


Discussion:
* docker-compose/github-actions
(https://github.com/apache/arrow/pull/5589). Needs review, needs to be
merged and have followup issues made. Currently too many jobs being
run on every commit.
* result vs. status C++: following up on previous discussion
* Parquet PRs: who is blessed to merge? Technically should be an
Apache Parquet committer (Wes, Uwe, Deepak, others?). If reviewing,
ask one of them to merge.
* C API: outstanding concerns (1) use of JSON for metadata, (2) who
owns the data and has to free it? Uwe and Micah to review the
C++/Python/R implementation

On Tue, Oct 29, 2019 at 8:52 PM Neal Richardson
 wrote:
>
> Hi all, reminder that our biweekly call is 12 hours from now at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes
> will be sent out to the mailing list afterwards.
>
> Neal


[jira] [Created] (ARROW-7032) [Release] Verify python wheels in the release verification script

2019-10-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7032:
--

 Summary: [Release] Verify python wheels in the release 
verification script
 Key: ARROW-7032
 URL: https://issues.apache.org/jira/browse/ARROW-7032
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
 Fix For: 1.0.0


For Linux wheels use Docker; otherwise set up a virtualenv and install the wheel 
supported on the host's platform. 
Testing should include imports of the optional modules and perhaps running 
the unit tests, but the import testing should catch most of the wheel issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: some questions, please help

2019-10-30 Thread Wes McKinney
On Wed, Oct 30, 2019 at 9:32 AM Micah Kornfield  wrote:
>
> >
> > > - I see some SIMD optimizations in arrow go binding, such as vectored
> > sum. [2]
> > >But arrow cpp lib doesn't leverage SIMD. [3]
> > >Why not optimize it in cpp lib so all languages can benefit?
> > You're welcome to contribute such optimizations to the C++ library
>
>
> Note that even though C++ doesn't use explicit SIMD intrinsics, often times
> the compiler will generate SIMD code because it can auto-vectorize the
> code.

Note it will likely be important to have explicit dynamic/runtime SIMD
dispatching on certain hot paths as we build binaries that need to be
able to run on both newer and older CPUs

> On Wed, Oct 30, 2019 at 7:25 AM Wes McKinney  wrote:
>
> > hi Yibo
> >
> > On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai  wrote:
> > >
> > > Hi,
> > >
> > > I'm new to Arrow and would like to ask for help with some questions. Any
> > comment is welcome.
> > >
> > > - About source code tree, my understanding is that "cpp" is the core arrow
> > libraries, "c_glib, go, python, ..." are language bindings to ease
> > integrating arrow into apps developed by that language. Is that correct?
> >
> > No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust
> >
> > * C/GLib, MATLAB, Python, R bind to C++
> > * Ruby binds to GLib
> >
> > > - Arrow implements many data types and aggregation functions(sum, mean,
> > ...). [1]
> > >IMO, more functions and types should be supported, like min/max,
> > vector/tensor operations, big number, etc. I'm not sure if this is in
> > arrow's scope, or the apps using arrow should deal with it themselves.
> >
> > Our objective at least in the C++ library is to have a generally
> > useful "standard library" that handles common application concerns.
> > Whether or not something is thought to be in scope may vary on a case
> > by case basis -- if you can't find a JIRA issue for something in
> > particular, please go ahead and open one.
> >
> > > - I see some SIMD optimizations in arrow go binding, such as vectored
> > sum. [2]
> > >But arrow cpp lib doesn't leverage SIMD. [3]
> > >Why not optimize it in cpp lib so all languages can benefit?
> >
> > You're welcome to contribute such optimizations to the C++ library
> >
> >
> > - Wes
> >
> > > [1]
> > https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels
> > > [2]
> > https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s
> > > [3]
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111
> > >
> > > Yibo
> >


Re: some questions, please help

2019-10-30 Thread Wes McKinney
hi Yibo

On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai  wrote:
>
> Hi,
>
> I'm new to Arrow and would like to ask for help with some questions. Any 
> comment is welcome.
>
> - About source code tree, my understanding is that "cpp" is the core arrow 
> libraries, "c_glib, go, python, ..." are language bindings to ease 
> integrating arrow into apps developed by that language. Is that correct?

No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust

* C/GLib, MATLAB, Python, R bind to C++
* Ruby binds to GLib

> - Arrow implements many data types and aggregation functions(sum, mean, ...). 
> [1]
>IMO, more functions and types should be supported, like min/max, 
> vector/tensor operations, big number, etc. I'm not sure if this is in arrow's 
> scope, or the apps using arrow should deal with it themselves.

Our objective at least in the C++ library is to have a generally
useful "standard library" that handles common application concerns.
Whether or not something is thought to be in scope may vary on a case
by case basis -- if you can't find a JIRA issue for something in
particular, please go ahead and open one.

> - I see some SIMD optimizations in arrow go binding, such as vectored sum. [2]
>But arrow cpp lib doesn't leverage SIMD. [3]
>Why not optimize it in cpp lib so all languages can benefit?

You're welcome to contribute such optimizations to the C++ library


- Wes

> [1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels
> [2] 
> https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s
> [3] 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111
>
> Yibo


[jira] [Created] (ARROW-7031) [Python] Expose the offsets of a ListArray in python

2019-10-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7031:


 Summary: [Python] Expose the offsets of a ListArray in python
 Key: ARROW-7031
 URL: https://issues.apache.org/jira/browse/ARROW-7031
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Assume the following ListArray:

{code}
In [1]: arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])

In [2]: arr
Out[2]:
[
  [
1,
2,
3
  ],
  [
4,
5
  ]
]
{code}

You can get the actual values as a flat array through {{.values}} / 
{{.flatten()}}, but there is currently no easy way to get back to the offsets 
(except by interpreting the buffers manually). 

We should probably add an {{offsets}} attribute (there is actually also a TODO 
comment for that).
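
For reference, the C++ {{ListArray}} already exposes these offsets; a small 
sketch reproducing the example above in C++:

{code}
#include <arrow/api.h>
#include <cstdlib>
#include <iostream>

int main() {
  auto check = [](const arrow::Status& s) { if (!s.ok()) std::abort(); };
  auto pool = arrow::default_memory_pool();
  auto values = std::make_shared<arrow::Int64Builder>(pool);
  arrow::ListBuilder builder(pool, values);
  // build [[1, 2, 3], [4, 5]]
  check(builder.Append());
  check(values->AppendValues({1, 2, 3}));
  check(builder.Append());
  check(values->AppendValues({4, 5}));
  std::shared_ptr<arrow::Array> array;
  check(builder.Finish(&array));

  auto list_array = std::static_pointer_cast<arrow::ListArray>(array);
  const int32_t* offsets = list_array->raw_value_offsets();
  for (int64_t i = 0; i <= list_array->length(); ++i) {
    std::cout << offsets[i] << " ";  // prints: 0 3 5
  }
  return 0;
}
{code}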



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: State of decimal support in Arrow (from/to Parquet Decimal Logicaltype)

2019-10-30 Thread roman.karlstetter
Hi Wes,

the data is indeed not originating from Arrow, so I was looking into how to call 
the low-level WriteBatch API. I've figured it out now; it's actually 
straightforward in the Arrow API, I just got a little confused by the spec at 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#DECIMAL

So for future reference: I multiply each value in a floating point array with 
pow(10, scale) and pass the resulting array (in my case: int32_t) directly to 
WriteBatch().

One thing I can imagine that could make the API a little easier to use: provide 
a function that directly takes an array of floats or doubles and does the 
conversion internally. But it's not really needed, so it's probably not worth 
adding.

Thanks for your help and sorry for the annoyance,
Roman


-Original Message-
From: Wes McKinney 
Sent: Tuesday, October 29, 2019 16:19
To: dev 
Subject: Re: State of decimal support in Arrow (from/to Parquet Decimal 
Logicaltype)

It depends on the origin of your data.

If your data is not originating from Arrow, then it may be better to produce an 
array of FixedLenByteArray and pass that to the low level WriteBatch API. If 
you would like some other API, please feel free to propose something.

On Tue, Oct 29, 2019 at 10:13 AM  wrote:
>
> Hi Wes,
>
> that was a bit unclear, sorry for that. With "an array", I'm referring to a 
> plain c++-type array, i.e. an array of float, uint32_t, ...
> This means that I do not use the arrow::Array-based write API, but I use the 
> TypedColumnWriter::WriteBatch() function directly and do not have any arrow 
> arrays. Are there any advantages of not using the writebatch directly and 
> instead using arrow::Arrays?
>
> Thanks,
> Roman
>
> -Original Message-
> From: Wes McKinney 
> Sent: Tuesday, October 29, 2019 15:59
> To: dev 
> Subject: Re: State of decimal support in Arrow (from/to Parquet 
> Decimal Logicaltype)
>
> On Tue, Oct 29, 2019 at 3:11 AM  wrote:
> >
> > Hi Wes,
> >
> > thanks for the response. There's one thing that is still a little unclear 
> > to me:
> > I had a look at the code for the function WriteArrowSerialize<FLBAType, 
> > arrow::Decimal128Type> in the reference you provided. I don't have arrow 
> > data in the first place, but as I understand it, I need to have an array of 
> > FixedLenByteArray objects which then point to the actual decimal values in 
> > the big_endian_values buffer. Is this the only way to write decimal types 
> > or is it also possible to directly provide an array with values to 
> > writeBatch()?
> >
>
> Could you clarify what you mean by "an array"? If you use the 
> arrow::Array-based write API then it will invoke this serializer 
> specialization
>
> https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497ca
> 585ae1c/cpp/src/parquet/column_writer.cc#L1569
>
> That's what we're calling (if I'm not mistaken, since I just worked on 
> this code recently) when writing arrow::Decimal128Array. If you set a 
> breakpoint with gdb there you can see the call stack
>
> > For the issues, I also found 
> > https://issues.apache.org/jira/browse/ARROW-6990, but I'm not sure if this 
> > is also related to the issues you created.
> >
> > Thanks,
> > Roman
> >
> > -Original Message-
> > From: Wes McKinney 
> > Sent: Monday, October 28, 2019 21:11
> > To: dev 
> > Subject: Re: State of decimal support in Arrow (from/to Parquet 
> > Decimal Logicaltype)
> >
> > hi Roman,
> >
> > On Mon, Oct 28, 2019 at 5:56 AM  wrote:
> > >
> > > Hi everyone,
> > >
> > >
> > >
> > > I have a question about the state of decimal support in Arrow when 
> > > reading from/writing to Parquet.
> > >
> > > *   Is writing decimals to parquet supposed to work? Are there any
> > > examples on how to do this in C++?
> >
> > Yes, it's supported, the details are here
> >
> > https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497ca585ae1c/cpp/src/parquet/column_writer.cc#L1511
> >
> > > *   When reading decimals in a Parquet file with pyarrow and
> > > converting the resulting table to a pandas dataframe, the datatype in
> > > the cells is "object". As a consequence, performance when doing
> > > analysis on this table is suboptimal. Can I somehow directly get the
> > > decimals from the Parquet file into floats/doubles in a pandas
> > > dataframe?
> >
> > Some work will be required. The cleanest way would be to cast
> > decimal128 columns to float32/float64 prior to converting to pandas.
> >
> > I didn't see an issue for this right away so I opened
> >
> > https://issues.apache.org/jira/browse/ARROW-7010
> >
> > I also opened
> >
> > https://issues.apache.org/jira/browse/ARROW-7011
> >
> > about going the other way. This would be a useful thing to contribute to 
> > the project.
> >
> > Thanks
> > Wes
> >
> > >
> > >
> > > Thanks in advance,
> > >
> > > Roman
> > >
> > >
> > >
> >
>



[jira] [Created] (ARROW-7030) csv example coredump error

2019-10-30 Thread wjw (Jira)
wjw created ARROW-7030:
--

 Summary: csv example coredump error
 Key: ARROW-7030
 URL: https://issues.apache.org/jira/browse/ARROW-7030
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.15.0
 Environment: g++:7.3.1
Reporter: wjw


I am trying to write an example for reading CSV with Apache Arrow in C++,
following the official one (https://arrow.apache.org/docs/cpp/csv.html#), but
it hits a segmentation fault at `status = reader->Read(&table);`.

Can anyone help? Thank you~

 

environment info:
`g++:7.3.1`

make command:
`c++ -g -std=c++11 -Wall -O2 test.cpp -o test -I../../arrow/src 
-L../../arrow/lib -larrow -lparquet -Wl,-rpath,./`

code info:
```
arrow::Status status;
arrow::MemoryPool* pool = arrow::default_memory_pool();
std::shared_ptr<arrow::io::InputStream> input;
std::string csv_file = "test.csv";
auto input_readable = std::dynamic_pointer_cast<arrow::io::ReadableFile>(input);
PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(csv_file, pool, &input_readable));

auto read_options = arrow::csv::ReadOptions::Defaults();
read_options.use_threads = false;
read_options.column_names.emplace_back("name");
read_options.column_names.emplace_back("age");

auto parse_options = arrow::csv::ParseOptions::Defaults();

auto convert_options = arrow::csv::ConvertOptions::Defaults();
convert_options.include_missing_columns = true;

std::shared_ptr<arrow::csv::TableReader> reader;
status = arrow::csv::TableReader::Make(pool, input, read_options,
                                       parse_options, convert_options,
                                       &reader);
if (!status.ok())
{
    std::cout << "make csv table error" << std::endl;
    return -1;
}
std::shared_ptr<arrow::Table> table;
status = reader->Read(&table);
if (!status.ok())
{
    std::cout << "read csv table error" << std::endl;
    return -1;
}
```

coredump info:
```
Program terminated with signal 11, Segmentation fault.
#0 0x7fe4fcda83e7 in 
arrow::io::internal::ReadaheadSpooler::Impl::WorkerLoop() () from 
./libarrow.so.15
(gdb) bt
#0 0x7fe4fcda83e7 in 
arrow::io::internal::ReadaheadSpooler::Impl::WorkerLoop() () from 
./libarrow.so.15
#1 0x7fe4fd405a2f in execute_native_thread_routine () from ./libarrow.so.15
#2 0x7fe4fa8ecdf3 in start_thread () from /lib64/libpthread.so.0
#3 0x7fe4fb86e1bd in clone () from /lib64/libc.so.6
```
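
A possible cause, judging from the snippet above: `input` is never assigned
(the file is opened into `input_readable`), so `arrow::csv::TableReader::Make`
receives a null `InputStream`, which the readahead worker then dereferences.
A minimal sketch of a corrected open step (assuming the 0.15.0 API):

```
// Open the file once and hand that same stream to the CSV reader.
std::shared_ptr<arrow::io::ReadableFile> file;
PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(csv_file, pool, &file));
// ReadableFile derives from InputStream, so it can be passed to Make directly.
std::shared_ptr<arrow::io::InputStream> input = file;
// ... then call arrow::csv::TableReader::Make(pool, input, ...) as above.
```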



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7029) [Go] unsafe pointer arithmetic panic w/ Go-1.14-dev

2019-10-30 Thread Sebastien Binet (Jira)
Sebastien Binet created ARROW-7029:
--

 Summary: [Go] unsafe pointer arithmetic panic w/ Go-1.14-dev
 Key: ARROW-7029
 URL: https://issues.apache.org/jira/browse/ARROW-7029
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Go
Reporter: Sebastien Binet


Go-1.14 (to be released in Feb-2020) has a new analysis pass (enabled with 
-race) that checks for unsafe pointer arithmetic:

~~

go test -race -run=Example_minimal .
--- FAIL: Example_minimal (0.00s)
panic: runtime error: unsafe pointer arithmetic [recovered]
 panic: runtime error: unsafe pointer arithmetic

goroutine 1 [running]:
testing.(*InternalExample).processRunResult(0xcadc80, 0x0, 0x0, 0x8927, 
0x90a400, 0xcb62c0, 0xca48c8)
 /home/binet/sdk/go/src/testing/example.go:89 +0x71f
testing.runExample.func2(0xbf6675c29511646d, 0x20fb5f, 0xc7f780, 0xca2378, 
0xca2008, 0xc86360, 0xcadc80, 0xcadcb0)
 /home/binet/sdk/go/src/testing/run_example.go:58 +0x143
panic(0x90a400, 0xcb62c0)
 /home/binet/sdk/go/src/runtime/panic.go:915 +0x370
github.com/apache/arrow/go/arrow/memory.memory_memset_avx2(0xc9e200, 0x40, 
0x40, 0xc9c000)
 
/home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/memory/memory_avx2_amd64.go:33
 +0xa4
github.com/apache/arrow/go/arrow/memory.Set(...)
 /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/memory/memory.go:25
github.com/apache/arrow/go/arrow/array.(*builder).init(0xc84600, 0x20)
 
/home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/builder.go:101
 +0x23a
github.com/apache/arrow/go/arrow/array.(*Int64Builder).init(0xc84600, 0x20)
 
/home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:102
 +0x60
github.com/apache/arrow/go/arrow/array.(*Int64Builder).Resize(0xc84600, 0x2)
 
/home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:125
 +0x8c
github.com/apache/arrow/go/arrow/array.(*builder).reserve(0xc84600, 0x1, 
0xcad918)
 
/home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/builder.go:138
 +0xdc
github.com/apache/arrow/go/arrow/array.(*Int64Builder).Reserve(0xc84600, 
0x1)
 
/home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:113
 +0x68
github.com/apache/arrow/go/arrow/array.(*Int64Builder).Append(0xc84600, 0x1)
 
/home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:60
 +0x46
github.com/apache/arrow/go/arrow_test.Example_minimal()
 /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/example_test.go:39 
+0x153
testing.runExample(0x94714a, 0xf, 0x95d8a8, 0x957614, 0x83, 0x0, 0x0)
 /home/binet/sdk/go/src/testing/run_example.go:62 +0x275
testing.runExamples(0xcaded8, 0xc7a2e0, 0xb, 0xb, 0x100)
 /home/binet/sdk/go/src/testing/example.go:44 +0x212
testing.(*M).Run(0xc00010, 0x0)
 /home/binet/sdk/go/src/testing/testing.go:1125 +0x3b4
main.main()
 _testmain.go:130 +0x224
FAIL github.com/apache/arrow/go/arrow 0.009s
FAIL

~~

 

see:

[https://groups.google.com/forum/#!msg/golang-dev/SzwDoqoRVJA/IvtnBW5oDwAJ]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7028) Dates in R are different when saved and loaded with parquet

2019-10-30 Thread Sascha (Jira)
Sascha created ARROW-7028:
-

 Summary: Dates in R are different when saved and loaded with 
parquet
 Key: ARROW-7028
 URL: https://issues.apache.org/jira/browse/ARROW-7028
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.15.0
Reporter: Sascha


When saving R data frames with Parquet and loading them again, the internal
representation of Dates changes, leading e.g. to errors when comparing them in
dplyr::if_else.

``` r

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#> filter, lag
#> The following objects are masked from 'package:base':
#> 
#> intersect, setdiff, setequal, union

tmp = tempdir()
dat = tibble(tag = as.Date("2018-01-01"))
dat2 = tibble(tag2 = as.Date("2019-01-01"))

arrow::write_parquet(dat, file.path(tmp, "dat.parquet"))
dat = arrow::read_parquet(file.path(tmp, "dat.parquet"))

typeof(dat$tag)
#> [1] "integer"
typeof(dat2$tag2)
#> [1] "double"

bind_cols(dat, dat2) %>%
 mutate(comparison = if_else(TRUE, tag, tag2))
#> `false` must be a `Date` object, not a `Date` object
```

Created on 2019-10-30 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7027) [Python] pa.table(..) returns instead of raises error if passing invalid object

2019-10-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7027:


 Summary: [Python] pa.table(..) returns instead of raises error if 
passing invalid object
 Key: ARROW-7027
 URL: https://issues.apache.org/jira/browse/ARROW-7027
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


When passing e.g. a Series instead of a DataFrame, you get:

{code}
In [4]: df = pd.DataFrame({'a': [1, 2, 3]})

In [5]: table = pa.table(df['a'])

In [6]: table
Out[6]: TypeError('Expected pandas DataFrame or python dictionary')

In [7]: type(table)
Out[7]: TypeError
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


some questions, please help

2019-10-30 Thread Yibo Cai

Hi,

I'm new to Arrow and would like to ask for help with some questions. Any
comments are welcome.

- About the source code tree, my understanding is that "cpp" contains the core
Arrow libraries, while "c_glib, go, python, ..." are language bindings that
ease integrating Arrow into apps developed in those languages. Is that correct?

- Arrow implements many data types and aggregation functions (sum, mean, ...). [1]
  IMO, more functions and types should be supported, like min/max,
vector/tensor operations, big numbers, etc. I'm not sure whether this is in
Arrow's scope, or whether the apps using Arrow should deal with it themselves.

- I see some SIMD optimizations in the Arrow Go implementation, such as a
vectorized sum. [2]
  But the Arrow C++ library doesn't leverage SIMD. [3]
  Why not optimize it in the C++ library so all languages can benefit? (A
sketch of such a vector-friendly loop follows the links below.)

[1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels
[2] 
https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s
[3] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111
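
To illustrate the third question, here is a purely illustrative sketch (not
existing Arrow code) of a sum loop written with independent accumulators so
the compiler can vectorize it, which is the effect the Go AVX2 kernel
achieves by hand:

```cpp
#include <cstddef>

// Four independent accumulator chains let the compiler pack them into one
// SIMD register (e.g. at -O3), without having to reassociate a single serial
// floating-point dependency chain.
double SumDoubles(const double* values, std::size_t n) {
  double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    acc0 += values[i];
    acc1 += values[i + 1];
    acc2 += values[i + 2];
    acc3 += values[i + 3];
  }
  double total = (acc0 + acc1) + (acc2 + acc3);
  for (; i < n; ++i) total += values[i];  // scalar tail
  return total;
}
```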

Yibo


[jira] [Created] (ARROW-7026) [Java] Remove assertions in MessageSerializer/vector/writer/reader

2019-10-30 Thread Ji Liu (Jira)
Ji Liu created ARROW-7026:
-

 Summary: [Java] Remove assertions in 
MessageSerializer/vector/writer/reader
 Key: ARROW-7026
 URL: https://issues.apache.org/jira/browse/ARROW-7026
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently assertions exist in many classes like
{{MessageSerializer/JsonReader/JsonWriter/ListVector}} etc.

i. If assertions are not enabled via JVM arguments (-ea), these checks are
skipped and can lead to potential problems.

ii. Java errors produced by failed assertions ({{AssertionError}}) are not
caught by traditional catch clauses.

To fix this, use {{Preconditions}} checks instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)