[jira] [Created] (ARROW-7037) [C++] Compile error on the combination of protobuf >= 3.9 and clang
Kenta Murata created ARROW-7037: --- Summary: [C++ ] Compile error on the combination of protobuf >= 3.9 and clang Key: ARROW-7037 URL: https://issues.apache.org/jira/browse/ARROW-7037 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Kenta Murata Assignee: Kenta Murata I encountered the following compile error on the combination of protobuf 3.10.0 and clang (Xcode 11). {noformat} [13/26] Building CXX object c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o FAILED: c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o /Applications/Xcode_11.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -Ic++/include -I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include -I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src -Ic++/src -isystem c++/libs/thirdparty/zlib_ep-install/include -isystem c++/libs/thirdparty/lz4_ep-install/include -Qunused-arguments -fcolor-diagnostics -ggdb -O0 -g -fPIC -Wno-zero-as-null-pointer-constant -Wno-inconsistent-missing-destructor-override -Wno-error=undef -std=c++11 -Weverything -Wno-c++98-compat -Wno-missing-prototypes -Wno-c++98-compat-pedantic -Wno-padded -Wno-covered-switch-default -Wno-missing-noreturn -Wno-unknown-pragmas -Wno-gnu-zero-variadic-macro-arguments -Wconversion -Wno-c++2a-compat -Werror -std=c++11 -Weverything -Wno-c++98-compat -Wno-missing-prototypes -Wno-c++98-compat-pedantic -Wno-padded -Wno-covered-switch-default -Wno-missing-noreturn -Wno-unknown-pragmas -Wno-gnu-zero-variadic-macro-arguments -Wconversion -Wno-c++2a-compat -Werror -O0 -g -MD -MT c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o -MF c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o.d -o c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o -c /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc In file included from 
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc:44: c++/src/orc_proto.pb.cc:959:145: error: possible misuse of comma operator here [-Werror,-Wcomma] static bool dynamic_init_dummy_orc_5fproto_2eproto = ( ::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto), true); ^ c++/src/orc_proto.pb.cc:959:57: note: cast expression to void to silence warning static bool dynamic_init_dummy_orc_5fproto_2eproto = ( ::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto), true); ^~~~ static_cast<void>( ) 1 error generated. {noformat} This may be due to a protobuf bug, filed as https://github.com/protocolbuffers/protobuf/issues/6619. -- This message was sent by Atlassian Jira (v8.3.4#803005)
questions about Gandiva
Hi, Arrow C++ integrates Gandiva to provide low-level operations on Arrow buffers. [1][2] I have some questions; any help is appreciated: - Arrow C++ already has compute kernels[3]; do they duplicate what Gandiva provides? I see a JIRA discussing it.[4] - Is Gandiva only for Arrow C++? What about other languages (Go, Rust, ...)? - Gandiva leverages SIMD for vectorized operations[1], but I didn't see any related code. Am I missing something? [1] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/ [2] https://github.com/apache/arrow/tree/master/cpp/src/gandiva [3] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute [4] https://issues.apache.org/jira/browse/ARROW-7017 Thanks, Yibo
[jira] [Created] (ARROW-7036) [C++] Version up ORC to avoid compile errors
Kenta Murata created ARROW-7036: --- Summary: [C++] Version up ORC to avoid compile errors Key: ARROW-7036 URL: https://issues.apache.org/jira/browse/ARROW-7036 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Kenta Murata Assignee: Kenta Murata I encountered compile errors due to {{-Wshadow-field}}, like the following: {noformat} [1/4] Building CXX object c++/src/CMakeFiles/orc.dir/Vector.cc.o FAILED: c++/src/CMakeFiles/orc.dir/Vector.cc.o /Applications/Xcode_11.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -Ic++/include -I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include -I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src -Ic++/src -isystem c++/libs/thirdparty/zlib_ep-install/include -isystem c++/libs/thirdparty/lz4_ep-install/include -Qunused-arguments -fcolor-diagnostics -ggdb -O0 -g -fPIC -Wno-zero-as-null-pointer-constant -Wno-inconsistent-missing-destructor-override -Wno-error=undef -std=c++11 -Weverything -Wno-c++98-compat -Wno-missing-prototypes -Wno-c++98-compat-pedantic -Wno-padded -Wno-covered-switch-default -Wno-missing-noreturn -Wno-unknown-pragmas -Wno-gnu-zero-variadic-macro-arguments -Wconversion -Werror -std=c++11 -Weverything -Wno-c++98-compat -Wno-missing-prototypes -Wno-c++98-compat-pedantic -Wno-padded -Wno-covered-switch-default -Wno-missing-noreturn -Wno-unknown-pragmas -Wno-gnu-zero-variadic-macro-arguments -Wconversion -Werror -O0 -g -MD -MT c++/src/CMakeFiles/orc.dir/Vector.cc.o -MF c++/src/CMakeFiles/orc.dir/Vector.cc.o.d -o c++/src/CMakeFiles/orc.dir/Vector.cc.o -c /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:59:45: error: parameter 'capacity' shadows member inherited from type 'ColumnVectorBatch' [-Werror,-Wshadow-field] LongVectorBatch::LongVectorBatch(uint64_t
capacity, MemoryPool& pool ^ /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14: note: declared here uint64_t capacity; ^ /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:87:49: error: parameter 'capacity' shadows member inherited from type 'ColumnVectorBatch' [-Werror,-Wshadow-field] DoubleVectorBatch::DoubleVectorBatch(uint64_t capacity, MemoryPool& pool ^ /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14: note: declared here uint64_t capacity; ^ /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:115:49: error: parameter 'capacity' shadows member inherited from type 'ColumnVectorBatch' [-Werror,-Wshadow-field] StringVectorBatch::StringVectorBatch(uint64_t capacity, MemoryPool& pool ^ /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14: note: declared here uint64_t capacity; ^ /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:407:55: error: parameter 'capacity' shadows member inherited from type 'ColumnVectorBatch' [-Werror,-Wshadow-field] TimestampVectorBatch::TimestampVectorBatch(uint64_t capacity, ^ /Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14: note: declared here uint64_t capacity; ^ 4 errors generated. {noformat} Upgrading ORC to 1.5.7 will fix these errors. I used Xcode 11.1 on macOS Mojave. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: some questions, please help
Thanks Wes, Micah, your comments are very helpful. Yibo On 10/30/19 10:45 PM, Wes McKinney wrote: On Wed, Oct 30, 2019 at 9:32 AM Micah Kornfield wrote: - I see some SIMD optimizations in the Arrow Go binding, such as a vectorized sum. [2] But the Arrow C++ lib doesn't leverage SIMD. [3] Why not optimize it in the C++ lib so all languages can benefit? You're welcome to contribute such optimizations to the C++ library. Note that even though C++ doesn't use explicit SIMD intrinsics, oftentimes the compiler will generate SIMD code because it can auto-vectorize the code. Note it will likely be important to have explicit dynamic/runtime SIMD dispatching on certain hot paths as we build binaries that need to be able to run on both newer and older CPUs. On Wed, Oct 30, 2019 at 7:25 AM Wes McKinney wrote: hi Yibo On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai wrote: Hi, I'm new to Arrow. I would like to seek help with some questions. Any comments are welcome. - About the source code tree, my understanding is that "cpp" is the core Arrow libraries, and "c_glib, go, python, ..." are language bindings to ease integrating Arrow into apps developed in that language. Is that correct? No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust. * C/GLib, MATLAB, Python, R bind to C++ * Ruby binds to GLib - Arrow implements many data types and aggregation functions (sum, mean, ...). [1] IMO, more functions and types should be supported, like min/max, vector/tensor operations, big numbers, etc. I'm not sure if this is in Arrow's scope, or whether apps using Arrow should deal with it themselves. Our objective, at least in the C++ library, is to have a generally useful "standard library" that handles common application concerns. Whether or not something is thought to be in scope may vary on a case-by-case basis -- if you can't find a JIRA issue for something in particular, please go ahead and open one. - I see some SIMD optimizations in the Arrow Go binding, such as a vectorized sum.
[2] But the Arrow C++ lib doesn't leverage SIMD. [3] Why not optimize it in the C++ lib so all languages can benefit? You're welcome to contribute such optimizations to the C++ library. - Wes [1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels [2] https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111 Yibo
Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-30-0
https://issues.apache.org/jira/browse/ARROW-7034 (pending +1/merge) will rid us of these meddlesome failures. Neal On Wed, Oct 30, 2019 at 11:25 AM Wes McKinney wrote: > > The failed tasks here are a nuisance. If they can't be fixed, should > they be removed from the nightlies? > > On Wed, Oct 30, 2019 at 7:26 AM Crossbow wrote: > > > > > > Arrow Build Report for Job nightly-2019-10-30-0 > > > > All tasks: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0 > > > > Failed Tasks: > > - docker-clang-format: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-clang-format > > - docker-r-sanitizer: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-r-sanitizer > > > > Succeeded Tasks: > > - centos-6: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-6 > > - centos-7: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-7 > > - centos-8: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-8 > > - conda-linux-gcc-py27: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py27 > > - conda-linux-gcc-py36: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py36 > > - conda-linux-gcc-py37: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py37 > > - conda-osx-clang-py27: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py27 > > - conda-osx-clang-py36: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py36 > > - conda-osx-clang-py37: > > URL: > > 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py37 > > - conda-win-vs2015-py36: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-win-vs2015-py36 > > - conda-win-vs2015-py37: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-win-vs2015-py37 > > - debian-buster: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-debian-buster > > - debian-stretch: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-debian-stretch > > - docker-c_glib: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-c_glib > > - docker-cpp-cmake32: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-cmake32 > > - docker-cpp-release: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-release > > - docker-cpp-static-only: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-static-only > > - docker-cpp: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp > > - docker-dask-integration: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-dask-integration > > - docker-docs: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-docs > > - docker-go: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-go > > - docker-hdfs-integration: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-hdfs-integration > > - docker-iwyu: > > URL: > > 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-iwyu > > - docker-java: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-java > > - docker-js: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-js > > - docker-lint: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-lint > > - docker-pandas-master: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-pandas-master > > - docker-python-2.7-nopandas: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-2.7-nopandas > > - docker-python-2.7: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-2.7 > > - docker-python-3.6-nopandas: > > URL: > >
[jira] [Created] (ARROW-7035) [R] Default arguments are unclear in write_parquet docs
Karl Dunkle Werner created ARROW-7035: - Summary: [R] Default arguments are unclear in write_parquet docs Key: ARROW-7035 URL: https://issues.apache.org/jira/browse/ARROW-7035 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 0.15.0 Environment: Ubuntu with libparquet-dev 0.15.0-1, R 3.6.1, and arrow 0.15.0. Reporter: Karl Dunkle Werner Fix For: 0.15.1 Thank you so much for adding support for reading and writing parquet files in R! I have a few questions about the user interface and optional arguments, but I want to highlight how great it is to have this useful filetype to pass data back and forth. The defaults for the optional arguments in {{arrow::write_parquet}} aren't always clear. Here were my questions after reading the help docs from {{write_parquet}}: * What's the default {{version}}? Should a user prefer "2.0" for new projects? * What are acceptable values for {{compression}}? (Answer: {{uncompressed}}, {{snappy}}, {{gzip}}, {{brotli}}, {{zstd}}, or {{lz4}}.) * What's the default for {{use_dictionary}}? Seems to be {{TRUE}}, at least some of the time. * What's the default for {{write_statistics}}? Should a user prefer {{TRUE}}? * Can I assume {{allow_truncated_timestamps}} is {{FALSE}} by default? As someone who works in both R and Python, I was a little surprised when pyarrow uses snappy compression by default, but R's default is uncompressed. My preference would be having the same default arguments, but that might be a fringe use-case. While I was digging into this, I was surprised that {{ParquetReaderProperties}} is exported and documented, but {{ParquetWriterProperties}} isn't. Is that intentional? Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7034) [CI][Crossbow] Skip known nightly failures
Neal Richardson created ARROW-7034: -- Summary: [CI][Crossbow] Skip known nightly failures Key: ARROW-7034 URL: https://issues.apache.org/jira/browse/ARROW-7034 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Neal Richardson Assignee: Neal Richardson The failures are ticketed. There's no point in running them if we know they're failing. The patches that fix the builds can add them back to the nightly list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS] Result vs Status
Returning to this discussion. Here is my position on the matter, since this was brought up on the sync call today: * For internal / non-public and pseudo-non-public APIs that have return/out values - Use Result or Status at the discretion of the developer, but Result is preferable * For new public APIs with return/out values - Prefer Result unless a Status-based API seems definitely less awkward in real-world use. I have to say that I'm skeptical about the relative usability of std::tuple outputs and don't think we should force the use of Result for technical purity reasons * For existing Status APIs with return values - Incrementally add Result APIs and deprecate Status-based APIs. Maintain deprecated Status APIs for ~2 major releases. On Thu, Oct 24, 2019 at 5:16 PM Omer F. Ozarslan wrote: > > Hi Micah, > > You're right. Quite possible that clang-query counted the same function > separately for each include in each file. (I was iterating each file > separately, but providing all of them at once didn't change the result > either.) > > It's cool and wrong, so not very useful apparently. :-) > > Best, > Omer > > On Thu, Oct 24, 2019 at 4:51 PM Micah Kornfield wrote: > > > > Hi Omer, > > I think this is really cool. It is quite possible it was underestimated (I > > agree about line lengths), but I think the clang query is double counting > > somehow. > > > > For instance: > > > > "grep -r Status *" only returns ~9000 results in total for me. > > > > Similarly using grep for "FinishTyped" returns 18 results for me. > > Searching through the log that you linked seems to return 450 (for "Status > > FinishTyped"). > > > > It is quite possible, I'm doing something naive with grep. > > > > Thanks, > > Micah > > > > On Thu, Oct 24, 2019 at 2:41 PM Omer F. Ozarslan wrote: > >> > >> Forgot to mention most of those lines are longer than line width while > >> out is usually (always?) last parameter, so probably that's why grep > >> possibly underestimates their number.
> >> > >> On Thu, Oct 24, 2019 at 4:33 PM Omer F. Ozarslan wrote: > >> > > >> > Hi, > >> > > >> > I don't have much experience on customized clang-tidy plugins, but > >> > this might be a good use case for such a plugin from what I read here > >> > and there (frankly this was a good excuse for me to have a look at > >> > clang tooling as well). I wanted to ensure it isn't obviously overkill > >> > before this suggestion: Running a clang query which lists functions > >> > returning `arrow::Status` and taking a pointer parameter named `out` > >> > showed that there are 13947 such functions in `cpp/src/**/*.h`. [1] > >> > > >> > I checked logs and it seemed legitimate to me, but please check it in > >> > case I missed something. If that's the case, it might be tedious to do > >> > this work manually. > >> > > >> > [1]: https://gist.github.com/ozars/ecbb1b8acd4a57ba4721c1965f83f342 > >> > (Note that the log file is shown as truncated by github after ~30k > >> > lines) > >> > > >> > Best, > >> > Omer > >> > > >> > > >> > > >> > On Wed, Oct 23, 2019 at 9:23 PM Micah Kornfield > >> > wrote: > >> > > > >> > > OK, it sounds like people want Result (at least in some > >> > > circumstances). > >> > > Any thoughts on migrating old APIs and what to do for new APIs going > >> > > forward? > >> > > > >> > > A very rough approximation [1] yields the following counts by module: > >> > > > >> > > 853 arrow > >> > > > >> > > 17 gandiva > >> > > > >> > > 25 parquet > >> > > > >> > > 50 plasma > >> > > > >> > > > >> > > > >> > > [1] grep -r Status cpp/src/* |grep ".h:" | grep "\\*" |grep -v Accept > >> > > |sed > >> > > s/:.*// | cut -f3 -d/ |sort > >> > > > >> > > > >> > > Thanks, > >> > > > >> > > Micah > >> > > > >> > > > >> > > > >> > > On Sat, Oct 19, 2019 at 7:50 PM Francois Saint-Jacques < > >> > > fsaintjacq...@gmail.com> wrote: > >> > > > >> > > > As mentioned, Result is an improvement for function which returns > >> > > > a > >> > > > single value, e.g. Make/Factory-like. 
My vote goes Result for such > >> > > > case. For multiple return types, we have std::tuple like Antoine > >> > > > proposed. > >> > > > > >> > > > François > >> > > > > >> > > > On Fri, Oct 18, 2019 at 9:19 PM Antoine Pitrou > >> > > > wrote: > >> > > > > > >> > > > > > >> > > > > Le 18/10/2019 à 20:58, Wes McKinney a écrit : > >> > > > > > I'm definitely uncomfortable with the idea of deprecating Status. > >> > > > > > > >> > > > > > We have a few kinds of functions that can fail: > >> > > > > > > >> > > > > > 1. Functions with no "out" arguments > >> > > > > > 2. Functions with one out argument > >> > > > > > 3. Functions with multiple out arguments > >> > > > > > > >> > > > > > IMHO functions in category 2 are the best candidates for > >> > > > > > utilizing > >> > > > > > Status. In some cases, Case 3 may be more usable Result-based, > >> > > > > > but it > >> > > > > > can also create more work (or confusion) on the part of the > >> > > > > > developer, > >> > > >
Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding
I wrote in on the original DISCUSS thread. I believe Antoine is unavailable this week, but hopefully we can drive the discussion to a consensus point next week so we can vote. On Sat, Oct 26, 2019 at 12:01 AM Micah Kornfield wrote: > > I think at least the wording was confusing because you raised questions on > the PR and Antoine commented here. > > I agree with your analysis that it probably would not be hard to support. > But I don't feel too strongly either way on this particular point, aside from > coming to a resolution. If I had to choose I'd prefer allowing Delta > dictionaries in files. > > On Friday, October 25, 2019, Wes McKinney wrote: >> >> Can we discuss the delta dictionary issue a bit more? I admit I don't >> share those same concerns. >> >> From the perspective of a file and stream producer, the code paths >> should be nearly identical. The differences with the file format are: >> >> * Magic numbers to detect that it is the "file format" >> * Accumulated metadata at the footer >> >> If a file has any dictionaries at all, then they must all be >> reconstructed before reading a record batch. So let's say we have a >> file like: >> >> DICTIONARY ID=0, isDelta=FALSE >> BATCH 0 >> BATCH 1 >> BATCH 2 >> DICTIONARY ID=0, isDelta=TRUE >> BATCH 3 >> DICTIONARY ID=0, isDelta=TRUE >> BATCH 4 >> >> I do not see any harm in this -- the only downside is that you won't >> know what "state" the dictionary was in for the first 3 batches. >> Viewing dictionary encoding strictly as a data representation method, >> batches 0-2 and 3 represent the same data even if their in-memory >> dictionaries are larger than they were at the moment in which they >> were written. >> >> Note that the code path for "processing" the dictionaries as a first >> step will use the same code as the stream path.
It should not be a >> great deal of work to write test cases for this. >> >> On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield >> wrote: >> > >> > Hi Antoine, >> > There is a defined order for dictionaries in metadata. What isn't well >> > defined is relative ordering between record batches and Delta dictionaries. >> > >> > However, this point seems confusing. I can't think of a real-world use >> > case where it would be valuable enough to include, so I will remove Delta >> > dictionaries. >> > >> > So let's cancel this vote and I'll start a new one after the update. >> > >> > Thanks, >> > Micah >> > >> > On Thursday, October 24, 2019, Antoine Pitrou wrote: >> > >> > > >> > > Le 24/10/2019 à 04:39, Micah Kornfield a écrit : >> > > > >> > > > 3. Clarifies that the file format, can only contain 1 "NON-delta" >> > > > dictionary batch and multiple "delta" dictionary batches. >> > > >> > > This is a bit weird. If the file format can carry delta dictionaries, >> > > it means order is significant, so it may as well contain dictionary >> > > redefinitions. >> > > >> > > If the file format is meant to be truly readable in random order, then >> > > it should also forbid delta dictionaries. >> > > >> > > Regards >> > > >> > > Antoine. >> > >
Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing
Returning to this discussion, as there seems to be a lack of consensus in the vote thread. Copying Micah's proposals from the VOTE thread here, I wanted to state my opinions so we can discuss further and see where there is potential disagreement. 1. It is not required that all dictionary batches occur at the beginning of the IPC stream format (if the first record batch has an all-null dictionary-encoded column, the null column's dictionary might not be sent until later in the stream). This seems preferable to requiring a placeholder empty dictionary batch. This does mean more to test, but the integration tests will force the issue. 2. A second dictionary batch for the same ID that is not a "delta batch" in an IPC stream indicates the dictionary should be replaced. Agree. 3. Clarifies that the file format can only contain 1 "NON-delta" dictionary batch and multiple "delta" dictionary batches. Agree -- it is also worth stating explicitly that dictionary replacements are not allowed in the file format. In the file format, all the dictionaries must be "loaded" up front. The code path for loading the dictionaries ideally should use nearly the same code as the stream-reader code that sees follow-up dictionary batches interspersed in the stream. The only downside is that it will not be possible to exactly preserve the dictionary "state" as of each record batch being written. So if we had a file containing DICTIONARY ID=0 RECORD BATCH RECORD BATCH DICTIONARY DELTA ID=0 RECORD BATCH RECORD BATCH then after processing/loading the dictionaries, the first two record batches will have a dictionary that is "larger" (on account of the delta) than when they were written. Since dictionaries are fundamentally about data representation, they still represent the same data, so I think this is acceptable. 4. Add an enum to dictionary metadata for possible future changes in what format dictionary batches can be sent (the most likely would be an array Map).
An enum is needed as a placeholder to allow for forward compatibility past the 1.0.0 release. I'm least sure about this, but I do not think it is harmful to have a forward-compatible "escape hatch" for future evolutions in dictionary encoding. On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield wrote: > > I'll plan on starting a vote in the next day or two if there are no further > objections/comments. > > On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield > wrote: > > > I think the only point asked on the PR that I think is worth discussing is > > assumptions about dictionaries at the beginning of streams. > > > > There are two options: > > 1. Based on the current wording, it does not seem that all dictionaries > > need to be at the beginning of the stream if they aren't made use of in the > > first record batch (i.e. a dictionary-encoded column is all null in the > > first record batch). > > 2. We require a dictionary batch for each dictionary at the beginning of > > the stream (and require implementations to send an empty batch if they > > don't have the dictionary available). > > > > The current proposal in the PR is option #1. > > > > Thanks, > > Micah > > > > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield > > wrote: > > > >> I've opened a pull request [1] to clarify some recent conversations about > >> semantics/edge cases for dictionary encoding [2][3] around interleaved > >> batches and when isDelta=False. > >> > >> Specifically, it proposes that isDelta=False indicates dictionary > >> replacement. For the file format, only one isDelta=False batch is allowed > >> per file, and isDelta=true batches are applied in the order supplied in the file > >> footer. > >> > >> In addition, I've added a new enum to DictionaryEncoding to preserve > >> future compatibility in case we want to expand dictionary encoding to be an > >> explicit mapping from "ID" to "VALUE" as discussed in [4]. > >> > >> Once people have had a chance to review and come to a consensus.
I will > >> call a formal vote to approve and commit the change. > >> > >> Thanks, > >> Micah > >> > >> [1] https://github.com/apache/arrow/pull/5585 > >> [2] > >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E > >> [3] > >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E > >> [4] > >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E > >> > >>
[jira] [Created] (ARROW-7033) Error in ./configure step for jemalloc when building on OSX 10.14.6
Christian Hudon created ARROW-7033: -- Summary: Error in ./configure step for jemalloc when building on OSX 10.14.6 Key: ARROW-7033 URL: https://issues.apache.org/jira/browse/ARROW-7033 Project: Apache Arrow Issue Type: Bug Reporter: Christian Hudon Hello. I'm trying to build the C++ part of Apache Arrow (as a first step to possible contributions). I'm following the C++ Development instructions, but running into an error early. I also looked at ARROW-4935, but the cause there seems different, so I'm opening a new bug report. I'm on MacOS 10.14.6. I have the XCode cli tools installed (via xcode-select), and installed the other dependencies with Homebrew, giving it the cpp/Brewfile. I want to be able to run the tests, so I'm configuring a debug build with: cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON .. from an out-of-source build, in a cpp/debug directory. Then, running make, I very quickly get the following error: {{$ make}} {{[ 0%] Performing configure step for 'jemalloc_ep'}} {{CMake Error at /Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-configure-DEBUG.cmake:49 (message):}} {{ Command failed: 1}}{{'./configure' 'AR=/Library/Developer/CommandLineTools/usr/bin/ar' 'CC=/Library/Developer/CommandLineTools/usr/bin/cc' '--prefix=/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep/dist/' '--with-jemalloc-prefix=je_arrow_' '--with-private-namespace=je_arrow_private_' '--without-export' '--disable-cxx' '--disable-libdl' '--disable-initial-exec-tls'}}{{See also}}{{/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-configure-*.log}} {{make[2]: *** [jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-configure] Error 1}} {{make[1]: *** [CMakeFiles/jemalloc_ep.dir/all] Error 2}} {{make: *** [all] Error 2}} {{Looking into the log file as suggested, I see:}} configure: error: in `/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep': configure: error:
cannot run C compiled programs. If you meant to cross compile, use `--host'. See `config.log' for more details ... which seems a bit suspicious. Running the ./configure invocation manually, I get the same error: {{$ './configure' 'AR=/Library/Developer/CommandLineTools/usr/bin/ar' 'CC=/Library/Developer/CommandLineTools/usr/bin/cc' '--prefix=/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep/dist/' '--with-jemalloc-prefix=je_arrow_' '--with-private-namespace=je_arrow_private_' '--without-export' '--disable-cxx' '--disable-libdl' '--disable-initial-exec-tls'}} {{checking for xsltproc... /usr/bin/xsltproc}} {{checking for gcc... /Library/Developer/CommandLineTools/usr/bin/cc}} {{checking whether the C compiler works... yes}} {{checking for C compiler default output file name... a.out}} {{checking for suffix of executables...}} {{checking whether we are cross compiling... configure: error: in `/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep':}} {{configure: error: cannot run C compiled programs.}} {{If you meant to cross compile, use `--host'.}} {{See `config.log' for more details}} {{Digging into config.log, I see:}} configure:3213: checking whether we are cross compiling *configure:3221: /Library/Developer/CommandLineTools/usr/bin/cc -o conftest conftest.c >&5* *conftest.c:9:10: fatal error: 'stdio.h' file not found* #include <stdio.h> ^ 1 error generated. configure:3225: $? = 1 configure:3232: ./conftest ./configure: line 3234: ./conftest: No such file or directory configure:3236: $? = 127 configure:3243: error: in `/Users/chrish/Code/arrow/cpp/debug/jemalloc_ep-prefix/src/jemalloc_ep': configure:3245: error: cannot run C compiled programs. If you meant to cross compile, use `--host'. (Relevant bit in bold.) Well, that would make more sense, at least.
I create a close-enough conftest.c by hand: {{#include <stdio.h>}} {{int main(void) { return 0; }}} and try to compile it with the same command-line invocation: {{$ /Library/Developer/CommandLineTools/usr/bin/cc -o conftest conftest.c}} {{I get that same error:}} conftest.c:1:10: fatal error: 'stdio.h' file not found #include <stdio.h> ^ 1 error generated. However, I also have a cc in /usr/bin. If I try that one instead, things work: {{$ /usr/bin/cc -o conftest conftest.c}} {{$ ls -l conftest}} {{-rwxr-xr-x 1 chrish staff 4,2K oct 30 16:03 conftest*}} {{$ ./conftest}} {{(No error compiling or running conftest.c)}} The two executables seem to be the same compiler (or at least the exact same version): {{$ /usr/bin/cc --version Apple LLVM version 10.0.1 (clang-1001.0.46.4) Target: x86_64-apple-darwin18.7.0 Thread model: posix InstalledDir: /Library/Developer/CommandLineTools/usr/bin}} {{$ /Library/Developer/CommandLineTools/usr/bin/cc --version Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Re: [VOTE] Release Apache Arrow 0.15.1 - RC0
+1 (binding) * Verified source on Ubuntu 18.04 (using 0.15.1 RC verification script) * Verified wheels on Linux, macOS, and Windows using "verify-release-candidate.sh wheels ..." and verify-release-candidate-wheels.bat * Verified Linux binaries Thanks for fixing the macOS wheel! On Wed, Oct 30, 2019 at 11:24 AM Krisztián Szűcs wrote: > > Hi, > > I've uploaded the correct wheel for CPython 3.7 on macOS, also > tested it locally, it works properly. Created a JIRA [1] to test the > wheels in the release verification script similarly like we test the > linux packages, this should catch both the uploading issues and > the linking errors causing most of the troubles with wheels. > > Thanks, Krisztian > > [1]: https://issues.apache.org/jira/browse/ARROW-7032 > > On Tue, Oct 29, 2019 at 6:40 PM Krisztián Szűcs > wrote: > > > > I have locally the same binary, so something must have happened > > silently during the downloading process, without exiting with an error. > > The proper wheel is available under the GitHub release for that > > particular crossbow task here [1]. > > I'll download, sign and upload it to Bintray tomorrow evening (CET). > > > > [1]: > > https://github.com/ursa-labs/crossbow/releases/tag/build-722-travis-wheel-osx-cp37m > > > > On Mon, Oct 28, 2019 at 11:00 PM Wes McKinney wrote: > > > > > > I started looking at some of the Python wheels and found that the > > > macOS Python 3.7 wheel is corrupted. Note that it's only 101KB while > > > the other macOS wheels are ~35MB. > > > > > > Eyeballing the file list at > > > > > > https://bintray.com/apache/arrow/python-rc/0.15.1-rc0#files/python-rc/0.15.1-rc0 > > > > > > it seems this is the only wheel with this issue, but this suggests > > > that we should prioritize some kind of wheel integrity check using > > > Crossbow jobs. 
An issue for this is > > > > > > https://issues.apache.org/jira/browse/ARROW-2880 > > > > > > I'm going to check out some other wheels to see if they are OK, but > > > maybe just this one wheel can be regenerated? > > > > > > On Sun, Oct 27, 2019 at 4:31 PM Sutou Kouhei wrote: > > > > > > > > +1 (binding) > > > > > > > > I ran the followings on Debian GNU/Linux sid: > > > > > > > > * TEST_CSHARP=0 \ > > > > JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \ > > > > CUDA_TOOLKIT_ROOT=/usr \ > > > > dev/release/verify-release-candidate.sh source 0.15.1 0 > > > > * dev/release/verify-release-candidate.sh binaries 0.15.1 0 > > > > > > > > with: > > > > > > > > * gcc (Debian 9.2.1-8) 9.2.1 20190909 > > > > * openjdk version "1.8.0_232-ea" > > > > * Node.JS v12.1.0 > > > > * go version go1.12.10 linux/amd64 > > > > * nvidia-cuda-dev 10.1.105-3+b1 > > > > > > > > Notes: > > > > > > > > * C# sourcelink is failed as usual. > > > > > > > > * We can't use dev/release/verify-release-candidate.sh on > > > > master to verify source because it depends on the latest > > > > archery. We need to use > > > > dev/release/verify-release-candidate.sh in 0.15.1. > > > > > > > > > > > > Thanks, > > > > -- > > > > kou > > > > > > > > In > > > > "[VOTE] Release Apache Arrow 0.15.1 - RC0" on Fri, 25 Oct 2019 > > > > 20:43:07 +0200, > > > > Krisztián Szűcs wrote: > > > > > > > > > Hi, > > > > > > > > > > I would like to propose the following release candidate (RC0) of > > > > > Apache > > > > > Arrow version 0.15.1. This is a patch release consisting of 36 > > > > > resolved > > > > > JIRA issues[1]. > > > > > > > > > > This release candidate is based on commit: > > > > > b789226ccb2124285792107c758bb3b40b3d082a [2] > > > > > > > > > > The source release rc0 is hosted at [3]. > > > > > The binary artifacts are hosted at [4][5][6][7]. > > > > > The changelog is located at [8]. 
> > > > > > > > > > Please download, verify checksums and signatures, run the unit tests, > > > > > and vote on the release. See [9] for how to validate a release > > > > > candidate. > > > > > > > > > > The vote will be open for at least 72 hours. > > > > > > > > > > [ ] +1 Release this as Apache Arrow 0.15.1 > > > > > [ ] +0 > > > > > [ ] -1 Do not release this as Apache Arrow 0.15.1 because... > > > > > > > > > > [1]: > > > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.15.1 > > > > > [2]: > > > > > https://github.com/apache/arrow/tree/b789226ccb2124285792107c758bb3b40b3d082a > > > > > [3]: > > > > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.15.1-rc0 > > > > > [4]: https://bintray.com/apache/arrow/centos-rc/0.15.1-rc0 > > > > > [5]: https://bintray.com/apache/arrow/debian-rc/0.15.1-rc0 > > > > > [6]: https://bintray.com/apache/arrow/python-rc/0.15.1-rc0 > > > > > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.15.1-rc0 > > > > > [8]: > > > > > https://github.com/apache/arrow/blob/b789226ccb2124285792107c758bb3b40b3d082a/CHANGELOG.md > > > > > [9]: > > > > >
Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-30-0
The failed tasks here are a nuisance. If they can't be fixed, should they be removed from the nightlies? On Wed, Oct 30, 2019 at 7:26 AM Crossbow wrote: > > > Arrow Build Report for Job nightly-2019-10-30-0 > > All tasks: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0 > > Failed Tasks: > - docker-clang-format: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-clang-format > - docker-r-sanitizer: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-r-sanitizer > > Succeeded Tasks: > - centos-6: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-6 > - centos-7: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-7 > - centos-8: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-centos-8 > - conda-linux-gcc-py27: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py27 > - conda-linux-gcc-py36: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py36 > - conda-linux-gcc-py37: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-linux-gcc-py37 > - conda-osx-clang-py27: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py27 > - conda-osx-clang-py36: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py36 > - conda-osx-clang-py37: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-osx-clang-py37 > - conda-win-vs2015-py36: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-win-vs2015-py36 > - conda-win-vs2015-py37: > URL: > 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-conda-win-vs2015-py37 > - debian-buster: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-debian-buster > - debian-stretch: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-azure-debian-stretch > - docker-c_glib: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-c_glib > - docker-cpp-cmake32: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-cmake32 > - docker-cpp-release: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-release > - docker-cpp-static-only: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp-static-only > - docker-cpp: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-cpp > - docker-dask-integration: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-dask-integration > - docker-docs: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-docs > - docker-go: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-go > - docker-hdfs-integration: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-hdfs-integration > - docker-iwyu: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-iwyu > - docker-java: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-java > - docker-js: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-js > - docker-lint: > URL: > 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-lint > - docker-pandas-master: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-pandas-master > - docker-python-2.7-nopandas: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-2.7-nopandas > - docker-python-2.7: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-2.7 > - docker-python-3.6-nopandas: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-3.6-nopandas > - docker-python-3.6: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-3.6 > - docker-python-3.7: > URL: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-30-0-circle-docker-python-3.7 > -
Re: Arrow sync call October 30 at 12:00 US/Eastern, 16:00 UTC
Attendees: * Uwe Korn * Micah Kornfield * Praveen Kumar * Wes McKinney * Rok Mihevc * Neal Richardson Discussion: * docker-compose/github-actions (https://github.com/apache/arrow/pull/5589). Needs review, needs to be merged and have followup issues made. Currently too many jobs being run on every commit. * result vs. status C++: following up on previous discussion * Parquet PRs: who is blessed to merge? Technically should be an Apache Parquet committer (Wes, Uwe, Deepak, others?). If reviewing, ask one of them to merge. * C API: outstanding concerns (1) use of JSON for metadata, (2) who owns the data and has to free it? Uwe and Micah to review the C++/Python/R implementation On Tue, Oct 29, 2019 at 8:52 PM Neal Richardson wrote: > > Hi all, reminder that our biweekly call is 12 hours from now at > https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes > will be sent out to the mailing list afterwards. > > Neal
[jira] [Created] (ARROW-7032) [Release] Verify python wheels in the release verification script
Krisztian Szucs created ARROW-7032: -- Summary: [Release] Verify python wheels in the release verification script Key: ARROW-7032 URL: https://issues.apache.org/jira/browse/ARROW-7032 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Krisztian Szucs Fix For: 1.0.0 For Linux wheels use docker; otherwise set up a virtualenv and install the wheel supported on the host's platform. Testing should include the imports for the optional modules and perhaps running the unit tests, but the import testing should catch most of the wheel issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
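The "import testing" this issue proposes amounts to trying to import each module shipped in the wheel and collecting the failures. A minimal, stdlib-only sketch (the function name is hypothetical, and the real script would pass pyarrow's optional module names such as "pyarrow.parquet"):

```python
import importlib

def check_imports(module_names):
    """Try to import each module; return the names that failed to import."""
    failures = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            failures.append(name)
    return failures

# Stdlib names are used here so the sketch is runnable anywhere;
# a wheel check would list pyarrow and its optional submodules instead.
print(check_imports(["json", "csv", "definitely_not_a_module"]))
# -> ['definitely_not_a_module']
```

A broken wheel (e.g. one with linking errors) fails at import time, so even this shallow check catches most packaging problems, as the issue notes.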
Re: some questions, please help
On Wed, Oct 30, 2019 at 9:32 AM Micah Kornfield wrote: > > > > > > - I see some SIMD optimizations in arrow go binding, such as vectored > > sum. [2] > > >But arrow cpp lib doesn't leverage SIMD. [3] > > >Why not optimize it in cpp lib so all languages can benefit? > > You're welcome to contribute such optimizations to the C++ library > > > Note that even though C++ doesn't use explicit SIMD intrinsics often times > the compiler will generate SIMD code because it can auto-vectorize the > code. Note it will likely be important to have explicit dynamic/runtime SIMD dispatching on certain hot paths as we build binaries that need to be able to run on both newer and older CPUs > On Wed, Oct 30, 2019 at 7:25 AM Wes McKinney wrote: > > > hi Yibo > > > > On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai wrote: > > > > > > Hi, > > > > > > I'm new to Arrow. Would like to seek for help about some questions. Any > > comment is welcomed. > > > > > > - About source code tree, my understand is that "cpp" is the core arrow > > libraries, "c_glib, go, python, ..." are language bindings to ease > > integrating arrow into apps developed by that language. Is that correct? > > > > No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust > > > > * C/GLib, MATLAB, Python, R bind to C++ > > * Ruby binds to GLib > > > > > - Arrow implements many data types and aggregation functions(sum, mean, > > ...). [1] > > >IMO, more functions and types should be supported, like min/max, > > vector/tensor operations, big number, etc. I'm not sure if this is in > > arrow's scope, or the apps using arrow should deal with it themselves. > > > > Our objective at least in the C++ library is to have a generally > > useful "standard library" that handles common application concerns. > > Whether or not something is thought to be in scope may vary on a case > > by case basis -- if you can't find a JIRA issue for something in > > particular, please go ahead and open one. 
> > > > > - I see some SIMD optimizations in arrow go binding, such as vectored > > sum. [2] > > >But arrow cpp lib doesn't leverage SIMD. [3] > > >Why not optimize it in cpp lib so all languages can benefit? > > > > You're welcome to contribute such optimizations to the C++ library > > > > > > - Wes > > > > > [1] > > https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels > > > [2] > > https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s > > > [3] > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111 > > > > > > Yibo > >
Re: some questions, please help
hi Yibo On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai wrote: > > Hi, > > I'm new to Arrow. Would like to seek for help about some questions. Any > comment is welcomed. > > - About source code tree, my understand is that "cpp" is the core arrow > libraries, "c_glib, go, python, ..." are language bindings to ease > integrating arrow into apps developed by that language. Is that correct? No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust * C/GLib, MATLAB, Python, R bind to C++ * Ruby binds to GLib > - Arrow implements many data types and aggregation functions(sum, mean, ...). > [1] >IMO, more functions and types should be supported, like min/max, > vector/tensor operations, big number, etc. I'm not sure if this is in arrow's > scope, or the apps using arrow should deal with it themselves. Our objective at least in the C++ library is to have a generally useful "standard library" that handles common application concerns. Whether or not something is thought to be in scope may vary on a case by case basis -- if you can't find a JIRA issue for something in particular, please go ahead and open one. > - I see some SIMD optimizations in arrow go binding, such as vectored sum. [2] >But arrow cpp lib doesn't leverage SIMD. [3] >Why not optimize it in cpp lib so all languages can benefit? You're welcome to contribute such optimizations to the C++ library - Wes > [1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels > [2] > https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s > [3] > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111 > > Yibo
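The runtime SIMD dispatch mentioned in this thread (choosing a kernel once, at startup, based on detected CPU features, so one binary runs on both newer and older CPUs) can be sketched language-agnostically. A Python illustration with hypothetical names; the vectorized kernel is a stand-in, not real SIMD:

```python
def sum_scalar(values):
    # Portable fallback kernel: works on any CPU.
    total = 0.0
    for v in values:
        total += v
    return total

def sum_vectorized(values):
    # Stand-in for an AVX2/SSE kernel; a real one would use intrinsics.
    return float(sum(values))

def select_sum_kernel(cpu_has_avx2):
    # Dispatch once based on detected CPU features, then call the
    # chosen kernel on every hot-path invocation.
    return sum_vectorized if cpu_has_avx2 else sum_scalar

kernel = select_sum_kernel(cpu_has_avx2=False)
print(kernel([1.0, 2.0, 3.0]))  # -> 6.0
```

Both kernels must produce identical results; only their speed differs, which is why the dispatch can be decided purely from CPU capability flags.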
[jira] [Created] (ARROW-7031) [Python] Expose the offsets of a ListArray in python
Joris Van den Bossche created ARROW-7031: Summary: [Python] Expose the offsets of a ListArray in python Key: ARROW-7031 URL: https://issues.apache.org/jira/browse/ARROW-7031 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Assume the following ListArray: {code} In [1]: arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5]) In [2]: arr Out[2]: [ [ 1, 2, 3 ], [ 4, 5 ] ] {code} You can get the actual values as a flat array through {{.values}} / {{.flatten()}}, but there is currently no easy way to get back to the offsets (except from interpreting the buffers manually). We should probably add an {{offsets}} attribute (there is actually also a TODO comment for that). -- This message was sent by Atlassian Jira (v8.3.4#803005)
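The offsets/values encoding the issue refers to can be illustrated in plain Python: a ListArray stores a flat values array plus an offsets array, and list i spans values[offsets[i]:offsets[i+1]]. A small sketch (helper name is illustrative, not pyarrow API):

```python
def split_by_offsets(values, offsets):
    """Reconstruct the nested lists of a ListArray from its flat values
    and offsets, mirroring pa.ListArray.from_arrays(offsets, values)."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

# Same data as the example above: offsets [0, 3, 5] over values [1..5].
print(split_by_offsets([1, 2, 3, 4, 5], [0, 3, 5]))  # -> [[1, 2, 3], [4, 5]]
```

Going the other way (from nested lists back to offsets) is exactly the information a user currently has to dig out of the buffers by hand, which is why an {{offsets}} attribute would help.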
Re: State of decimal support in Arrow (from/to Parquet Decimal Logicaltype)
Hi Wes, the data is indeed not originating from Arrow, so I was looking for how to call the low level WriteBatch API. I figured it out now, it's actually straightforward in the Arrow-API, I just got confused a little with the spec at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#DECIMAL So for future reference: I multiply each value in a floating point array by pow(10, scale) and pass the resulting array (in my case: int32_t) directly to WriteBatch(). One thing I can imagine that could make the API a little easier to use: Provide a function that directly takes an array of floats or doubles and does the conversion internally. But it's not really needed, so it's not really worth adding. Thanks for your help and sorry for the annoyance, Roman -----Original Message----- From: Wes McKinney Sent: Tuesday, October 29, 2019 16:19 To: dev Subject: Re: State of decimal support in Arrow (from/to Parquet Decimal Logicaltype) It depends on the origin of your data. If your data is not originating from Arrow, then it may be better to produce an array of FixedLenByteArray and pass that to the low level WriteBatch API. If you would like some other API, please feel free to propose something. On Tue, Oct 29, 2019 at 10:13 AM wrote: > > Hi Wes, > > that was a bit unclear, sorry for that. With "an array", I'm referring to a > plain c++-type array, i.e. an array of float, uint32_t, ... > This means that I do not use the arrow::Array-based write API, but I use the > TypedColumnWriter::WriteBatch() function directly and do not have any arrow > arrays. Are there any advantages of not using the writebatch directly and > instead using arrow::Arrays? > > Thanks, > Roman > > -----Original Message----- > From: Wes McKinney > Sent: Tuesday, October 29, 2019 15:59 > To: dev > Subject: Re: State of decimal support in Arrow (from/to Parquet > Decimal Logicaltype) > > On Tue, Oct 29, 2019 at 3:11 AM wrote: > > > > Hi Wes, > > > > thanks for the response. 
There's one thing that is still a little unclear > > to me: > > I had a look at the code for function WriteArrowSerialize > arrow::Decimal128Type> in the reference you provided. I don't have arrow > > data in the first place, but as I understand it, I need to have an array of > > FixedLenByteArrays objects which then point to the actual decimal values in > > the big_endian_values buffer. Is this the only way to write decimal types > > or is it also possible to directly provide an array with values to > > writeBatch()? > > > > Could you clarify what you mean by "an array"? If you use the > arrow::Array-based write API then it will invoke this serializer > specialization > > https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497ca > 585ae1c/cpp/src/parquet/column_writer.cc#L1569 > > That's what we're calling (if I'm not mistaken, since I just worked on > this code recently) when writing arrow::Decimal128Array. If you set a > breakpoint with gdb there you can see the call stack > > > For the issues, I also found > > https://issues.apache.org/jira/browse/ARROW-6990, but I'm not sure if this > > is also related to the issues you created. > > > > Thanks, > > Roman > > > > -----Original Message----- > > From: Wes McKinney > > Sent: Monday, October 28, 2019 21:11 > > To: dev > > Subject: Re: State of decimal support in Arrow (from/to Parquet > > Decimal Logicaltype) > > > > hi Roman, > > > > On Mon, Oct 28, 2019 at 5:56 AM wrote: > > > > > > Hi everyone, > > > > > > > > > > > > I have a question about the state of decimal support in Arrow when > > > reading from/writing to Parquet. > > > > > > * Is writing decimals to parquet supposed to work? Are there any > > > examples on how to do this in C++? 
> > > > Yes, it's supported, the details are here > > > > https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497 > > ca > > 585ae1c/cpp/src/parquet/column_writer.cc#L1511 > > > > > * When reading decimals in a parquet file with pyarrow and > > > converting > > > the resulting table to a pandas dataframe, datatype in the cells > > > is "object". As a consequence, performance when doing analysis on > > > this table is suboptimal. Can I somehow directly get the decimals > > > from the parquet file into floats/doubles in a pandas dataframe? > > > > Some work will be required. The cleanest way would be to cast > > decimal128 columns to float32/float64 prior to converting to pandas. > > > > I didn't see an issue for this right away so I opened > > > > https://issues.apache.org/jira/browse/ARROW-7010 > > > > I also opened > > > > https://issues.apache.org/jira/browse/ARROW-7011 > > > > about going the other way. This would be a useful thing to contribute to > > the project. > > > > Thanks > > Wes > > > > > > > > > > > Thanks in advance, > > > > > > Roman > > > > > > > > > > > >
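The scaling step Roman describes (multiply each floating point value by pow(10, scale) and store the resulting integers) can be sketched in a few lines. A stdlib-only illustration with a hypothetical helper name; in the real code the resulting int32_t array goes to the C++ WriteBatch API:

```python
def to_unscaled(values, scale):
    """Convert floats to the unscaled integers a Parquet DECIMAL(p, s)
    column stores; the logical value is unscaled * 10**-scale."""
    factor = 10 ** scale
    return [round(v * factor) for v in values]

# 1.25 stored with scale=2 becomes 125, i.e. 125 * 10**-2 == 1.25.
print(to_unscaled([1.25, -0.5], scale=2))  # -> [125, -50]
```

Note that rounding happens here: a float that is not exactly representable at the chosen scale loses precision at this step, which is inherent to storing floats as decimals.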
[jira] [Created] (ARROW-7030) csv example coredump error
wjw created ARROW-7030: -- Summary: csv example coredump error Key: ARROW-7030 URL: https://issues.apache.org/jira/browse/ARROW-7030 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.15.0 Environment: g++:7.3.1 Reporter: wjw I am trying to write an example for reading csv with apache-arrow in c++ according to the official one, https://arrow.apache.org/docs/cpp/csv.html#, but it meets a Segmentation fault at `status = reader->Read(&table);` Can anyone help? thank you~ environment info: `g++:7.3.1` make command: `c++ -g -std=c++11 -Wall -O2 test.cpp -o test -I../../arrow/src -L../../arrow/lib -larrow -lparquet -Wl,-rpath,./` code info:
```
arrow::Status status;
arrow::MemoryPool *pool = arrow::default_memory_pool();
std::shared_ptr<arrow::io::InputStream> input;
std::string csv_file = "test.csv";
auto input_readable = std::dynamic_pointer_cast<arrow::io::ReadableFile>(input);
PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(csv_file, pool, &input_readable));
auto read_options = arrow::csv::ReadOptions::Defaults();
read_options.use_threads = false;
read_options.column_names.emplace_back("name");
read_options.column_names.emplace_back("age");
auto parse_options = arrow::csv::ParseOptions::Defaults();
auto convert_options = arrow::csv::ConvertOptions::Defaults();
convert_options.include_missing_columns = true;
std::shared_ptr<arrow::csv::TableReader> reader;
status = arrow::csv::TableReader::Make(pool, input, read_options, parse_options, convert_options, &reader);
if (!status.ok()) { std::cout << "make csv table error" << std::endl; return -1; }
std::shared_ptr<arrow::Table> table;
status = reader->Read(&table);
if (!status.ok()) { std::cout << "read csv table error" << std::endl; return -1; }
```
coredump info:
```
Program terminated with signal 11, Segmentation fault.
#0  0x7fe4fcda83e7 in arrow::io::internal::ReadaheadSpooler::Impl::WorkerLoop() () from ./libarrow.so.15
(gdb) bt
#0  0x7fe4fcda83e7 in arrow::io::internal::ReadaheadSpooler::Impl::WorkerLoop() () from ./libarrow.so.15
#1  0x7fe4fd405a2f in execute_native_thread_routine () from ./libarrow.so.15
#2  0x7fe4fa8ecdf3 in start_thread () from /lib64/libpthread.so.0
#3  0x7fe4fb86e1bd in clone () from /lib64/libc.so.6
```
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7029) [Go] unsafe pointer arithmetic panic w/ Go-1.14-dev
Sebastien Binet created ARROW-7029: -- Summary: [Go] unsafe pointer arithmetic panic w/ Go-1.14-dev Key: ARROW-7029 URL: https://issues.apache.org/jira/browse/ARROW-7029 Project: Apache Arrow Issue Type: New Feature Components: Go Reporter: Sebastien Binet Go-1.14 (to be released in Feb-2020) has a new analysis pass (enabled with -race) that checks for unsafe pointer arithmetic: ~~ go test -race -run=Example_minimal . --- FAIL: Example_minimal (0.00s) panic: runtime error: unsafe pointer arithmetic [recovered] panic: runtime error: unsafe pointer arithmetic goroutine 1 [running]: testing.(*InternalExample).processRunResult(0xcadc80, 0x0, 0x0, 0x8927, 0x90a400, 0xcb62c0, 0xca48c8) /home/binet/sdk/go/src/testing/example.go:89 +0x71f testing.runExample.func2(0xbf6675c29511646d, 0x20fb5f, 0xc7f780, 0xca2378, 0xca2008, 0xc86360, 0xcadc80, 0xcadcb0) /home/binet/sdk/go/src/testing/run_example.go:58 +0x143 panic(0x90a400, 0xcb62c0) /home/binet/sdk/go/src/runtime/panic.go:915 +0x370 github.com/apache/arrow/go/arrow/memory.memory_memset_avx2(0xc9e200, 0x40, 0x40, 0xc9c000) /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/memory/memory_avx2_amd64.go:33 +0xa4 github.com/apache/arrow/go/arrow/memory.Set(...) 
/home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/memory/memory.go:25 github.com/apache/arrow/go/arrow/array.(*builder).init(0xc84600, 0x20) /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/builder.go:101 +0x23a github.com/apache/arrow/go/arrow/array.(*Int64Builder).init(0xc84600, 0x20) /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:102 +0x60 github.com/apache/arrow/go/arrow/array.(*Int64Builder).Resize(0xc84600, 0x2) /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:125 +0x8c github.com/apache/arrow/go/arrow/array.(*builder).reserve(0xc84600, 0x1, 0xcad918) /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/builder.go:138 +0xdc github.com/apache/arrow/go/arrow/array.(*Int64Builder).Reserve(0xc84600, 0x1) /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:113 +0x68 github.com/apache/arrow/go/arrow/array.(*Int64Builder).Append(0xc84600, 0x1) /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:60 +0x46 github.com/apache/arrow/go/arrow_test.Example_minimal() /home/binet/work/gonum/src/github.com/apache/arrow/go/arrow/example_test.go:39 +0x153 testing.runExample(0x94714a, 0xf, 0x95d8a8, 0x957614, 0x83, 0x0, 0x0) /home/binet/sdk/go/src/testing/run_example.go:62 +0x275 testing.runExamples(0xcaded8, 0xc7a2e0, 0xb, 0xb, 0x100) /home/binet/sdk/go/src/testing/example.go:44 +0x212 testing.(*M).Run(0xc00010, 0x0) /home/binet/sdk/go/src/testing/testing.go:1125 +0x3b4 main.main() _testmain.go:130 +0x224 FAIL github.com/apache/arrow/go/arrow 0.009s FAIL ~~ see: [https://groups.google.com/forum/#!msg/golang-dev/SzwDoqoRVJA/IvtnBW5oDwAJ] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7028) Dates in R are different when saved and loaded with parquet
Sascha created ARROW-7028: - Summary: Dates in R are different when saved and loaded with parquet Key: ARROW-7028 URL: https://issues.apache.org/jira/browse/ARROW-7028 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 0.15.0 Reporter: Sascha When saving R-dataframes with parquet and loading them again, the internal representation of Dates changes, leading e.g. to errors when comparing them in dplyr::if_else.
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
tmp = tempdir()
dat = tibble(tag = as.Date("2018-01-01"))
dat2 = tibble(tag2 = as.Date("2019-01-01"))
arrow::write_parquet(dat, file.path(tmp, "dat.parquet"))
dat = arrow::read_parquet(file.path(tmp, "dat.parquet"))
typeof(dat$tag)
#> [1] "integer"
typeof(dat2$tag2)
#> [1] "double"
bind_cols(dat, dat2) %>%
  mutate(comparison = if_else(TRUE, tag, tag2))
#> `false` must be a `Date` object, not a `Date` object
```
Created on 2019-10-30 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7027) [Python] pa.table(..) returns instead of raises error if passing invalid object
Joris Van den Bossche created ARROW-7027: Summary: [Python] pa.table(..) returns instead of raises error if passing invalid object Key: ARROW-7027 URL: https://issues.apache.org/jira/browse/ARROW-7027 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 When passing eg a Series instead of a DataFrame, you get: {code} In [4]: df = pd.DataFrame({'a': [1, 2, 3]}) In [5]: table = pa.table(df['a']) In [6]: table Out[6]: TypeError('Expected pandas DataFrame or python dictionary') In [7]: type(table) Out[7]: TypeError {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
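The bug pattern here is that an exception object is constructed and returned instead of raised, so the caller silently receives a TypeError instance as if it were a table. A minimal illustration with hypothetical function names (not the actual pyarrow code):

```python
def table_buggy(obj):
    # Bug pattern from the report: the error object is created but
    # returned rather than raised, so callers get it as a value.
    if not isinstance(obj, dict):
        return TypeError("Expected pandas DataFrame or python dictionary")
    return obj

def table_fixed(obj):
    # Correct behaviour: raise, so the caller cannot miss the error.
    if not isinstance(obj, dict):
        raise TypeError("Expected pandas DataFrame or python dictionary")
    return obj

print(type(table_buggy([1, 2])))  # -> <class 'TypeError'>
```

This is easy to introduce in code that builds error objects for deferred raising; the symptom is exactly what the {{type(table)}} output above shows.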
some questions, please help
Hi, I'm new to Arrow. I'd like to seek help with some questions. Any comments are welcome. - About the source code tree, my understanding is that "cpp" is the core arrow libraries, and "c_glib, go, python, ..." are language bindings to ease integrating arrow into apps developed in that language. Is that correct? - Arrow implements many data types and aggregation functions (sum, mean, ...). [1] IMO, more functions and types should be supported, like min/max, vector/tensor operations, big numbers, etc. I'm not sure if this is in arrow's scope, or if the apps using arrow should deal with it themselves. - I see some SIMD optimizations in the arrow go binding, such as vectored sum. [2] But the arrow cpp lib doesn't leverage SIMD. [3] Why not optimize it in the cpp lib so all languages can benefit? [1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels [2] https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111 Yibo
[jira] [Created] (ARROW-7026) [Java] Remove assertions in MessageSerializer/vector/writer/reader
Ji Liu created ARROW-7026: - Summary: [Java] Remove assertions in MessageSerializer/vector/writer/reader Key: ARROW-7026 URL: https://issues.apache.org/jira/browse/ARROW-7026 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently assertions exist in many classes like {{MessageSerializer/JsonReader/JsonWriter/ListVector}} etc. i. If the jvm arguments are not specified, these checks will be skipped, leading to potential problems. ii. Java errors produced by failed assertions are not caught by traditional catch clauses. To fix this, use {{Preconditions}} instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
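The issue above concerns Java, where `assert` is a no-op unless the JVM runs with `-ea` and a failed assertion throws an Error rather than an Exception. Python has the analogous pitfall: `assert` statements are stripped when the interpreter runs with `-O`. A sketch of the same fix, replacing an assertion with an explicit precondition check (function names are illustrative):

```python
def set_value_count_assert(count):
    # Like Java's `assert`, this check vanishes under `python -O`,
    # so invalid input can slip through silently.
    assert count >= 0, "count must be non-negative"
    return count

def set_value_count_checked(count):
    # An explicit check always runs and raises a normal, catchable
    # exception -- the behaviour Guava's Preconditions gives in Java.
    if count < 0:
        raise ValueError("count must be non-negative")
    return count

print(set_value_count_checked(3))  # -> 3
```

The explicit check costs one comparison but guarantees the invariant holds in every deployment configuration, which is the point of the issue.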