[jira] [Created] (ARROW-6031) [Java] Support iterating a vector by ArrowBufPointer
Liya Fan created ARROW-6031:
---------------------------

Summary: [Java] Support iterating a vector by ArrowBufPointer
Key: ARROW-6031
URL: https://issues.apache.org/jira/browse/ARROW-6031
Project: Apache Arrow
Issue Type: New Feature
Reporter: Liya Fan
Assignee: Liya Fan

Provide the functionality to traverse a vector (fixed-width vector & variable-width vector) by an iterator. This is convenient for scenarios where vector elements are accessed in sequence.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
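To make the proposal concrete, here is a rough sketch of the kind of usage such an iterator would enable. Everything here except ArrowBufPointer itself is hypothetical (getDataPointer is an assumed accessor); the actual API is what this issue would define:

{code:java}
import org.apache.arrow.memory.util.ArrowBufPointer;
import org.apache.arrow.vector.VarCharVector;

public class PointerScan {
  // Sketch only: visit each element of a vector as an ArrowBufPointer
  // (a zero-copy view of the element's bytes) instead of materializing
  // each value. getDataPointer is an assumed accessor, not a confirmed API.
  static void scan(VarCharVector vector, ArrowBufPointer reusablePointer) {
    for (int i = 0; i < vector.getValueCount(); i++) {
      vector.getDataPointer(i, reusablePointer);
      // ... compare or hash the referenced region without copying it
    }
  }
}
{code}

An iterator wrapping this loop would let callers traverse fixed-width and variable-width vectors uniformly, which is exactly the sequential-access scenario the issue describes.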
[jira] [Created] (ARROW-6030) [Java] Efficiently compute hash code for ArrowBufPointer
Liya Fan created ARROW-6030:
---------------------------

Summary: [Java] Efficiently compute hash code for ArrowBufPointer
Key: ARROW-6030
URL: https://issues.apache.org/jira/browse/ARROW-6030
Project: Apache Arrow
Issue Type: New Feature
Components: Java
Reporter: Liya Fan
Assignee: Liya Fan

Now that ArrowBufHasher has been introduced, we can compute the hash code of a contiguous region within an ArrowBuf. We optimize the process to make it efficient and avoid recomputation.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
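The "avoid recomputation" goal presumably comes down to caching: compute the hash lazily, store it, and invalidate it whenever the pointer is repositioned. A minimal sketch under that assumption (illustrative only, not ArrowBufPointer's actual implementation):

{code:java}
// Sketch, assuming the optimization is lazy caching plus invalidation.
// This is a hypothetical stand-in, not ArrowBufPointer's real internals.
public class CachingPointer {
  private long address;
  private long length;
  private int cachedHash;
  private boolean hashValid;   // dropped whenever the region changes

  public void set(long address, long length) {
    this.address = address;
    this.length = length;
    this.hashValid = false;    // repositioned: the old hash is stale
  }

  @Override
  public int hashCode() {
    if (!hashValid) {
      cachedHash = computeHash(address, length);
      hashValid = true;        // recompute only after the next set()
    }
    return cachedHash;
  }

  private static int computeHash(long address, long length) {
    // Placeholder only: a real implementation would hash the bytes of the
    // region, e.g. via the ArrowBufHasher this issue mentions.
    return (int) (address * 31 + length);
  }
}
{code}

Repeated hashCode() calls between repositionings then cost a field read instead of a full pass over the buffer.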
[jira] [Created] (ARROW-6029) [R] could not build
kohleth created ARROW-6029:
--------------------------

Summary: [R] could not build
Key: ARROW-6029
URL: https://issues.apache.org/jira/browse/ARROW-6029
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 0.14.0
Environment:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Reporter: kohleth

Hi there, when trying to build the R wrapper using
{code:java}
remotes::install_github("apache/arrow", subdir = "r")
{code}
I hit the following error:

{code}
Found pkg-config cflags and libs!
PKG_CFLAGS=-DNDEBUG -DARROW_R_WITH_ARROW
PKG_LIBS=-larrow -lparquet
** libs
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -DNDEBUG -DARROW_R_WITH_ARROW -I"/usr/lib/R/site-library/Rcpp/include" -fvisibility=hidden -fpic -g -O2 -fdebug-prefix-map=/build/r-base-VjHo9C/r-base-3.6.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c array.cpp -o array.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -DNDEBUG -DARROW_R_WITH_ARROW -I"/usr/lib/R/site-library/Rcpp/include" -fvisibility=hidden -fpic -g -O2 -fdebug-prefix-map=/build/r-base-VjHo9C/r-base-3.6.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c array__to_vector.cpp -o array__to_vector.o
array__to_vector.cpp: In function 'Rcpp::List Table__to_dataframe(const std::shared_ptr<arrow::Table>&, bool)':
array__to_vector.cpp:819:65: error: 'using element_type = class arrow::Column {aka class arrow::Column}' has no member named 'chunks'
   converters[i] = arrow::r::Converter::Make(table->column(i)->chunks());
array__to_vector.cpp:820:23: error: 'using element_type = class arrow::Table {aka class arrow::Table}' has no member named 'field'
   names[i] = table->field(i)->name();
/usr/lib/R/etc/Makeconf:176: recipe for target 'array__to_vector.o' failed
make: *** [array__to_vector.o] Error 1
ERROR: compilation failed for package 'arrow'
* removing '/home/kchia/R/x86_64-pc-linux-gnu-library/3.6/arrow'
Error: Failed to install 'arrow' from GitHub: (converted from warning) installation of package '/tmp/RtmpfYJZFa/file33fc6aee0ae6/arrow_0.14.0.9000.tar.gz' had non-zero exit status
{code}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
+1 on a 0.15.0 release. At a minimum, if we could detect the old-format stream and provide a clear error message for Python and Java, I think that would help the transition. If we are also able to implement readers/writers that can fall back to the 4-byte prefix, that would be nice to have.

On Wed, Jul 24, 2019 at 1:27 PM Jacques Nadeau wrote:

> I'm ok with the change and 0.15 release to better manage it.
>
>> I've always understood the metadata to be a few dozen/hundred KB, a
>> small percentage of the total message size. I could be underestimating
>> the ratios though -- is it common to have tables w/ 1000+ columns? I've
>> seen a few reports like that in cuDF, but I'm curious to hear
>> Jacques'/Dremio's experience too.
>
> Metadata size has been an issue at different points for us. We do
> definitely see datasets with 1000+ columns. It is also compounded by the
> fact that as we add more columns, we typically decrease row count so that
> the individual batches are still easily pipelined--which further increases
> the relative ratio between data and metadata.
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
I'm ok with the change and 0.15 release to better manage it.

> I've always understood the metadata to be a few dozen/hundred KB, a
> small percentage of the total message size. I could be underestimating
> the ratios though -- is it common to have tables w/ 1000+ columns? I've
> seen a few reports like that in cuDF, but I'm curious to hear
> Jacques'/Dremio's experience too.

Metadata size has been an issue at different points for us. We do definitely see datasets with 1000+ columns. It is also compounded by the fact that as we add more columns, we typically decrease row count so that the individual batches are still easily pipelined--which further increases the relative ratio between data and metadata.
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
> I'm not sure I understand this suggestion:
> 1. Wouldn't this cause old readers to miss the last 4 bytes of the buffer (and provide meaningless bytes at the beginning).
> 2. The current proposal on the other thread is to have the pattern be <0x>

Sorry, I didn't mean to say an int64_t length, just that now we'd be reserving 8 bytes in the "metadata length" position where today we reserve 4.

I'm not sure about every language, but at least in Python/JS an external forwards-compatible solution would involve slicing the message buffer up front like this:

    def adjust_message_buffer(message_bytes):
        buf = pa.py_buffer(message_bytes)
        if first_four_bytes_are_max_int32(message_bytes):
            return buf.slice(4)
        return buf

On 7/23/19 7:31 PM, Micah Kornfield wrote:
>> Could we detect the 4-byte length, incur a penalty copying the memory to an aligned buffer, then continue consuming the stream?
>
> I think that is the plan (or at least would be my plan) if we go ahead with the change.
>
>> (It's probably fine if we only write the 8-byte length, since consumers on older versions of Arrow could slice from the 4th byte before passing a buffer to the reader).
>
> I'm not sure I understand this suggestion:
> 1. Wouldn't this cause old readers to miss the last 4 bytes of the buffer (and provide meaningless bytes at the beginning).
> 2. The current proposal on the other thread is to have the pattern be <0x>
>
> Thanks,
> Micah
>
> On Tue, Jul 23, 2019 at 11:43 AM Paul Taylor wrote:
>
>> +1 for a 0.15.0 before 1.0 if we go ahead with this.
>>
>> I'm curious to hear others' thoughts about compatibility. I think we should avoid breaking backwards compatibility if possible. It's common for apps/libs to be pinned on specific Arrow versions, and I worry it'd cause a lot of work for downstream devs to audit their tool suite for full Arrow binary compatibility (and/or require their customers to do the same).
>>
>> Could we detect the 4-byte length, incur a penalty copying the memory to an aligned buffer, then continue consuming the stream? (It's probably fine if we only write the 8-byte length, since consumers on older versions of Arrow could slice from the 4th byte before passing a buffer to the reader).
>>
>> I've always understood the metadata to be a few dozen/hundred KB, a small percentage of the total message size. I could be underestimating the ratios though -- is it common to have tables w/ 1000+ columns? I've seen a few reports like that in cuDF, but I'm curious to hear Jacques'/Dremio's experience too.
>>
>> If copying is feasible, it doesn't seem so bad a trade-off to maintain backwards-compatibility. As libraries and consumers upgrade their Arrow dependencies, the 4-byte length will be less and less common, and they'll be less likely to pay the cost.
>>
>> On 7/23/19 2:22 AM, Uwe L. Korn wrote:
>>> It is also a good way to test the change in public. We don't want to adjust something like this anymore in a 1.0.0 release. Already doing this in 0.15.0 and then maybe doing adjustments due to issues that appear "in the wild" is psychologically the easier way. There is a lot of thinking of users bound up with the magic 1.0, thus I would plan to minimize what is changed between 1.0 and pre-1.0. This also should save us maintainers some time, as I would expect different behaviour in bug reports between 1.0 and pre-1.0 issues.
>>>
>>> Uwe
>>>
>>> On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
>>>> I think the main reason to do a release before 1.0.0 is if we want to make the change that would give a good error message for forward incompatibility (I think this could be done as 0.14.2 since it would just be clarifying an error message). Otherwise, I think including it in 1.0.0 would be fine (it's still not clear to me if there is consensus to fix the issue).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Monday, July 22, 2019, Wes McKinney wrote:
>>>>> I'd be satisfied with fixing the Flatbuffer alignment issue either in a 0.15.0 or 1.0.0. In the interest of expediency, though, making a 0.15.0 with this change sooner rather than later might be prudent.
>>>>>
>>>>> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Recently we've discussed breaking the IPC format to fix a long-standing alignment issue. See this discussion:
>>>>>> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
>>>>>>
>>>>>> Should we first do a 0.15.0 in order to get those format fixes right? Once that is fine and settled we can move to the 1.0.0 release?
>>>>>>
>>>>>> Regards
>>>>>> Antoine.
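The up-front detection Paul sketches in Python generalizes to the other implementations. Below is a Java-flavored sketch, assuming (per the other thread's proposal) that the new prefix is a 0xFFFFFFFF continuation marker followed by the 4-byte metadata length; the names are illustrative, not the actual Arrow Java reader API:

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class MessagePrefix {
      // Sketch: distinguish the old 4-byte length prefix from the proposed
      // marker+length prefix. 0xFFFFFFFF reads the same in either byte order.
      static final int CONTINUATION_MARKER = 0xFFFFFFFF;

      static int readMetadataLength(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int first = data.readInt();
        if (first == CONTINUATION_MARKER) {
          // Assumed new format: the real length follows the marker.
          // Arrow lengths are little-endian; readInt is big-endian, so swap.
          return Integer.reverseBytes(data.readInt());
        }
        // Old format: the first four bytes are themselves the length. A reader
        // wanting the new alignment could copy to an aligned buffer here.
        return Integer.reverseBytes(first);
      }
    }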
Building on Arrow CUDA
I'm looking at options to replace the custom Arrow logic in cuDF with Arrow library calls. What's the recommended way to declare a dependency on pyarrow/Arrow C++ with CUDA support?

I see in the docs it says to build from source, but that's only an option for an (advanced) end user. And building/vendoring libarrow_cuda.so isn't a great option for a non-Arrow library, because someone who does build Arrow-with-CUDA from source will conflict with the version we ship.

Right now we're considering statically linking libarrow_cuda into libcudf.so and vendoring Arrow's CUDA Cython alongside ours, but this increases compile times/library size.

Is there a package management solution (like pip/conda install pyarrow[cuda]) that I'm missing? If not, should there be?

Best,
Paul
[jira] [Created] (ARROW-6028) Failed to compile on windows platform using arrow
Haowei Yu created ARROW-6028:
----------------------------

Summary: Failed to compile on windows platform using arrow
Key: ARROW-6028
URL: https://issues.apache.org/jira/browse/ARROW-6028
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Affects Versions: 0.14.0
Reporter: Haowei Yu

I am writing a Python extension, trying to compile C++ code and link against the Arrow library on the Windows platform (using Visual Studio 2017), and compilation failed.

{code:text}
building 'snowflake.connector.arrow_iterator' extension
C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Icpp/ArrowIterator/ -Ic:\Users\Haowei\py36env\lib\site-packages\pyarrow\include -IC:\Users\Haowei\AppData\Local\Programs\Python\Python36\include -IC:\Users\Haowei\AppData\Local\Programs\Python\Python36\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\cppwinrt" /EHsc /Tpbuild\cython\arrow_iterator.cpp /Fobuild\temp.win-amd64-3.6\Release\build\cython\arrow_iterator.obj -std=c++11
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
arrow_iterator.cpp
c:\Users\Haowei\py36env\lib\site-packages\pyarrow\include\arrow/type.h(852): error C2528: '__timezone': pointer to reference is illegal
c:\Users\Haowei\py36env\lib\site-packages\pyarrow\include\arrow/type.h(859): error C2269: cannot create a pointer or reference to a qualified function type (requires pointer-to-member)
c:\Users\Haowei\py36env\lib\site-packages\pyarrow\include\arrow/type.h(853): error C2664: 'std::basic_string<char,std::char_traits<char>,std::allocator<char>>::basic_string(const std::basic_string<char,std::char_traits<char>,std::allocator<char>> &)': cannot convert argument 1 from 'const std::string *' to 'std::initializer_list<_Elem>' with [ _Elem=char ]
c:\Users\Haowei\py36env\lib\site-packages\pyarrow\include\arrow/type.h(852): note: No constructor could take the source type, or constructor overload resolution was ambiguous
c:\Users\Haowei\py36env\lib\site-packages\pyarrow\include\arrow/type.h(859): error C2440: 'return': cannot convert from 'std::string' to 'const std::string *(__cdecl *)(void)'
c:\Users\Haowei\py36env\lib\site-packages\pyarrow\include\arrow/type.h(859): note: No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
c:\Users\Haowei\py36env\lib\site-packages\pyarrow\include\arrow/type.h(1126): error C2528: '__timezone': pointer to reference is illegal
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\Community\\VC\\Tools\\MSVC\\14.16.27023\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
{code}

I googled a little bit and found a similar issue in the feather repo: https://github.com/wesm/feather/issues/111

So I did something similar to their fix, adding the following code to the type.h header file (per https://github.com/wesm/feather/pull/146/files):

{code:c++}
#if _MSC_VER >= 1900
#undef timezone
#endif
{code}

Not sure if this is the right way to fix it. If yes, I can submit a PR.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Created] (ARROW-6027) CMake Build w/boost_ep fails on Windows - "%1 is not a valid Win32 application"
Jonathan McDevitt created ARROW-6027:
------------------------------------

Summary: CMake Build w/boost_ep fails on Windows - "%1 is not a valid Win32 application"
Key: ARROW-6027
URL: https://issues.apache.org/jira/browse/ARROW-6027
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Jonathan McDevitt
Attachments: _release64CMakeBuildLogs.txt, _release64CMakeLogs.txt

Hi all,

I seem to be running into an issue when building Apache Arrow for Windows. It fails to build Boost; in the CMake output it says:

{code:java}
CMake Error at D:/Staging/arrow/cpp/release64/boost_ep-prefix/src/boost_ep-stamp/boost_ep-configure-Release.cmake:49 (message):
  Command failed: %1 is not a valid Win32 application
  './bootstrap.sh' '--prefix=D:/Staging/arrow/cpp/release64/boost_ep-prefix/src/boost_ep' '--with-libraries=filesystem,regex,system'
{code}

I've been trying to address this issue and am currently investigating using a pre-built Boost library as a workaround, but the expectation is that this should work out of the box. I have attached logs demonstrating this behaviour. The initial step of running CMake for Windows 64 is fine, but the actual build step is what fails, and the boost_ep-configure-*.log files are empty, so there is nothing there to give an idea of what's going on.

h2. Expected Behaviour

When building Apache Arrow 0.14.x, the build should work out of the box when VS 2015 build tools are present and the environment is configured with vcvarsall for the appropriate architecture.

h2. Observed Behaviour

Build fails with error:

{code:java}
Command failed: %1 is not a valid Win32 application
'./bootstrap.sh' '--prefix=D:/Staging/arrow/cpp/release64/boost_ep-prefix/src/boost_ep' '--with-libraries=filesystem,regex,system'
{code}

h2. Steps to Reproduce

# Sync to Maintenance 0.14.x with 'git clone -b maint-0.14.x https://github.com/apache/arrow.git'
# Following the instructions at https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst:
## Create a 'build' directory from which to run CMake and generate the appropriate build files.
## Run "%VS140COMNTOOLS%..\..\VC\vcvarsall.bat" amd64
## From within the build directory, run "cmake .. -G "Visual Studio 14 2015 Win64" -DARROW_BUILD_TESTS=ON"
### Alternatively, if running Ninja, run "cmake .. -GNinja -DCMAKE_C_COMPILER="cl.exe" -DCMAKE_CXX_COMPILER="cl.exe" -DARROW_BUILD_TESTS=ON"
## Observe error.

Thanks,
~Jon

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
Re: Arrow sync call July 24 at 12:00 US/Eastern, 16:00 UTC
Want to try https://meet.google.com/myj-ospb-dxw

On Wed, Jul 24, 2019 at 9:12 AM Antoine Pitrou wrote:

> Apparently we're all having the same problem...
>
> On 24/07/2019 at 18:06, Micah Kornfield wrote:
> > Is this happening? I can't seem to join?
> >
> > On Tue, Jul 23, 2019 at 7:26 PM Neal Richardson <neal.p.richard...@gmail.com> wrote:
> >
> >> Hi everyone,
> >> Reminder that the biweekly Arrow call is tomorrow (well, already today for
> >> some of you) at https://meet.google.com/vtm-teks-phx. All are welcome to
> >> join. Notes will be sent out to the mailing list afterwards.
> >>
> >> Neal
Re: Arrow sync call July 24 at 12:00 US/Eastern, 16:00 UTC
Apparently we're all having the same problem...

On 24/07/2019 at 18:06, Micah Kornfield wrote:
> Is this happening? I can't seem to join?
>
> On Tue, Jul 23, 2019 at 7:26 PM Neal Richardson wrote:
>
>> Hi everyone,
>> Reminder that the biweekly Arrow call is tomorrow (well, already today for
>> some of you) at https://meet.google.com/vtm-teks-phx. All are welcome to
>> join. Notes will be sent out to the mailing list afterwards.
>>
>> Neal
Re: Arrow sync call July 24 at 12:00 US/Eastern, 16:00 UTC
Is this happening? I can't seem to join?

On Tue, Jul 23, 2019 at 7:26 PM Neal Richardson wrote:

> Hi everyone,
> Reminder that the biweekly Arrow call is tomorrow (well, already today for
> some of you) at https://meet.google.com/vtm-teks-phx. All are welcome to
> join. Notes will be sent out to the mailing list afterwards.
>
> Neal
[jira] [Created] (ARROW-6026) [Doc] Add CONTRIBUTING.md
Antoine Pitrou created ARROW-6026:
----------------------------------

Summary: [Doc] Add CONTRIBUTING.md
Key: ARROW-6026
URL: https://issues.apache.org/jira/browse/ARROW-6026
Project: Apache Arrow
Issue Type: Task
Components: Documentation
Reporter: Antoine Pitrou
Fix For: 1.0.0

A CONTRIBUTING.md file at the top level of a repository is automatically picked up by GitHub and displayed when people open an issue or PR for the first time.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Created] (ARROW-6025) [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 tests
Krisztian Szucs created ARROW-6025:
-----------------------------------

Summary: [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 tests
Key: ARROW-6025
URL: https://issues.apache.org/jira/browse/ARROW-6025
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Gandiva
Reporter: Krisztian Szucs

I've recently enabled Gandiva in the conda C++ ursabot builders. The container doesn't contain the required timezones, so the tests are failing:

{code}
../src/gandiva/precompiled/time_test.cc:103: Failure
Expected equality of these values:
  castTIMESTAMP_utf8(context_ptr, "2000-09-23 9:45:30.920 Canada/Pacific", 37)
    Which is: 0
  969727530920
../src/gandiva/precompiled/time_test.cc:105: Failure
Expected equality of these values:
  castTIMESTAMP_utf8(context_ptr, "2012-02-28 23:30:59 Asia/Kolkata", 32)
    Which is: 0
  1330452059000
../src/gandiva/precompiled/time_test.cc:107: Failure
Expected equality of these values:
  castTIMESTAMP_utf8(context_ptr, "1923-10-07 03:03:03 America/New_York", 36)
    Which is: 0
  -1459094217000
{code}

See build: https://ci.ursalabs.org/#/builders/66/builds/3046/steps/8/logs/stdio

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Created] (ARROW-6024) [Java] Provide more hash algorithms
Liya Fan created ARROW-6024:
---------------------------

Summary: [Java] Provide more hash algorithms
Key: ARROW-6024
URL: https://issues.apache.org/jira/browse/ARROW-6024
Project: Apache Arrow
Issue Type: New Feature
Components: Java
Reporter: Liya Fan
Assignee: Liya Fan

Provide more hash algorithms to choose from for different scenarios.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
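To illustrate what "more hash algorithms" behind a common interface could look like, here is a sketch in the spirit of the ArrowBufHasher mentioned in ARROW-6030 above. It uses java.nio so it stands alone; the interface and class names are illustrative, not the Arrow Java API:

{code:java}
import java.nio.ByteBuffer;

// Sketch only: a pluggable hashing strategy over a buffer region.
interface BufferHasher {
  int hashCode(ByteBuffer buf, int offset, int length);
}

// One candidate algorithm: FNV-1a, cheap and adequate for short keys.
// A murmur-style implementation could be swapped in where distribution
// quality matters more than per-byte cost.
final class Fnv1aHasher implements BufferHasher {
  @Override
  public int hashCode(ByteBuffer buf, int offset, int length) {
    int hash = 0x811C9DC5;          // FNV-1a 32-bit offset basis
    for (int i = 0; i < length; i++) {
      hash ^= buf.get(offset + i) & 0xFF;
      hash *= 0x01000193;           // FNV-1a 32-bit prime
    }
    return hash;
  }
}
{code}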
[jira] [Created] (ARROW-6023) [C++][Gandiva] Add functions in Gandiva
Prudhvi Porandla created ARROW-6023:
------------------------------------

Summary: [C++][Gandiva] Add functions in Gandiva
Key: ARROW-6023
URL: https://issues.apache.org/jira/browse/ARROW-6023
Project: Apache Arrow
Issue Type: Task
Components: C++ - Gandiva
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla
Fix For: 1.0.0

Support the following functions in Gandiva:
# int32 castINT(int64): cast int64 to int32
# float4 castFLOAT4(float8): cast float8 to float4
# int64 truncate(int64, int32 scale): if scale is negative, make the last -scale digits zero
# timestamp add(date, int32 days): add days to date (in milliseconds) and return a timestamp

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
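Of the four, truncate is the least self-explanatory. A plain-Java illustration of the semantics described above (not Gandiva's implementation; identity behavior for non-negative scale is assumed for an int64 input):

{code:java}
// Illustration of int64 truncate(int64, int32 scale) as described above:
// for negative scale, zero out the last -scale decimal digits.
public class TruncateExample {
  static long truncate(long value, int scale) {
    if (scale >= 0) {
      return value;            // assumed: nothing to zero for an int64
    }
    long factor = 1;
    for (int i = 0; i < -scale; i++) {
      factor *= 10;            // factor = 10^(-scale)
    }
    return (value / factor) * factor;
  }

  public static void main(String[] args) {
    System.out.println(truncate(12345L, -2));   // 12300
    System.out.println(truncate(-12345L, -2));  // -12300 (division truncates toward zero)
  }
}
{code}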
[jira] [Created] (ARROW-6022) [Java] Support equals API in ValueVector to compare two vectors equal
Ji Liu created ARROW-6022:
--------------------------

Summary: [Java] Support equals API in ValueVector to compare two vectors equal
Key: ARROW-6022
URL: https://issues.apache.org/jira/browse/ARROW-6022
Project: Apache Arrow
Issue Type: New Feature
Components: Java
Reporter: Ji Liu
Assignee: Ji Liu

In some cases this feature is useful. In ARROW-1184, {{Dictionary#equals}} does not work due to the lack of this API. Moreover, we have already implemented {{equals(int index, ValueVector target, int targetIndex)}}, so this newly added API could reuse it.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
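Since the element-wise {{equals(int index, ValueVector target, int targetIndex)}} already exists per the description, whole-vector comparison can be layered on top of it. A sketch (the method name and placement are hypothetical, not the final API):

{code:java}
import org.apache.arrow.vector.ValueVector;

public class VectorEquality {
  // Sketch: whole-vector equality reusing the existing element-wise
  // equals(index, target, targetIndex) mentioned in the issue.
  static boolean vectorsEqual(ValueVector left, ValueVector right) {
    if (left.getValueCount() != right.getValueCount()) {
      return false;
    }
    for (int i = 0; i < left.getValueCount(); i++) {
      if (!left.equals(i, right, i)) {   // per-element comparison
        return false;
      }
    }
    return true;
  }
}
{code}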
[jira] [Created] (ARROW-6021) [Java] Extract copyFrom and copyFromSafe to ValueVector
Liya Fan created ARROW-6021:
---------------------------

Summary: [Java] Extract copyFrom and copyFromSafe to ValueVector
Key: ARROW-6021
URL: https://issues.apache.org/jira/browse/ARROW-6021
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Liya Fan
Assignee: Liya Fan

Currently we have copyFrom and copyFromSafe methods in fixed-width and variable-width vectors. Extracting them to the common super-interface will make them much more convenient to use, and avoid unnecessary if-else statements.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
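To show the if-else problem the extraction removes, here is a sketch of a call site today versus after the change. The interface-level signature is the proposal itself, so it is assumed here, not confirmed:

{code:java}
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.ValueVector;
import org.apache.arrow.vector.VarCharVector;

public class CopyFromCallSite {
  // Today: copyFromSafe lives on the concrete classes, so generic code
  // needs one instanceof branch per supported vector type.
  static void copyElement(ValueVector to, ValueVector from, int fromIndex, int toIndex) {
    if (to instanceof IntVector) {
      ((IntVector) to).copyFromSafe(fromIndex, toIndex, (IntVector) from);
    } else if (to instanceof VarCharVector) {
      ((VarCharVector) to).copyFromSafe(fromIndex, toIndex, (VarCharVector) from);
    } else {
      throw new UnsupportedOperationException(to.getClass().getName());
    }
  }

  // With the proposed ValueVector-level method (assumed signature), the
  // branches collapse to a single call:
  //   to.copyFromSafe(fromIndex, toIndex, from);
}
{code}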