[jira] [Created] (ARROW-8132) [C++] arrow-s3fs-test failing on master
Hatem Helal created ARROW-8132:

Summary: [C++] arrow-s3fs-test failing on master
Key: ARROW-8132
URL: https://issues.apache.org/jira/browse/ARROW-8132
Project: Apache Arrow
Issue Type: Improvement
Reporter: Hatem Helal

Log: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6096) [C++] Remove dependency on boost regex library
Hatem Helal created ARROW-6096:

Summary: [C++] Remove dependency on boost regex library
Key: ARROW-6096
URL: https://issues.apache.org/jira/browse/ARROW-6096
Project: Apache Arrow
Issue Type: Improvement
Reporter: Hatem Helal
Assignee: Hatem Helal

There appears to be only one place where the boost regex library is used: [cpp/src/parquet/metadata.cc|https://github.com/apache/arrow/blob/eb73b962e42b5ae6983bf026ebf825f1f707e245/cpp/src/parquet/metadata.cc#L32]. I think this can be replaced with the C++11 regex library.
[jira] [Created] (ARROW-6061) [C++] Cannot build libarrow without rapidjson
Hatem Helal created ARROW-6061:

Summary: [C++] Cannot build libarrow without rapidjson
Key: ARROW-6061
URL: https://issues.apache.org/jira/browse/ARROW-6061
Project: Apache Arrow
Issue Type: Bug
Reporter: Hatem Helal
Assignee: Hatem Helal

{code}
arrow/cpp/src/arrow/json/chunker.cc:25:30: fatal error: rapidjson/reader.h: No such file or directory
 #include "rapidjson/reader.h"
compilation terminated.
{code}
[jira] [Created] (ARROW-5676) [CI] hadolint failing on r/Dockerfile causing Travis "Lint, Release tests" failure
Hatem Helal created ARROW-5676:

Summary: [CI] hadolint failing on r/Dockerfile causing Travis "Lint, Release tests" failure
Key: ARROW-5676
URL: https://issues.apache.org/jira/browse/ARROW-5676
Project: Apache Arrow
Issue Type: Bug
Components: Continuous Integration
Reporter: Hatem Helal

See https://travis-ci.org/apache/arrow/jobs/548674391#L544
[jira] [Created] (ARROW-5675) [Doc] Fix typo in documentation describing compile/debug workflow on macOS with Xcode IDE
Hatem Helal created ARROW-5675:

Summary: [Doc] Fix typo in documentation describing compile/debug workflow on macOS with Xcode IDE
Key: ARROW-5675
URL: https://issues.apache.org/jira/browse/ARROW-5675
Project: Apache Arrow
Issue Type: Bug
Components: Documentation
Reporter: Hatem Helal
Assignee: Hatem Helal

See https://github.com/apache/arrow/pull/4596#discussion_r296093152
[jira] [Created] (ARROW-5638) [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are enabled
Hatem Helal created ARROW-5638:

Summary: [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are enabled
Key: ARROW-5638
URL: https://issues.apache.org/jira/browse/ARROW-5638
Project: Apache Arrow
Issue Type: Bug
Reporter: Hatem Helal

See the comment with the error here: https://github.com/apache/arrow/pull/4596#issuecomment-502954709
[jira] [Created] (ARROW-5632) [Doc] Add some documentation describing compile/debug workflow on macOS with Xcode IDE
Hatem Helal created ARROW-5632:

Summary: [Doc] Add some documentation describing compile/debug workflow on macOS with Xcode IDE
Key: ARROW-5632
URL: https://issues.apache.org/jira/browse/ARROW-5632
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation
Reporter: Hatem Helal
Assignee: Hatem Helal
[jira] [Created] (ARROW-5608) [C++][parquet] Invalid memory access when using parquet::arrow::ColumnReader
Hatem Helal created ARROW-5608:

Summary: [C++][Parquet] Invalid memory access when using parquet::arrow::ColumnReader
Key: ARROW-5608
URL: https://issues.apache.org/jira/browse/ARROW-5608
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Hatem Helal
Assignee: Hatem Helal

I've observed occasional crashes when using {{parquet::arrow::ColumnReader}} to iteratively read a fixed number of records. This has been quite tricky to isolate, but compiling the attached version of parquet-arrow-example with ASAN pointed me to an out-of-bounds access at [cpp/src/parquet/arrow/record_reader.cc#L356|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/record_reader.cc#L356].

ASAN stack trace:

{code}
==18666==ERROR: AddressSanitizer: global-buffer-overflow on address 0x00010c1b3038 at pc 0x000108330bdd bp 0x7ffee8d16450 sp 0x7ffee8d15c00
READ of size 198 at 0x00010c1b3038 thread T0
    #0 0x108330bdc in __asan_memmove (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x54bdc)
    #1 0x107205e96 in parquet::internal::RecordReader::RecordReaderImpl::Reset() algorithm:1828
    #2 0x107205813 in parquet::internal::RecordReader::Reset() record_reader.cc:932
    #3 0x106faea47 in parquet::arrow::PrimitiveImpl::NextBatch(long long, std::__1::shared_ptr*) reader.cc:1549
    #4 0x106f6e69b in parquet::arrow::ColumnReader::NextBatch(long long, std::__1::shared_ptr*) reader.cc:1665
    #5 0x106f06afe in read_column_iterative() reader-writer.cc:162
    #6 0x106f09e9a in main reader-writer.cc:174
    #7 0x7fff79472ed8 in start (libdyld.dylib:x86_64+0x16ed8)
{code}
[jira] [Created] (ARROW-5157) [Website] Add MATLAB to powered by Apache Arrow page
Hatem Helal created ARROW-5157:

Summary: [Website] Add MATLAB to powered by Apache Arrow page
Key: ARROW-5157
URL: https://issues.apache.org/jira/browse/ARROW-5157
Project: Apache Arrow
Issue Type: Improvement
Components: Website
Reporter: Hatem Helal
Assignee: Hatem Helal

MATLAB recently shipped R2019a with built-in support for Apache Parquet files, and we used Arrow in the implementation.
[jira] [Created] (ARROW-4785) [CI] Make Travis CI resilient against GPG errors
Hatem Helal created ARROW-4785:

Summary: [CI] Make Travis CI resilient against GPG errors
Key: ARROW-4785
URL: https://issues.apache.org/jira/browse/ARROW-4785
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration
Reporter: Hatem Helal

Travis jobs sometimes fail with a GPG error:

{code}
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://packagecloud.io/github/git-lfs/ubuntu trusty InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 6B05F25D762E3157
W: Failed to fetch https://packagecloud.io/github/git-lfs/ubuntu/dists/trusty/InRelease  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 6B05F25D762E3157
E: Failed to fetch http://security.ubuntu.com/ubuntu/dists/trusty-security/main/binary-i386/Packages.gz  Hash Sum mismatch
W: Some index files failed to download. They have been ignored, or old ones used instead.
The command "if [ $TRAVIS_OS_NAME == "linux" ]; then
  sudo bash -c "echo -e 'Acquire::Retries 10; Acquire::http::Timeout \"20\";' > /etc/apt/apt.conf.d/99-travis-retry"
  sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
  sudo apt-get update -qq
fi
" failed and exited with 100 during .

Your build has been stopped.
{code}

It would be nice if the number of retries, the timeout, or both could be increased to make the Travis jobs more resilient to this seemingly sporadic issue.
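As a sketch of the change the report suggests, the apt settings already written to {{99-travis-retry}} in the build script could simply be raised; the values below are illustrative, not tested recommendations:

```
# /etc/apt/apt.conf.d/99-travis-retry -- illustrative values only
Acquire::Retries "20";
Acquire::http::Timeout "60";
```

Note this only retries the fetch; it would not fix the NO_PUBKEY error itself, which needs the missing key imported or the offending repository removed.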
[jira] [Created] (ARROW-4661) [C++] Consolidate random string generators for use in benchmarks and unittests
Hatem Helal created ARROW-4661:

Summary: [C++] Consolidate random string generators for use in benchmarks and unittests
Key: ARROW-4661
URL: https://issues.apache.org/jira/browse/ARROW-4661
Project: Apache Arrow
Issue Type: Improvement
Reporter: Hatem Helal
Assignee: Hatem Helal
Fix For: 0.14.0

This was discussed here: https://github.com/apache/arrow/pull/3721

For testing/benchmarking dictionary encoding, it's useful to control the number of repeated values, and it would also be good to optionally include null values.
[jira] [Created] (ARROW-4260) [Python] test_serialize_deserialize_pandas is failing on OSX with Xcode 6.4
Hatem Helal created ARROW-4260:

Summary: [Python] test_serialize_deserialize_pandas is failing on OSX with Xcode 6.4
Key: ARROW-4260
URL: https://issues.apache.org/jira/browse/ARROW-4260
Project: Apache Arrow
Issue Type: Bug
Reporter: Hatem Helal

See https://travis-ci.org/apache/arrow/jobs/479378190#L2427
[jira] [Created] (ARROW-4156) [C++] xcodebuild failure for cmake generated project
Hatem Helal created ARROW-4156:

Summary: [C++] xcodebuild failure for cmake generated project
Key: ARROW-4156
URL: https://issues.apache.org/jira/browse/ARROW-4156
Project: Apache Arrow
Issue Type: Wish
Reporter: Hatem Helal
Assignee: Uwe L. Korn

Using the cmake Xcode project generator, the generated project fails to build with xcodebuild as follows:

{code}
$ cmake .. -G Xcode -DARROW_PARQUET=ON -DPARQUET_BUILD_EXECUTABLES=ON -DPARQUET_BUILD_EXAMPLES=ON -DFLATBUFFERS_HOME=/usr/local/Cellar/flatbuffers/1.10.0 -DCMAKE_BUILD_TYPE=Debug -DTHRIFT_HOME=/usr/local/Cellar/thrift/0.11.0 -DARROW_EXTRA_ERROR_CONTEXT=ON -DARROW_BUILD_TESTS=ON -DClangTools_PATH=/usr/local/Cellar/llvm@6/6.0.1_1

Libtool xcode-build/src/arrow/arrow.build/Debug/arrow_objlib.build/Objects-normal/libarrow_objlib.a normal x86_64
    cd /Users/hhelal/Documents/code/arrow/cpp
    export MACOSX_DEPLOYMENT_TARGET=10.14
    /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/libtool -static -arch_only x86_64 -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -L/Users/hhelal/Documents/code/arrow/cpp/xcode-build/src/arrow/arrow.build/Debug/arrow_objlib.build/Objects-normal -filelist /Users/hhelal/Documents/code/arrow/cpp/xcode-build/src/arrow/arrow.build/Debug/arrow_objlib.build/Objects-normal/x86_64/arrow_objlib.LinkFileList -o /Users/hhelal/Documents/code/arrow/cpp/xcode-build/src/arrow/arrow.build/Debug/arrow_objlib.build/Objects-normal/libarrow_objlib.a

PhaseScriptExecution CMake\ PostBuild\ Rules xcode-build/src/arrow/arrow.build/Debug/arrow_objlib.build/Script-2604120B03B14AB58C2E586A.sh
    cd /Users/hhelal/Documents/code/arrow/cpp
    /bin/sh -c /Users/hhelal/Documents/code/arrow/cpp/xcode-build/src/arrow/arrow.build/Debug/arrow_objlib.build/Script-2604120B03B14AB58C2E586A.sh
echo "Depend check for xcode"
Depend check for xcode
cd /Users/hhelal/Documents/code/arrow/cpp/xcode-build && make -C /Users/hhelal/Documents/code/arrow/cpp/xcode-build -f /Users/hhelal/Documents/code/arrow/cpp/xcode-build/CMakeScripts/XCODE_DEPEND_HELPER.make PostBuild.arrow_objlib.Debug
/bin/rm -f /Users/hhelal/Documents/code/arrow/cpp/xcode-build/debug/Debug/libarrow.dylib
/bin/rm -f /Users/hhelal/Documents/code/arrow/cpp/xcode-build/debug/Debug/libarrow.a

=== BUILD TARGET arrow_shared OF PROJECT arrow WITH THE DEFAULT CONFIGURATION (Debug) ===

Check dependencies

Write auxiliary files
write-file /Users/hhelal/Documents/code/arrow/cpp/xcode-build/src/arrow/arrow.build/Debug/arrow_shared.build/Script-9AFD4DDD88034C5F965570DF.sh
chmod 0755 /Users/hhelal/Documents/code/arrow/cpp/xcode-build/src/arrow/arrow.build/Debug/arrow_shared.build/Script-9AFD4DDD88034C5F965570DF.sh

PhaseScriptExecution CMake\ PostBuild\ Rules xcode-build/src/arrow/arrow.build/Debug/arrow_shared.build/Script-9AFD4DDD88034C5F965570DF.sh
    cd /Users/hhelal/Documents/code/arrow/cpp
    /bin/sh -c /Users/hhelal/Documents/code/arrow/cpp/xcode-build/src/arrow/arrow.build/Debug/arrow_shared.build/Script-9AFD4DDD88034C5F965570DF.sh
echo "Creating symlinks"
Creating symlinks
/usr/local/Cellar/cmake/3.12.4/bin/cmake -E cmake_symlink_library /Users/hhelal/Documents/code/arrow/cpp/xcode-build/debug/Debug/libarrow.12.0.0.dylib /Users/hhelal/Documents/code/arrow/cpp/xcode-build/debug/Debug/libarrow.12.dylib /Users/hhelal/Documents/code/arrow/cpp/xcode-build/debug/Debug/libarrow.dylib
CMake Error: cmake_symlink_library: System Error: No such file or directory
CMake Error: cmake_symlink_library: System Error: No such file or directory
make: *** [arrow_shared_buildpart_0] Error 1

** BUILD FAILED **

The following build commands failed:
    PhaseScriptExecution CMake\ PostBuild\ Rules xcode-build/src/arrow/arrow.build/Debug/arrow_shared.build/Script-9AFD4DDD88034C5F965570DF.sh
(1 failure)
{code}
[jira] [Created] (ARROW-3564) pyarrow: writing version 2.0 parquet format with dictionary encoding enabled
Hatem Helal created ARROW-3564:

Summary: pyarrow: writing version 2.0 parquet format with dictionary encoding enabled
Key: ARROW-3564
URL: https://issues.apache.org/jira/browse/ARROW-3564
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 0.11.0
Reporter: Hatem Helal
Attachments: example_v1.0_dict_False.parquet, example_v1.0_dict_True.parquet, example_v2.0_dict_False.parquet, example_v2.0_dict_True.parquet, pyarrow_repro.py

Using pyarrow v0.11.0, the following script writes a simple table (lifted from the [pyarrow doc|https://arrow.apache.org/docs/python/parquet.html]) to both parquet format versions 1.0 and 2.0, with and without dictionary encoding enabled.

{code:python}
import itertools

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)

use_dict = [True, False]
version = ['1.0', '2.0']
for tf, v in itertools.product(use_dict, version):
    filename = 'example_v' + v + '_dict_' + str(tf) + '.parquet'
    pq.write_table(table, filename, use_dictionary=tf, version=v)
{code}

Inspecting the written files using [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] appears to show that dictionary encoding is not used in either of the version 2.0 files. Both files report that the columns are encoded using {{PLAIN,RLE}} and that the dictionary page offset is zero. I was expecting the column encoding to include {{RLE_DICTIONARY}}.
Attached are the script with repro steps and the files it generated. Below is the output of {{parquet-tools meta}} on the version 2.0 files.

{panel:title=version='2.0', use_dictionary=True}
{code}
% parquet-tools meta example_v2.0_dict_True.parquet
file:        file:.../example_v2.0_dict_True.parquet
creator:     parquet-cpp version 1.5.1-SNAPSHOT
extra:       pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}

file schema: schema
one:               OPTIONAL DOUBLE R:0 D:1
three:             OPTIONAL BOOLEAN R:0 D:1
two:               OPTIONAL BINARY R:0 D:1
__index_level_0__: OPTIONAL BINARY R:0 D:1

row group 1: RC:3 TS:211 OFFSET:4
one:               DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
three:             BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
two:               BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
__index_level_0__: BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
{code}
{panel}

{panel:title=version='2.0', use_dictionary=False}
{code}
% parquet-tools meta example_v2.0_dict_False.parquet
file:        file:.../example_v2.0_dict_False.parquet
creator:     parquet-cpp version 1.5.1-SNAPSHOT
extra:       pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null,
{code}
{panel}