[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly
[ https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708321#comment-16708321 ] Suvayu Ali commented on ARROW-3874: --- Since I'm using {{java-1.8.0-openjdk}}, I had to install {{java-1.8.0-openjdk-devel}} to get {{jni.h}}. For other java versions on F29, it should be {{java--openjdk-devel}}. > [Gandiva] Cannot build: LLVM not detected correctly > --- > > Key: ARROW-3874 > URL: https://issues.apache.org/jira/browse/ARROW-3874 > Project: Apache Arrow > Issue Type: Bug > Components: Gandiva >Affects Versions: 0.12.0 > Environment: Fedora 29, master (1013a1dc) > gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5) > llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos) > cmake version 3.12.1 >Reporter: Suvayu Ali >Assignee: Suvayu Ali >Priority: Major > Labels: cmake, pull-request-available > Fix For: 0.12.0 > > Attachments: CMakeError.log, CMakeOutput.log, > arrow-cmake-findllvm.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while > detecting LLVM on the system. > {code} > $ cd build/data-an/arrow/arrow/cpp/ > $ export ARROW_HOME=/opt/data-an > $ mkdir release > $ cd release/ > $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DARROW_GANDIVA=ON ../ > [...] > -- Found LLVM 6.0.1 > -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm > CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message): > Target X86 is not in the set of libraries. > Call Stack (most recent call first): > cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames) > src/gandiva/CMakeLists.txt:25 (find_package) > -- Configuring incomplete, errors occurred! > {code} > The cmake log files are attached. > When I invoke cmake with options other than *Gandiva*, it finishes > successfully. 
> Here are the llvm libraries that are installed on my system: > {code} > $ rpm -qa llvm\* | sort > llvm3.9-libs-3.9.1-13.fc28.x86_64 > llvm4.0-libs-4.0.1-5.fc28.x86_64 > llvm-6.0.1-8.fc28.x86_64 > llvm-devel-6.0.1-8.fc28.x86_64 > llvm-libs-6.0.1-8.fc28.i686 > llvm-libs-6.0.1-8.fc28.x86_64 > $ ls /usr/lib64/libLLVM* /usr/include/llvm > /usr/lib64/libLLVM-6.0.1.so /usr/lib64/libLLVM-6.0.so /usr/lib64/libLLVM.so > /usr/include/llvm: > ADT FuzzMutate Object Support > Analysis InitializePasses.h ObjectYAML TableGen > AsmParser IR Option Target > BinaryFormat IRReader PassAnalysisSupport.h Testing > Bitcode LineEditor Passes ToolDrivers > CodeGen LinkAllIR.h Pass.h Transforms > Config LinkAllPasses.h PassInfo.h WindowsManifest > DebugInfo Linker PassRegistry.h WindowsResource > Demangle LTO PassSupport.h XRay > ExecutionEngine MC ProfileData > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
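The CMake error above ("Target X86 is not in the set of libraries") boils down to a set-membership check: `llvm_map_components_to_libnames` maps each requested LLVM component to a library, and fails when the installed LLVM package does not provide one of them. A minimal, hypothetical Python sketch of that check (the component names below are illustrative, not the exact list Arrow requests):

```python
# Hypothetical sketch of the component check that CMake's
# llvm_map_components_to_libnames performs: the configure step fails
# when a requested component (here "x86") is absent from the set the
# installed LLVM actually provides.
def missing_components(requested, available):
    """Return the requested LLVM components that the install lacks."""
    return sorted(set(requested) - set(available))

# A stripped-down llvm-libs install may ship without per-target
# components such as "x86", triggering the error quoted above.
requested = ["core", "mcjit", "native", "x86"]
available = ["core", "mcjit", "native"]  # hypothetical stripped install

print(missing_components(requested, available))  # ['x86']
```

In practice, `llvm-config --components` on the affected machine shows which components the distribution package really ships.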
[jira] [Assigned] (ARROW-2323) [JS] Document JavaScript release management
[ https://issues.apache.org/jira/browse/ARROW-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette reassigned ARROW-2323: Assignee: Brian Hulette > [JS] Document JavaScript release management > --- > > Key: ARROW-2323 > URL: https://issues.apache.org/jira/browse/ARROW-2323 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Assignee: Brian Hulette >Priority: Major > Fix For: JS-0.4.0 > > > The JavaScript post-vote release management process is not documented. For > example, there is certain NPM-related steps required to be able to publish > artifacts after the release vote has taken place. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2984) [JS] Refactor release verification script to share code with main source release verification script
[ https://issues.apache.org/jira/browse/ARROW-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-2984: - Fix Version/s: (was: JS-0.4.0) JS-0.5.0 > [JS] Refactor release verification script to share code with main source > release verification script > > > Key: ARROW-2984 > URL: https://issues.apache.org/jira/browse/ARROW-2984 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: JS-0.5.0 > > > There is some possible code duplication. See discussion in ARROW-2977 > https://github.com/apache/arrow/pull/2369 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3892) [JS] Remove any dependency on compromised NPM flatmap-stream package
[ https://issues.apache.org/jira/browse/ARROW-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette reassigned ARROW-3892: Assignee: Brian Hulette > [JS] Remove any dependency on compromised NPM flatmap-stream package > > > Key: ARROW-3892 > URL: https://issues.apache.org/jira/browse/ARROW-3892 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Reporter: Wes McKinney >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > > We are erroring out as the result of > https://github.com/dominictarr/event-stream/issues/116 > {code} > npm ERR! code ENOVERSIONS > npm ERR! No valid versions available for flatmap-stream > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3892) [JS] Remove any dependency on compromised NPM flatmap-stream package
[ https://issues.apache.org/jira/browse/ARROW-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3892: -- Labels: pull-request-available (was: ) > [JS] Remove any dependency on compromised NPM flatmap-stream package > > > Key: ARROW-3892 > URL: https://issues.apache.org/jira/browse/ARROW-3892 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Reporter: Wes McKinney >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > > We are erroring out as the result of > https://github.com/dominictarr/event-stream/issues/116 > {code} > npm ERR! code ENOVERSIONS > npm ERR! No valid versions available for flatmap-stream > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (ARROW-3834) [Doc] Merge Python & C++ and move to top-level
[ https://issues.apache.org/jira/browse/ARROW-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reopened ARROW-3834: - > [Doc] Merge Python & C++ and move to top-level > -- > > Key: ARROW-3834 > URL: https://issues.apache.org/jira/browse/ARROW-3834 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 4.5h > Remaining Estimate: 0h > > Merge the C++, Python and Format documentation and move it to the top-level > folder. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column
[ https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708149#comment-16708149 ] Brian Hulette commented on ARROW-3667: -- Makes sense, thanks for the context. Maybe I'll start a discussion on the mailing list to define how we represent the null datatype in JSON. > [JS] Incorrectly reads record batches with an all null column > - > > Key: ARROW-3667 > URL: https://issues.apache.org/jira/browse/ARROW-3667 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: JS-0.3.1 >Reporter: Brian Hulette >Priority: Major > Fix For: JS-0.4.0 > > > The JS library seems to incorrectly read any columns that come after an > all-null column in IPC buffers produced by pyarrow. > Here's a python script that generates two arrow buffers, one with an all-null > column followed by a utf-8 column, and a second with those two reversed > {code:python} > import pyarrow as pa > import pandas as pd > def serialize_to_arrow(df, fd, compress=True): > batch = pa.RecordBatch.from_pandas(df) > writer = pa.RecordBatchFileWriter(fd, batch.schema) > writer.write_batch(batch) > writer.close() > if __name__ == "__main__": > df = pd.DataFrame(data={'nulls': [None, None, None], 'not nulls': ['abc', > 'def', 'ghi']}, columns=['nulls', 'not nulls']) > with open('bad.arrow', 'wb') as fd: > serialize_to_arrow(df, fd) > df = pd.DataFrame(df, columns=['not nulls', 'nulls']) > with open('good.arrow', 'wb') as fd: > serialize_to_arrow(df, fd) > {code} > JS incorrectly interprets the [null, not null] case: > {code:javascript} > > var arrow = require('apache-arrow') > undefined > > var fs = require('fs') > undefined > > arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not > > nulls').get(0) > 'abc' > > arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0) > '\u\u\u\u\u0003\u\u\u\u0006\u\u\u\t\u\u\u' > {code} > Presumably this is because pyarrow is omitting some (or all) of the buffers > associated with the 
all-null column, but the JS IPC reader is still looking > for them, causing the buffer count to get out of sync. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
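The failure mode hypothesized at the end of the issue, a reader and writer disagreeing on how many buffers an all-null column contributes, can be simulated with plain lists. This is a hedged illustration of the bookkeeping bug, not the actual JS reader code; the buffer names and counts are made up:

```python
# Minimal simulation of the buffer-count mismatch: an IPC reader walks a
# flat buffer list, consuming a fixed number of buffers per column. If
# the writer omitted the all-null column's buffers but the reader still
# reserves slots for them, every later column picks up the wrong buffers.
def read_columns(buffers, buffers_per_column):
    """Assign buffers to columns strictly in order."""
    out, i = {}, 0
    for name, n in buffers_per_column:
        out[name] = buffers[i:i + n]
        i += n
    return out

# Writer emitted no buffers for 'nulls'; reader still expects 2 of them,
# so 'not nulls' only sees the tail of the buffer list.
written = ["not-nulls-validity", "not-nulls-offsets", "not-nulls-data"]
cols = read_columns(written, [("nulls", 2), ("not nulls", 3)])
print(cols["not nulls"])  # ['not-nulls-data'] -- misaligned
```

The garbage string in the bug report is consistent with this: the reader decodes an offsets buffer as if it were character data.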
[jira] [Resolved] (ARROW-3834) [Doc] Merge Python & C++ and move to top-level
[ https://issues.apache.org/jira/browse/ARROW-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3834. - Resolution: Fixed Issue resolved by pull request 2856 [https://github.com/apache/arrow/pull/2856] > [Doc] Merge Python & C++ and move to top-level > -- > > Key: ARROW-3834 > URL: https://issues.apache.org/jira/browse/ARROW-3834 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 4h > Remaining Estimate: 0h > > Merge the C++, Python and Format documentation and move it to the top-level > folder. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column
[ https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-3667: - Fix Version/s: (was: JS-0.4.0) JS-0.5.0 > [JS] Incorrectly reads record batches with an all null column > - > > Key: ARROW-3667 > URL: https://issues.apache.org/jira/browse/ARROW-3667 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: JS-0.3.1 >Reporter: Brian Hulette >Priority: Major > Fix For: JS-0.5.0 > > > The JS library seems to incorrectly read any columns that come after an > all-null column in IPC buffers produced by pyarrow. > Here's a python script that generates two arrow buffers, one with an all-null > column followed by a utf-8 column, and a second with those two reversed > {code:python} > import pyarrow as pa > import pandas as pd > def serialize_to_arrow(df, fd, compress=True): > batch = pa.RecordBatch.from_pandas(df) > writer = pa.RecordBatchFileWriter(fd, batch.schema) > writer.write_batch(batch) > writer.close() > if __name__ == "__main__": > df = pd.DataFrame(data={'nulls': [None, None, None], 'not nulls': ['abc', > 'def', 'ghi']}, columns=['nulls', 'not nulls']) > with open('bad.arrow', 'wb') as fd: > serialize_to_arrow(df, fd) > df = pd.DataFrame(df, columns=['not nulls', 'nulls']) > with open('good.arrow', 'wb') as fd: > serialize_to_arrow(df, fd) > {code} > JS incorrectly interprets the [null, not null] case: > {code:javascript} > > var arrow = require('apache-arrow') > undefined > > var fs = require('fs') > undefined > > arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not > > nulls').get(0) > 'abc' > > arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0) > '\u\u\u\u\u0003\u\u\u\u0006\u\u\u\t\u\u\u' > {code} > Presumably this is because pyarrow is omitting some (or all) of the buffers > associated with the all-null column, but the JS IPC reader is still looking > for them, causing the buffer count to get out of sync. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-951) [JS] Fix generated API documentation
[ https://issues.apache.org/jira/browse/ARROW-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-951: Fix Version/s: (was: JS-0.4.0) JS-0.5.0 > [JS] Fix generated API documentation > > > Key: ARROW-951 > URL: https://issues.apache.org/jira/browse/ARROW-951 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Brian Hulette >Priority: Minor > Labels: documentation > Fix For: JS-0.5.0 > > > The current generated API documentation doesn't respect the project's > namespaces, it simply lists all exported objects. We should see if we can > make typedoc display the project's structure (even if it means re-structuring > the code a bit), or find another approach for doc generation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3337) [JS] IPC writer doesn't serialize the dictionary of nested Vectors
[ https://issues.apache.org/jira/browse/ARROW-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-3337: - Fix Version/s: (was: JS-0.4.0) JS-0.5.0 > [JS] IPC writer doesn't serialize the dictionary of nested Vectors > -- > > Key: ARROW-3337 > URL: https://issues.apache.org/jira/browse/ARROW-3337 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Affects Versions: JS-0.3.1 >Reporter: Paul Taylor >Assignee: Paul Taylor >Priority: Major > Fix For: JS-0.5.0 > > > The JS writer only serializes dictionaries for [top-level > children|https://github.com/apache/arrow/blob/ee9b1ba426e2f1f117cde8d8f4ba6fbe3be5674c/js/src/ipc/writer/binary.ts#L40] > of a Table. This is wrong, and an oversight on my part. The fix here is to > put the actual Dictionary vectors in the `schema.dictionaries` map instead of > the dictionary fields, like I understand the C++ does. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2860) [Python] Null values in a single partition of Parquet dataset, results in invalid schema on read
[ https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708141#comment-16708141 ] Wes McKinney commented on ARROW-2860: - Thanks for checking that. There's a couple of related issues that may be the same thing > [Python] Null values in a single partition of Parquet dataset, results in > invalid schema on read > > > Key: ARROW-2860 > URL: https://issues.apache.org/jira/browse/ARROW-2860 > Project: Apache Arrow > Issue Type: Bug >Reporter: Sam Oluwalana >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > from datetime import datetime, timedelta > def generate_data(event_type, event_id, offset=0): > """Generate data.""" > now = datetime.utcnow() + timedelta(seconds=offset) > obj = { > 'event_type': event_type, > 'event_id': event_id, > 'event_date': now.date(), > 'foo': None, > 'bar': u'hello', > } > if event_type == 2: > obj['foo'] = 1 > obj['bar'] = u'world' > if event_type == 3: > obj['different'] = u'data' > obj['bar'] = u'event type 3' > else: > obj['different'] = None > return obj > data = [ > generate_data(1, 1, 1), > generate_data(1, 1, 3600 * 72), > generate_data(2, 1, 1), > generate_data(2, 1, 3600 * 72), > generate_data(3, 1, 1), > generate_data(3, 1, 3600 * 72), > ] > df = pd.DataFrame.from_records(data, index='event_id') > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, root_path='/tmp/events', > partition_cols=['event_type', 'event_date']) > dataset = pq.ParquetDataset('/tmp/events') > table = dataset.read() > print(table.num_rows) > {code} > Expected output: > {code:python} > 6 > {code} > Actual: > {code:python} > python example_failure.py > Traceback (most recent call last): > File "example_failure.py", line 43, in > dataset = pq.ParquetDataset('/tmp/events') > File > "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", > line 745, in 
__init__ > self.validate_schemas() > File > "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", > line 775, in validate_schemas > dataset_schema)) > ValueError: Schema in partition[event_type=2, event_date=0] > /tmp/events/event_type=3/event_date=2018-07-16 > 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different. > bar: string > different: string > foo: double > event_id: int64 > metadata > > {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], > "columns": [{"metadata": null, "field_name": "bar", "name": "bar", > "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, > "field_name": "different", "name": "different", "numpy_type": "object", > "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": > "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, > "field_name": "event_id", "name": "event_id", "numpy_type": "int64", > "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": > null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'} > vs > bar: string > different: null > foo: double > event_id: int64 > metadata > > {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], > "columns": [{"metadata": null, "field_name": "bar", "name": "bar", > "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, > "field_name": "different", "name": "different", "numpy_type": "object", > "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": > "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, > "field_name": "event_id", "name": "event_id", "numpy_type": "int64", > "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": > null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'} > {code} > Apparently what is happening is that pyarrow is interpreting the schema from > each of the partitions individually and the 
partitions for `event_type=3 / > event_date=*` both have values for the column `different` whereas the other > columns do not. The discrepancy causes the `None` values of the other > partitions to be labeled as `pandas_type` `empty` instead of `unicode`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
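The traceback above comes from comparing per-partition schemas that were each inferred independently. A hedged sketch of that `validate_schemas()` idea, with schemas reduced to plain name-to-type dicts rather than pyarrow's actual schema objects:

```python
# Sketch of per-partition schema validation: a column that is all-None
# in one partition is inferred as 'null' there but 'string' where values
# exist, so the dataset-level comparison rejects it. Types here are
# illustrative strings, not pyarrow types.
def validate_schemas(partition_schemas):
    reference = partition_schemas[0]
    for schema in partition_schemas[1:]:
        if schema != reference:
            mismatched = {k for k in reference if schema.get(k) != reference[k]}
            raise ValueError(f"Schema differs in fields: {sorted(mismatched)}")

schemas = [
    {"bar": "string", "different": "string"},  # partition with values
    {"bar": "string", "different": "null"},    # all-None partition
]
try:
    validate_schemas(schemas)
except ValueError as e:
    print(e)  # Schema differs in fields: ['different']
```

This matches the report: only the `event_type=3` partitions have values in `different`, so only they infer it as `unicode`/`string`.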
[jira] [Resolved] (ARROW-2909) [JS] Add convenience function for creating a table from a list of vectors
[ https://issues.apache.org/jira/browse/ARROW-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette resolved ARROW-2909. -- Resolution: Fixed Issue resolved by pull request 2322 [https://github.com/apache/arrow/pull/2322] > [JS] Add convenience function for creating a table from a list of vectors > - > > Key: ARROW-2909 > URL: https://issues.apache.org/jira/browse/ARROW-2909 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Similar to ARROW-2766, but requires users to first turn their arrays into > vectors, so we don't have to deduce type. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2860) [Python] Null values in a single partition of Parquet dataset, results in invalid schema on read
[ https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708137#comment-16708137 ] Tanya Schlusser commented on ARROW-2860: I think this was resolved with https://issues.apache.org/jira/browse/ARROW-2891 pull request 2302 [https://github.com/apache/arrow/pull/2302] When I run {{example_failure.py}} it does not fail and returns the expected result. > [Python] Null values in a single partition of Parquet dataset, results in > invalid schema on read > > > Key: ARROW-2860 > URL: https://issues.apache.org/jira/browse/ARROW-2860 > Project: Apache Arrow > Issue Type: Bug >Reporter: Sam Oluwalana >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > from datetime import datetime, timedelta > def generate_data(event_type, event_id, offset=0): > """Generate data.""" > now = datetime.utcnow() + timedelta(seconds=offset) > obj = { > 'event_type': event_type, > 'event_id': event_id, > 'event_date': now.date(), > 'foo': None, > 'bar': u'hello', > } > if event_type == 2: > obj['foo'] = 1 > obj['bar'] = u'world' > if event_type == 3: > obj['different'] = u'data' > obj['bar'] = u'event type 3' > else: > obj['different'] = None > return obj > data = [ > generate_data(1, 1, 1), > generate_data(1, 1, 3600 * 72), > generate_data(2, 1, 1), > generate_data(2, 1, 3600 * 72), > generate_data(3, 1, 1), > generate_data(3, 1, 3600 * 72), > ] > df = pd.DataFrame.from_records(data, index='event_id') > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, root_path='/tmp/events', > partition_cols=['event_type', 'event_date']) > dataset = pq.ParquetDataset('/tmp/events') > table = dataset.read() > print(table.num_rows) > {code} > Expected output: > {code:python} > 6 > {code} > Actual: > {code:python} > python example_failure.py > Traceback (most recent call last): > File "example_failure.py", line 43, in > dataset = 
pq.ParquetDataset('/tmp/events') > File > "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", > line 745, in __init__ > self.validate_schemas() > File > "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", > line 775, in validate_schemas > dataset_schema)) > ValueError: Schema in partition[event_type=2, event_date=0] > /tmp/events/event_type=3/event_date=2018-07-16 > 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different. > bar: string > different: string > foo: double > event_id: int64 > metadata > > {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], > "columns": [{"metadata": null, "field_name": "bar", "name": "bar", > "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, > "field_name": "different", "name": "different", "numpy_type": "object", > "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": > "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, > "field_name": "event_id", "name": "event_id", "numpy_type": "int64", > "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": > null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'} > vs > bar: string > different: null > foo: double > event_id: int64 > metadata > > {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], > "columns": [{"metadata": null, "field_name": "bar", "name": "bar", > "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, > "field_name": "different", "name": "different", "numpy_type": "object", > "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": > "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, > "field_name": "event_id", "name": "event_id", "numpy_type": "int64", > "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": > null, "name": null, "numpy_type": "object", "pandas_type": 
"bytes"}]}'} > {code} > Apparently what is happening is that pyarrow is interpreting the schema from > each of the partitions individually and the partitions for `event_type=3 / > event_date=*` both have values for the column `different` whereas the other > columns do not. The discrepancy causes the `None` values of the other > partitions to be labeled as `pandas_type` `empty` instead of `unicode`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD
[ https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3933: Labels: parquet (was: ) > [Python] Segfault reading Parquet files from GNOMAD > --- > > Key: ARROW-3933 > URL: https://issues.apache.org/jira/browse/ARROW-3933 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Ubuntu 18.04 or Mac OS X >Reporter: David Konerding >Priority: Minor > Labels: parquet > Fix For: 0.12.0 > > > I am getting segfault trying to run a basic program Ubuntu 18.04 VM (AWS). > Error also occurs out of box on Mac OS X. > $ sudo snap install --classic google-cloud-sdk > $ gsutil cp > gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet > . > $ conda install pyarrow > $ python test.py > Segmentation fault (core dumped) > test.py: > import pyarrow.parquet as pq > path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet" > pq.read_table(path) > gdb output: > Thread 3 "python" received signal SIGSEGV, Segmentation fault. > [Switching to Thread 0x7fffdf199700 (LWP 13703)] > 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, > unsigned long*) () from > /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11 > I tested fastparquet, it reads the file just fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD
[ https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3933: Fix Version/s: 0.12.0 > [Python] Segfault reading Parquet files from GNOMAD > --- > > Key: ARROW-3933 > URL: https://issues.apache.org/jira/browse/ARROW-3933 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Ubuntu 18.04 or Mac OS X >Reporter: David Konerding >Priority: Minor > Labels: parquet > Fix For: 0.12.0 > > > I am getting segfault trying to run a basic program Ubuntu 18.04 VM (AWS). > Error also occurs out of box on Mac OS X. > $ sudo snap install --classic google-cloud-sdk > $ gsutil cp > gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet > . > $ conda install pyarrow > $ python test.py > Segmentation fault (core dumped) > test.py: > import pyarrow.parquet as pq > path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet" > pq.read_table(path) > gdb output: > Thread 3 "python" received signal SIGSEGV, Segmentation fault. > [Switching to Thread 0x7fffdf199700 (LWP 13703)] > 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, > unsigned long*) () from > /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11 > I tested fastparquet, it reads the file just fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD
[ https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3933: Summary: [Python] Segfault reading Parquet files from GNOMAD (was: pyarrow segfault reading Parquet files from GNOMAD) > [Python] Segfault reading Parquet files from GNOMAD > --- > > Key: ARROW-3933 > URL: https://issues.apache.org/jira/browse/ARROW-3933 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Ubuntu 18.04 or Mac OS X >Reporter: David Konerding >Priority: Minor > Labels: parquet > Fix For: 0.12.0 > > > I am getting segfault trying to run a basic program Ubuntu 18.04 VM (AWS). > Error also occurs out of box on Mac OS X. > $ sudo snap install --classic google-cloud-sdk > $ gsutil cp > gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet > . > $ conda install pyarrow > $ python test.py > Segmentation fault (core dumped) > test.py: > import pyarrow.parquet as pq > path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet" > pq.read_table(path) > gdb output: > Thread 3 "python" received signal SIGSEGV, Segmentation fault. > [Switching to Thread 0x7fffdf199700 (LWP 13703)] > 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, > unsigned long*) () from > /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11 > I tested fastparquet, it reads the file just fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD
[ https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3933: Component/s: Python > [Python] Segfault reading Parquet files from GNOMAD > --- > > Key: ARROW-3933 > URL: https://issues.apache.org/jira/browse/ARROW-3933 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Ubuntu 18.04 or Mac OS X >Reporter: David Konerding >Priority: Minor > Labels: parquet > Fix For: 0.12.0 > > > I am getting segfault trying to run a basic program Ubuntu 18.04 VM (AWS). > Error also occurs out of box on Mac OS X. > $ sudo snap install --classic google-cloud-sdk > $ gsutil cp > gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet > . > $ conda install pyarrow > $ python test.py > Segmentation fault (core dumped) > test.py: > import pyarrow.parquet as pq > path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet" > pq.read_table(path) > gdb output: > Thread 3 "python" received signal SIGSEGV, Segmentation fault. > [Switching to Thread 0x7fffdf199700 (LWP 13703)] > 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, > unsigned long*) () from > /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11 > I tested fastparquet, it reads the file just fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3914) [C++/Python/Packaging] Docker-compose setup for Alpine linux
[ https://issues.apache.org/jira/browse/ARROW-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3914. - Resolution: Fixed Fix Version/s: 0.12.0 Issue resolved by pull request 3059 [https://github.com/apache/arrow/pull/3059] > [C++/Python/Packaging] Docker-compose setup for Alpine linux > > > Key: ARROW-3914 > URL: https://issues.apache.org/jira/browse/ARROW-3914 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Packaging, Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3934) [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off
[ https://issues.apache.org/jira/browse/ARROW-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3934: -- Labels: pull-request-available (was: ) > [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off > -- > > Key: ARROW-3934 > URL: https://issues.apache.org/jira/browse/ARROW-3934 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Philipp Moritz >Assignee: Philipp Moritz >Priority: Minor > Labels: pull-request-available > Fix For: 0.12.0 > > > Currently the precompiled tests are compiled in any case, even if > ARROW_GANDIVA_BUILD_TESTS=off. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3934) [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off
Philipp Moritz created ARROW-3934: - Summary: [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off Key: ARROW-3934 URL: https://issues.apache.org/jira/browse/ARROW-3934 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Assignee: Philipp Moritz Fix For: 0.12.0 Currently the precompiled tests are compiled in any case, even if ARROW_GANDIVA_BUILD_TESTS=off. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3933) pyarrow segfault reading Parquet files from GNOMAD
David Konerding created ARROW-3933: -- Summary: pyarrow segfault reading Parquet files from GNOMAD Key: ARROW-3933 URL: https://issues.apache.org/jira/browse/ARROW-3933 Project: Apache Arrow Issue Type: Bug Components: C++ Environment: Ubuntu 18.04 or Mac OS X Reporter: David Konerding I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM (AWS). The error also occurs out of the box on Mac OS X. $ sudo snap install --classic google-cloud-sdk $ gsutil cp gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet . $ conda install pyarrow $ python test.py Segmentation fault (core dumped) test.py: import pyarrow.parquet as pq path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet" pq.read_table(path) gdb output: Thread 3 "python" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffdf199700 (LWP 13703)] 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, unsigned long*) () from /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11 I tested fastparquet; it reads the file just fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps
[ https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708059#comment-16708059 ] Wes McKinney commented on ARROW-3907: - ETL can be a messy business. If you have ideas about improving the APIs for schema coercion / casting, I'd be interested to discuss more > [Python] from_pandas errors when schemas are used with lower resolution > timestamps > -- > > Key: ARROW-3907 > URL: https://issues.apache.org/jira/browse/ARROW-3907 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1 >Reporter: David Lee >Priority: Major > Fix For: 0.11.1 > > > When passing in a schema object to from_pandas a resolution error occurs if > the schema uses a lower resolution timestamp. Do we need to also add > "coerce_timestamps" and "allow_truncated_timestamps" parameters found in > write_table() to from_pandas()? > Error: > pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would > lose data: 1532015191753713000', 'Conversion failed for column modified with > type datetime64[ns]') > Code: > > {code:java} > processed_schema = pa.schema([ > pa.field('Id', pa.string()), > pa.field('modified', pa.timestamp('ms')), > pa.field('records', pa.int32()) > ]) > pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
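The "would lose data" error above comes down to integer arithmetic: a nanosecond timestamp with sub-millisecond precision cannot round-trip through a millisecond column. A pure-Python sketch of that check, using the value from the error message (this mirrors the concept, not Arrow's actual cast code):

```python
ns = 1_532_015_191_753_713_000   # nanosecond value from the error message
ms = ns // 1_000_000             # truncating cast to milliseconds

# A "safe" cast refuses the conversion when the round trip does not
# restore the input:
lost = ns - ms * 1_000_000
print(ms, lost)                  # 1532015191753 713000
assert lost != 0                 # hence "Casting ... would lose data"
```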
[jira] [Resolved] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly
[ https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3874. - Resolution: Fixed Fix Version/s: 0.12.0 Issue resolved by pull request 3072 [https://github.com/apache/arrow/pull/3072] > [Gandiva] Cannot build: LLVM not detected correctly > --- > > Key: ARROW-3874 > URL: https://issues.apache.org/jira/browse/ARROW-3874 > Project: Apache Arrow > Issue Type: Bug > Components: Gandiva >Affects Versions: 0.12.0 > Environment: Fedora 29, master (1013a1dc) > gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5) > llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos) > cmake version 3.12.1 >Reporter: Suvayu Ali >Priority: Major > Labels: cmake, pull-request-available > Fix For: 0.12.0 > > Attachments: CMakeError.log, CMakeOutput.log, > arrow-cmake-findllvm.patch > > Time Spent: 20m > Remaining Estimate: 0h > > I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while > detecting LLVM on the system. > {code} > $ cd build/data-an/arrow/arrow/cpp/ > $ export ARROW_HOME=/opt/data-an > $ mkdir release > $ cd release/ > $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DARROW_GANDIVA=ON ../ > [...] > -- Found LLVM 6.0.1 > -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm > CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message): > Target X86 is not in the set of libraries. > Call Stack (most recent call first): > cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames) > src/gandiva/CMakeLists.txt:25 (find_package) > -- Configuring incomplete, errors occurred! > {code} > The cmake log files are attached. > When I invoke cmake with options other than *Gandiva*, it finishes > successfully. 
> Here are the llvm libraries that are installed on my system: > {code} > $ rpm -qa llvm\* | sort > llvm3.9-libs-3.9.1-13.fc28.x86_64 > llvm4.0-libs-4.0.1-5.fc28.x86_64 > llvm-6.0.1-8.fc28.x86_64 > llvm-devel-6.0.1-8.fc28.x86_64 > llvm-libs-6.0.1-8.fc28.i686 > llvm-libs-6.0.1-8.fc28.x86_64 > $ ls /usr/lib64/libLLVM* /usr/include/llvm > /usr/lib64/libLLVM-6.0.1.so /usr/lib64/libLLVM-6.0.so /usr/lib64/libLLVM.so > /usr/include/llvm: > ADT FuzzMutate Object Support > Analysis InitializePasses.h ObjectYAML TableGen > AsmParser IR Option Target > BinaryFormat IRReader PassAnalysisSupport.h Testing > Bitcode LineEditor Passes ToolDrivers > CodeGen LinkAllIR.h Pass.h Transforms > Config LinkAllPasses.h PassInfo.h WindowsManifest > DebugInfo Linker PassRegistry.h WindowsResource > Demangle LTO PassSupport.h XRay > ExecutionEngine MC ProfileData > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly
[ https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3874: --- Assignee: Suvayu Ali > [Gandiva] Cannot build: LLVM not detected correctly > --- > > Key: ARROW-3874 > URL: https://issues.apache.org/jira/browse/ARROW-3874 > Project: Apache Arrow > Issue Type: Bug > Components: Gandiva >Affects Versions: 0.12.0 > Environment: Fedora 29, master (1013a1dc) > gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5) > llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos) > cmake version 3.12.1 >Reporter: Suvayu Ali >Assignee: Suvayu Ali >Priority: Major > Labels: cmake, pull-request-available > Fix For: 0.12.0 > > Attachments: CMakeError.log, CMakeOutput.log, > arrow-cmake-findllvm.patch > > Time Spent: 20m > Remaining Estimate: 0h > > I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while > detecting LLVM on the system. > {code} > $ cd build/data-an/arrow/arrow/cpp/ > $ export ARROW_HOME=/opt/data-an > $ mkdir release > $ cd release/ > $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DARROW_GANDIVA=ON ../ > [...] > -- Found LLVM 6.0.1 > -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm > CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message): > Target X86 is not in the set of libraries. > Call Stack (most recent call first): > cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames) > src/gandiva/CMakeLists.txt:25 (find_package) > -- Configuring incomplete, errors occurred! > {code} > The cmake log files are attached. > When I invoke cmake with options other than *Gandiva*, it finishes > successfully. 
> Here are the llvm libraries that are installed on my system: > {code} > $ rpm -qa llvm\* | sort > llvm3.9-libs-3.9.1-13.fc28.x86_64 > llvm4.0-libs-4.0.1-5.fc28.x86_64 > llvm-6.0.1-8.fc28.x86_64 > llvm-devel-6.0.1-8.fc28.x86_64 > llvm-libs-6.0.1-8.fc28.i686 > llvm-libs-6.0.1-8.fc28.x86_64 > $ ls /usr/lib64/libLLVM* /usr/include/llvm > /usr/lib64/libLLVM-6.0.1.so /usr/lib64/libLLVM-6.0.so /usr/lib64/libLLVM.so > /usr/include/llvm: > ADT FuzzMutate Object Support > Analysis InitializePasses.h ObjectYAML TableGen > AsmParser IR Option Target > BinaryFormat IRReader PassAnalysisSupport.h Testing > Bitcode LineEditor Passes ToolDrivers > CodeGen LinkAllIR.h Pass.h Transforms > Config LinkAllPasses.h PassInfo.h WindowsManifest > DebugInfo Linker PassRegistry.h WindowsResource > Demangle LTO PassSupport.h XRay > ExecutionEngine MC ProfileData > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3906) [C++] Break builder.cc into multiple compilation units
[ https://issues.apache.org/jira/browse/ARROW-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3906: --- Assignee: Antoine Pitrou > [C++] Break builder.cc into multiple compilation units > -- > > Key: ARROW-3906 > URL: https://issues.apache.org/jira/browse/ARROW-3906 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.1 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 20m > Remaining Estimate: 0h > > To improve readability I suggest splitting {{builder.cc}} into independent > compilation units. Concrete builder classes are generally independent of each > other. The only concern is whether inlining some of the base class > implementations is important for performance. > This would also make incremental compilation faster when changing one of the > concrete classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3906) [C++] Break builder.cc into multiple compilation units
[ https://issues.apache.org/jira/browse/ARROW-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3906. - Resolution: Fixed Fix Version/s: 0.12.0 Issue resolved by pull request 3076 [https://github.com/apache/arrow/pull/3076] > [C++] Break builder.cc into multiple compilation units > -- > > Key: ARROW-3906 > URL: https://issues.apache.org/jira/browse/ARROW-3906 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.1 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 20m > Remaining Estimate: 0h > > To improve readability I suggest splitting {{builder.cc}} into independent > compilation units. Concrete builder classes are generally independent of each > other. The only concern is whether inlining some of the base class > implementations is important for performance. > This would also make incremental compilation faster when changing one of the > concrete classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3884) [Python] Add LLVM6 to manylinux1 base image
[ https://issues.apache.org/jira/browse/ARROW-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3884. - Resolution: Fixed Issue resolved by pull request 3079 [https://github.com/apache/arrow/pull/3079] > [Python] Add LLVM6 to manylinux1 base image > --- > > Key: ARROW-3884 > URL: https://issues.apache.org/jira/browse/ARROW-3884 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This is necessary to be able to build and bundle libgandiva with the 0.12 > release > This (epic!) build definition in Apache Kudu may be useful for building only > the pieces that we need for linking the Gandiva libraries, which may help > keep the image size minimal > https://github.com/apache/kudu/blob/master/thirdparty/build-definitions.sh#L175 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly
[ https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3874: -- Labels: cmake pull-request-available (was: cmake) > [Gandiva] Cannot build: LLVM not detected correctly > --- > > Key: ARROW-3874 > URL: https://issues.apache.org/jira/browse/ARROW-3874 > Project: Apache Arrow > Issue Type: Bug > Components: Gandiva >Affects Versions: 0.12.0 > Environment: Fedora 29, master (1013a1dc) > gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5) > llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos) > cmake version 3.12.1 >Reporter: Suvayu Ali >Priority: Major > Labels: cmake, pull-request-available > Attachments: CMakeError.log, CMakeOutput.log, > arrow-cmake-findllvm.patch > > > I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while > detecting LLVM on the system. > {code} > $ cd build/data-an/arrow/arrow/cpp/ > $ export ARROW_HOME=/opt/data-an > $ mkdir release > $ cd release/ > $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DARROW_GANDIVA=ON ../ > [...] > -- Found LLVM 6.0.1 > -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm > CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message): > Target X86 is not in the set of libraries. > Call Stack (most recent call first): > cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames) > src/gandiva/CMakeLists.txt:25 (find_package) > -- Configuring incomplete, errors occurred! > {code} > The cmake log files are attached. > When I invoke cmake with options other than *Gandiva*, it finishes > successfully. 
> Here are the llvm libraries that are installed on my system: > {code} > $ rpm -qa llvm\* | sort > llvm3.9-libs-3.9.1-13.fc28.x86_64 > llvm4.0-libs-4.0.1-5.fc28.x86_64 > llvm-6.0.1-8.fc28.x86_64 > llvm-devel-6.0.1-8.fc28.x86_64 > llvm-libs-6.0.1-8.fc28.i686 > llvm-libs-6.0.1-8.fc28.x86_64 > $ ls /usr/lib64/libLLVM* /usr/include/llvm > /usr/lib64/libLLVM-6.0.1.so /usr/lib64/libLLVM-6.0.so /usr/lib64/libLLVM.so > /usr/include/llvm: > ADT FuzzMutate Object Support > Analysis InitializePasses.h ObjectYAML TableGen > AsmParser IR Option Target > BinaryFormat IRReader PassAnalysisSupport.h Testing > Bitcode LineEditor Passes ToolDrivers > CodeGen LinkAllIR.h Pass.h Transforms > Config LinkAllPasses.h PassInfo.h WindowsManifest > DebugInfo Linker PassRegistry.h WindowsResource > Demangle LTO PassSupport.h XRay > ExecutionEngine MC ProfileData > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3199) [Plasma] Check for EAGAIN in recvmsg and sendmsg
[ https://issues.apache.org/jira/browse/ARROW-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philipp Moritz resolved ARROW-3199. --- Resolution: Fixed Issue resolved by pull request 2551 [https://github.com/apache/arrow/pull/2551] > [Plasma] Check for EAGAIN in recvmsg and sendmsg > > > Key: ARROW-3199 > URL: https://issues.apache.org/jira/browse/ARROW-3199 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Philipp Moritz >Assignee: Philipp Moritz >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > It turns out that > [https://github.com/apache/arrow/blob/673125fd416cbd2e5c2cb9cb6a4c925adecdaf2c/cpp/src/plasma/fling.cc#L63] > and probably also > [https://github.com/apache/arrow/blob/673125fd416cbd2e5c2cb9cb6a4c925adecdaf2c/cpp/src/plasma/fling.cc#L49] > can block and give an EAGAIN error. > This was discovered during stress tests by https://github.com/stephanie-wang/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2759) Export notification socket of Plasma
[ https://issues.apache.org/jira/browse/ARROW-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philipp Moritz resolved ARROW-2759. --- Resolution: Fixed Fix Version/s: 0.12.0 Issue resolved by pull request 3008 [https://github.com/apache/arrow/pull/3008] > Export notification socket of Plasma > > > Key: ARROW-2759 > URL: https://issues.apache.org/jira/browse/ARROW-2759 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++), Python >Reporter: Siyuan Zhuang >Assignee: Siyuan Zhuang >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Currently, I am implementing an async interface for Ray. The implementation > needs some kind of message-polling method like `get_next_notification`. > Unfortunately, I find `get_next_notification` in > `[https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx]` > blocking, which is an impediment to implementing async utilities. Also, it's > hard to check the status of the socket (it could be closed or broken). So I > suggest exporting the notification socket so that there will be more flexibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
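The request above is about being able to poll the notification socket without blocking. As a sketch of what an exported socket enables (a `socketpair` stands in for the real Plasma notification socket here; no pyarrow API is used), a zero-timeout `selectors` poll returns immediately instead of blocking the way `get_next_notification` does:

```python
# Non-blocking polling sketch: socketpair() simulates the Plasma
# notification socket that the issue asks to export.
import selectors
import socket

server, client = socket.socketpair()
sel = selectors.DefaultSelector()
sel.register(client, selectors.EVENT_READ)

# Nothing sent yet: a zero-timeout poll returns no events instead of blocking.
assert sel.select(timeout=0) == []

server.send(b"object-id-notification")
events = sel.select(timeout=1)
assert events                      # the socket is now readable
print(client.recv(64))             # b'object-id-notification'

sel.close()
server.close()
client.close()
```

An async framework can register the same file descriptor with its event loop, which is the flexibility the reporter is after.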
[jira] [Updated] (ARROW-3884) [Python] Add LLVM6 to manylinux1 base image
[ https://issues.apache.org/jira/browse/ARROW-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3884: -- Labels: pull-request-available (was: ) > [Python] Add LLVM6 to manylinux1 base image > --- > > Key: ARROW-3884 > URL: https://issues.apache.org/jira/browse/ARROW-3884 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > This is necessary to be able to build and bundle libgandiva with the 0.12 > release > This (epic!) build definition in Apache Kudu may be useful for building only > the pieces that we need for linking the Gandiva libraries, which may help > keep the image size minimal > https://github.com/apache/kudu/blob/master/thirdparty/build-definitions.sh#L175 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3932) [Python/Documentation] Include Benchmarks.md in Sphinx docs
Uwe L. Korn created ARROW-3932: -- Summary: [Python/Documentation] Include Benchmarks.md in Sphinx docs Key: ARROW-3932 URL: https://issues.apache.org/jira/browse/ARROW-3932 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Reporter: Uwe L. Korn Assignee: Uwe L. Korn https://github.com/apache/arrow/pull/2856#issuecomment-443711136 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3842) [R] RecordBatchStreamWriter api
[ https://issues.apache.org/jira/browse/ARROW-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3842. - Resolution: Fixed Fix Version/s: 0.12.0 Issue resolved by pull request 3043 [https://github.com/apache/arrow/pull/3043] > [R] RecordBatchStreamWriter api > --- > > Key: ARROW-3842 > URL: https://issues.apache.org/jira/browse/ARROW-3842 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Romain François >Assignee: Romain François >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 10m > Remaining Estimate: 0h > > To support the "Writing and Reading Streams" section of the vignette, perhaps > we should rely more on the RecordBatchStreamWriter class and less on the > `write_record_batch` function. > We should be able to write code resembling the python api: > {code:r} > batch <- ... > sink <- buffer_output_stream() > writer <- record_batch_stream_writer(sink, batch$schema()) > writer$write_batch(batch) > writer$close() > sink$getvalue() > {code} > Most of the code is there, but we need to add > - RecordBatchStreamWriter$write_batch() : write a record batch to the stream. > We already have RecordBatchStreamWriter$WriteRecordBatch > - RecordBatchStreamWriter$close() : not sure why it is lower case close() in > python but upper case in C++. We already have RecordBatchWriter$Close() > - BufferOutputStream$getvalue() : we already have BufferOutputStream$Finish() > Currently the constructor for a BufferOutputStream is buffer_output_stream(), > perhaps we can align with python and make it BufferOutputStream, that would > not clash with the `arrow::BufferOutputStream` class because of the > namespacing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals
[ https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707796#comment-16707796 ] Wes McKinney commented on ARROW-3586: - Might want to do that in a different conda environment > [Python] Segmentation fault when converting empty table to pandas with > categoricals > --- > > Key: ARROW-3586 > URL: https://issues.apache.org/jira/browse/ARROW-3586 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.10.0, 0.11.0 > Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas > 0.23.4 > - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4 >Reporter: Andreas >Priority: Major > Fix For: 0.12.0 > > > {code:java} > import pyarrow as pa > table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], > names=['col']) > table.to_pandas(categories=['col']){code} > This produces a segmentation fault for certain types (e.g, int\{32,64}) while > it works for others (e.g. string, binary). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals
[ https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707795#comment-16707795 ] Wes McKinney commented on ARROW-3586: - You can pip install the 0.11 or 0.11.1 wheel and check, {{pip install pyarrow==0.11.0}} > [Python] Segmentation fault when converting empty table to pandas with > categoricals > --- > > Key: ARROW-3586 > URL: https://issues.apache.org/jira/browse/ARROW-3586 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.10.0, 0.11.0 > Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas > 0.23.4 > - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4 >Reporter: Andreas >Priority: Major > Fix For: 0.12.0 > > > {code:java} > import pyarrow as pa > table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], > names=['col']) > table.to_pandas(categories=['col']){code} > This produces a segmentation fault for certain types (e.g, int\{32,64}) while > it works for others (e.g. string, binary). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals
[ https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707703#comment-16707703 ] Francois Saint-Jacques commented on ARROW-3586: --- Is it possible this was solved in the master branch? I can't seem to reproduce locally. ``` for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]: print(pa.Table.from_arrays(arrays=[pa.array([], type=t)], names=['col']).to_pandas(categories=['col'])) Empty DataFrame Columns: [col] Index: [] Empty DataFrame Columns: [col] Index: [] Empty DataFrame Columns: [col] Index: [] Empty DataFrame Columns: [col] Index: [] for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]: print(pa.Table.from_arrays(arrays=[pa.array([1,2,3], type=t)], names=['col']).to_pandas(categories=['col'])) col 0 1 1 2 2 3 col 0 1 1 2 2 3 col 0 1.0 1 2.0 2 3.0 col 0 1.0 1 2.0 2 3.0 ``` > [Python] Segmentation fault when converting empty table to pandas with > categoricals > --- > > Key: ARROW-3586 > URL: https://issues.apache.org/jira/browse/ARROW-3586 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.10.0, 0.11.0 > Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas > 0.23.4 > - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas > 0.23.4 >Reporter: Andreas >Priority: Major > Fix For: 0.12.0 > > > {code:java} > import pyarrow as pa > table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], > names=['col']) > table.to_pandas(categories=['col']){code} > This produces a segmentation fault for certain types (e.g., int\{32,64}) while > it works for others (e.g. string, binary). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals
[ https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707706#comment-16707706 ] Francois Saint-Jacques commented on ARROW-3586: --- Note that I was using python3, not sure if this would have any impact. > [Python] Segmentation fault when converting empty table to pandas with > categoricals > --- > > Key: ARROW-3586 > URL: https://issues.apache.org/jira/browse/ARROW-3586 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.10.0, 0.11.0 > Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas > 0.23.4 > - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4 >Reporter: Andreas >Priority: Major > Fix For: 0.12.0 > > > {code:java} > import pyarrow as pa > table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], > names=['col']) > table.to_pandas(categories=['col']){code} > This produces a segmentation fault for certain types (e.g, int\{32,64}) while > it works for others (e.g. string, binary). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals
[ https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707703#comment-16707703 ] Francois Saint-Jacques edited comment on ARROW-3586 at 12/3/18 7:49 PM: Is this possible this was solved in the master branch? I can't seem to reproduce locally. {code:java} for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]: print(pa.Table.from_arrays(arrays=[pa.array([], type=t)], names=['col']).to_pandas(categories=['col'])) Empty DataFrame Columns: [col] Index: [] Empty DataFrame Columns: [col] Index: [] Empty DataFrame Columns: [col] Index: [] Empty DataFrame Columns: [col] Index: [] for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]: print(pa.Table.from_arrays(arrays=[pa.array([1,2,3], type=t)], names=['col']).to_pandas(categories=['col'])) col 0 1 1 2 2 3 col 0 1 1 2 2 3 col 0 1.0 1 2.0 2 3.0 col 0 1.0 1 2.0 2 3.0 {code} was (Author: fsaintjacques): Is this possible this was solved in the master branch? I can't seem to reproduce locally. 
``` for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]: print(pa.Table.from_arrays(arrays=[pa.array([], type=t)], names=['col']).to_pandas(categories=['col'])) Empty DataFrame Columns: [col] Index: [] Empty DataFrame Columns: [col] Index: [] Empty DataFrame Columns: [col] Index: [] Empty DataFrame Columns: [col] Index: [] for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]: print(pa.Table.from_arrays(arrays=[pa.array([1,2,3], type=t)], names=['col']).to_pandas(categories=['col'])) col 0 1 1 2 2 3 col 0 1 1 2 2 3 col 0 1.0 1 2.0 2 3.0 col 0 1.0 1 2.0 2 3.0 ``` > [Python] Segmentation fault when converting empty table to pandas with > categoricals > --- > > Key: ARROW-3586 > URL: https://issues.apache.org/jira/browse/ARROW-3586 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.10.0, 0.11.0 > Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas > 0.23.4 > - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4 >Reporter: Andreas >Priority: Major > Fix For: 0.12.0 > > > {code:java} > import pyarrow as pa > table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], > names=['col']) > table.to_pandas(categories=['col']){code} > This produces a segmentation fault for certain types (e.g, int\{32,64}) while > it works for others (e.g. string, binary). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2839) [JS] Support whatwg/streams in IPC reader/writer
[ https://issues.apache.org/jira/browse/ARROW-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-2839: - Fix Version/s: (was: JS-0.4.0) JS-0.5.0 > [JS] Support whatwg/streams in IPC reader/writer > > > Key: ARROW-2839 > URL: https://issues.apache.org/jira/browse/ARROW-2839 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Affects Versions: JS-0.3.1 >Reporter: Paul Taylor >Assignee: Paul Taylor >Priority: Major > Fix For: JS-0.5.0 > > > We should make it easy to stream Arrow in the browser via > [whatwg/streams|https://github.com/whatwg/streams]. I already have this > working at Graphistry, but I had to use some of the IPC internal methods. > Creating this issue to track back-porting that work and the few minor > refactors to the IPC internals that we'll need to do. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3470) [C++] Row-wise conversion tutorial has fallen out of date
[ https://issues.apache.org/jira/browse/ARROW-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707664#comment-16707664 ] Francois Saint-Jacques commented on ARROW-3470: --- See added PR for the difference in documentation (single embedded code block with comments). > [C++] Row-wise conversion tutorial has fallen out of date > - > > Key: ARROW-3470 > URL: https://issues.apache.org/jira/browse/ARROW-3470 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 10m > Remaining Estimate: 0h > > As reported on user@ list -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3470) [C++] Row-wise conversion tutorial has fallen out of date
[ https://issues.apache.org/jira/browse/ARROW-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3470: -- Labels: pull-request-available (was: ) > [C++] Row-wise conversion tutorial has fallen out of date > - > > Key: ARROW-3470 > URL: https://issues.apache.org/jira/browse/ARROW-3470 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > As reported on user@ list -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3303) [C++] Enable example arrays to be written with a simplified JSON representation
[ https://issues.apache.org/jira/browse/ARROW-3303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-3303: - Assignee: Antoine Pitrou > [C++] Enable example arrays to be written with a simplified JSON > representation > --- > > Key: ARROW-3303 > URL: https://issues.apache.org/jira/browse/ARROW-3303 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Fix For: 0.13.0 > > > In addition to making it easier to generate random data as described in > ARROW-2329, I think it would be useful to reduce some of the boilerplate > associated with writing down explicit test cases. The benefits of this will > be especially pronounced when writing nested arrays. > Example code that could be improved this way: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array-test.cc#L3271 > Rather than having a ton of hand-written assertions, we could compare with > the expected true dataset. Of course, this itself has to be tested > endogenously, but I think we can write enough tests for the JSON parser bit > to be able to have confidence in tests that are written with it -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps
[ https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee closed ARROW-3907. Resolution: Not A Problem Fix Version/s: 0.11.1 Closing for now. Not convinced the {{safe}} option is the best solution to address timestamp resolution. If a schema is used, it should be clear that the intent is to convert pandas nanoseconds to a lower resolution. I think the same can be said for other kinds of conversions, such as floats to ints. > [Python] from_pandas errors when schemas are used with lower resolution > timestamps > -- > > Key: ARROW-3907 > URL: https://issues.apache.org/jira/browse/ARROW-3907 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1 >Reporter: David Lee >Priority: Major > Fix For: 0.11.1 > > > When passing a schema object to from_pandas, a resolution error occurs if > the schema uses a lower-resolution timestamp. Do we need to also add the > "coerce_timestamps" and "allow_truncated_timestamps" parameters found in > write_table() to from_pandas()? > Error: > pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would > lose data: 1532015191753713000', 'Conversion failed for column modified with > type datetime64[ns]') > Code: >
> {code:python}
> processed_schema = pa.schema([
>     pa.field('Id', pa.string()),
>     pa.field('modified', pa.timestamp('ms')),
>     pa.field('records', pa.int32())
> ])
> pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
> {code}
> -- This message was sent by Atlassian JIRA (v7.6.3#76005)
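The refused cast in this issue is easy to reproduce arithmetically. A small stdlib-only sketch (not pyarrow code) of why casting {{timestamp[ns]}} to {{timestamp[ms]}} is rejected for the reported value:

```python
# Why casting timestamp[ns] -> timestamp[ms] loses data here: one millisecond
# holds 1_000_000 nanoseconds, so any remainder below that gets truncated.
NS_PER_MS = 1_000_000

def ns_to_ms_is_lossless(ts_ns):
    """True if a nanosecond timestamp survives a round-trip cast to milliseconds."""
    return ts_ns % NS_PER_MS == 0

# The exact value from the error message above:
ts = 1532015191753713000
lossless = ns_to_ms_is_lossless(ts)   # False: the trailing 713000 ns would be dropped
truncated = ts // NS_PER_MS * NS_PER_MS
```

One commonly suggested workaround (assuming pandas is available) is to truncate before conversion, e.g. {{df['modified'] = df['modified'].dt.floor('ms')}}, so the cast becomes lossless.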
[jira] [Commented] (ARROW-3470) [C++] Row-wise conversion tutorial has fallen out of date
[ https://issues.apache.org/jira/browse/ARROW-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707519#comment-16707519 ] Francois Saint-Jacques commented on ARROW-3470: --- I've extracted the full example into a single file and added the CMake functionality to build it (mimicking the benchmark/test facility). I'm wondering if it's OK to simplify the whole documented example into a single code block, with the explanatory text as comments? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3470) [C++] Row-wise conversion tutorial has fallen out of date
[ https://issues.apache.org/jira/browse/ARROW-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-3470: - Assignee: Francois Saint-Jacques -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3931) Make possible to build regardless of LANG
[ https://issues.apache.org/jira/browse/ARROW-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3931: -- Labels: pull-request-available (was: ) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3931) Make possible to build regardless of LANG
Kousuke Saruta created ARROW-3931: - Summary: Make possible to build regardless of LANG Key: ARROW-3931 URL: https://issues.apache.org/jira/browse/ARROW-3931 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.12.0 Reporter: Kousuke Saruta When building the C++ libraries, CompilerInfo.cmake detects the compiler by string-matching the output of {{gcc -v}} (or {{clang -v}}). When LANG is set to a non-English locale, the build fails because the string match does not succeed. The following is the case of ja_JP.UTF-8 (Japanese); note the localized lines ("組み込み spec を使用しています。" means "Using built-in specs.", "ターゲット" means "Target", and "configure 設定" means "configure settings"):
{code}
CMake Error at cmake_modules/CompilerInfo.cmake:92 (message):
  Unknown compiler. Version info:
  組み込み spec を使用しています。
  COLLECT_GCC=/usr/bin/c++
  COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
  ターゲット: x86_64-redhat-linux
  configure 設定: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto
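A common way to make such parsing locale-independent is to force the C locale for the child process whose output is being matched. A minimal Python sketch of the idea (the helper names are mine, and this is not the actual CMake fix):

```python
import os
import subprocess

def c_locale_env():
    """Copy of the current environment that forces untranslated, C-locale
    output from child tools, regardless of the user's LANG setting."""
    env = dict(os.environ)
    env["LC_ALL"] = "C"        # overrides LANG and every LC_* category
    env.pop("LANGUAGE", None)  # GNU gettext variable that can trump LC_ALL
    return env

def compiler_version_banner(compiler="c++"):
    """Run `<compiler> -v` under the C locale; gcc and clang print the
    version banner on stderr, so return that stream."""
    result = subprocess.run([compiler, "-v"], env=c_locale_env(),
                            capture_output=True, text=True)
    return result.stderr
```

In CMake terms, a similar effect can be had by setting {{ENV{LC_ALL}}} to {{C}} before the {{execute_process}} call that captures the compiler output.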
[jira] [Updated] (ARROW-3906) [C++] Break builder.cc into multiple compilation units
[ https://issues.apache.org/jira/browse/ARROW-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3906: -- Labels: pull-request-available (was: ) > [C++] Break builder.cc into multiple compilation units > -- > > Key: ARROW-3906 > URL: https://issues.apache.org/jira/browse/ARROW-3906 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.1 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > To improve readability I suggest splitting {{builder.cc}} into independent > compilation units. Concrete builder classes are generally independent of each > other. The only concern is whether inlining some of the base class > implementations is important for performance. > This would also make incremental compilation faster when changing one of the > concrete classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3853) [C++] Implement string to timestamp cast
[ https://issues.apache.org/jira/browse/ARROW-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3853. - Resolution: Fixed Fix Version/s: (was: 0.13.0) 0.12.0 Issue resolved by pull request 3044 [https://github.com/apache/arrow/pull/3044] > [C++] Implement string to timestamp cast > > > Key: ARROW-3853 > URL: https://issues.apache.org/jira/browse/ARROW-3853 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > Companion work to ARROW-3738 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3930) [C++] Random test data generation is slow
[ https://issues.apache.org/jira/browse/ARROW-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3930: -- Labels: pull-request-available (was: ) > [C++] Random test data generation is slow > - > > Key: ARROW-3930 > URL: https://issues.apache.org/jira/browse/ARROW-3930 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.1 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > It seems a non-negligible amount of time in the test suite is spent in the > Mersenne Twister random engine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3930) [C++] Random test data generation is slow
Antoine Pitrou created ARROW-3930: - Summary: [C++] Random test data generation is slow Key: ARROW-3930 URL: https://issues.apache.org/jira/browse/ARROW-3930 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.11.1 Reporter: Antoine Pitrou It seems a non-negligible amount of time in the test suite is spent in the Mersenne Twister random engine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
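For context on the kind of fix this usually gets: swap the Mersenne Twister for a much cheaper generator when only bulk test data is needed. A Python sketch of one such generator, splitmix64 (illustrative only; not necessarily what Arrow's C++ patch uses):

```python
# splitmix64: a cheap 64-bit generator often used where generation speed
# matters more than statistical pedigree (e.g. bulk test-data generation).
MASK64 = (1 << 64) - 1

def splitmix64(seed):
    """Yield an endless, deterministic stream of pseudo-random 64-bit ints."""
    state = seed & MASK64
    while True:
        state = (state + 0x9E3779B97F4A7C15) & MASK64
        z = state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        yield z ^ (z >> 31)

gen = splitmix64(42)
sample = [next(gen) for _ in range(3)]  # same seed always gives the same values
```

Each step is a handful of integer multiplies and shifts, versus the Mersenne Twister's large state array and periodic refill, which is why generators of this family tend to be much faster in tight loops.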
[jira] [Updated] (ARROW-3929) [Go] improve memory usage of CSV reader to improve runtime performances
[ https://issues.apache.org/jira/browse/ARROW-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3929: -- Labels: pull-request-available (was: ) > [Go] improve memory usage of CSV reader to improve runtime performances > --- > > Key: ARROW-3929 > URL: https://issues.apache.org/jira/browse/ARROW-3929 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Sebastien Binet >Assignee: Sebastien Binet >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly
[ https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707086#comment-16707086 ] Suvayu Ali commented on ARROW-3874: --- Done: [https://github.com/apache/arrow/pull/3072] Your question about {{jni.h}} gave me enough hints to find the correct missing package :), and now the build progresses until it fails with:
{code}
Scanning dependencies of target csv-chunker-test
CMakeFiles/json-integration-test.dir/json-integration-test.cc.o:json-integration-test.cc:function boost::system::error_category::std_category::equivalent(std::error_code const&, int) const: error: undefined reference to 'boost::system::detail::generic_category_ncx()'
{code}
This is strange because I have {{boost-system-1.66.0-14.fc29.x86_64}} installed on my system. But I guess that's a test, and the libraries were built successfully. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly
[ https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707008#comment-16707008 ] Pindikura Ravindra commented on ARROW-3874: --- The LLVM-related change looks good. Would you like to raise a PR? For the Java issue, can you please check whether you have a jni.h file in the JDK install directory? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3929) [Go] improve memory usage of CSV reader to improve runtime performances
Sebastien Binet created ARROW-3929: -- Summary: [Go] improve memory usage of CSV reader to improve runtime performances Key: ARROW-3929 URL: https://issues.apache.org/jira/browse/ARROW-3929 Project: Apache Arrow Issue Type: Improvement Components: Go Reporter: Sebastien Binet Assignee: Sebastien Binet -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3681) [Go] add benchmarks for CSV reader
[ https://issues.apache.org/jira/browse/ARROW-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3681: -- Labels: pull-request-available (was: ) > [Go] add benchmarks for CSV reader > -- > > Key: ARROW-3681 > URL: https://issues.apache.org/jira/browse/ARROW-3681 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Sebastien Binet >Assignee: Sebastien Binet >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3916) [Python] Support caller-provided filesystem in `ParquetWriter` constructor
[ https://issues.apache.org/jira/browse/ARROW-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706856#comment-16706856 ] Mackenzie commented on ARROW-3916: -- Yep! Here it is: https://github.com/apache/arrow/pull/3070 > [Python] Support caller-provided filesystem in `ParquetWriter` constructor > -- > > Key: ARROW-3916 > URL: https://issues.apache.org/jira/browse/ARROW-3916 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Affects Versions: 0.11.1 >Reporter: Mackenzie >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > Currently, to write files incrementally to S3, the following pattern appears > necessary:
> {code:python}
> def write_dfs_to_s3(dfs, fname):
>     first_df = dfs[0]
>     table = pa.Table.from_pandas(first_df, preserve_index=False)
>     fs = s3fs.S3FileSystem()
>     fh = fs.open(fname, 'wb')
>     with pq.ParquetWriter(fh, table.schema) as writer:
>         # set file handle on writer so writer manages closing it when it
>         # is itself closed
>         writer.file_handle = fh
>         writer.write_table(table=table)
>         for df in dfs[1:]:
>             table = pa.Table.from_pandas(df, preserve_index=False)
>             writer.write_table(table=table)
> {code}
> This works as expected, but is quite roundabout. It would be much easier if > `ParquetWriter` supported `filesystem` as a keyword argument in its > constructor, in which case `_get_fs_from_path` would be overridden by the > usual pattern of using the kwarg after ensuring it is a proper file system > with `_ensure_filesystem`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
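The requested constructor change is easy to sketch in miniature (all names below are illustrative stand-ins, not pyarrow's actual classes): accept an optional {{filesystem}}, fall back to inferring one from the path, and let the writer own the resulting file handle.

```python
class LocalFS:
    """Minimal filesystem stand-in; a real caller might pass s3fs.S3FileSystem()."""
    def open(self, path, mode="wb"):
        return open(path, mode)

def infer_fs_from_path(path):
    # stand-in for an internal path-based fallback like _get_fs_from_path
    return LocalFS()

class ToyParquetWriter:
    def __init__(self, where, filesystem=None):
        fs = filesystem if filesystem is not None else infer_fs_from_path(where)
        # the writer owns the handle, so callers no longer juggle it themselves
        self.file_handle = fs.open(where, "wb")

    def write_table(self, table):
        self.file_handle.write(table)  # toy: "table" is just bytes here

    def close(self):
        self.file_handle.close()
```

With this shape, the S3 example in the issue collapses to constructing the writer with {{filesystem=s3fs.S3FileSystem()}} and no manual file-handle bookkeeping.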
[jira] [Assigned] (ARROW-3681) [Go] add benchmarks for CSV reader
[ https://issues.apache.org/jira/browse/ARROW-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastien Binet reassigned ARROW-3681: -- Assignee: Sebastien Binet > [Go] add benchmarks for CSV reader > -- > > Key: ARROW-3681 > URL: https://issues.apache.org/jira/browse/ARROW-3681 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Sebastien Binet >Assignee: Sebastien Binet >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)