[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-12-03 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708321#comment-16708321
 ] 

Suvayu Ali commented on ARROW-3874:
---

Since I'm using {{java-1.8.0-openjdk}}, I had to install 
{{java-1.8.0-openjdk-devel}} to get {{jni.h}}.  For other Java versions on F29, 
it should be {{java-<version>-openjdk-devel}}. 
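As a quick sanity check before re-running cmake, something like the following can confirm the header is present (a minimal sketch; the fallback path is an assumed Fedora openjdk layout, so set JAVA_HOME for your install):

```python
import os
from pathlib import Path

# Minimal check that the JDK headers (jni.h) needed by the build exist.
# The fallback path is an assumed Fedora openjdk layout; set JAVA_HOME
# to your actual JDK root if it differs.
java_home = os.environ.get('JAVA_HOME', '/usr/lib/jvm/java-1.8.0-openjdk')
jni_h = Path(java_home) / 'include' / 'jni.h'
msg = 'jni.h found' if jni_h.exists() else 'jni.h missing: install the matching -devel package'
print(msg)
```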

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Assignee: Suvayu Ali
>Priority: Major
>  Labels: cmake, pull-request-available
> Fix For: 0.12.0
>
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}
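As an aside for readers, the shape of the failure can be sketched in pure Python. This is an illustrative model only, not the actual {{llvm_map_components_to_libnames}} logic, and the component names are assumptions:

```python
# Hedged, pure-Python sketch of the failure mode: CMake maps requested
# LLVM components to library names and errors out when a required target
# (here X86) is not among the components the installed LLVM advertises.
# map_components below is illustrative, not the real CMake logic.
def map_components(requested, available):
    missing = [c for c in requested if c not in available]
    if missing:
        raise RuntimeError(f"Target {missing[0]} is not in the set of libraries.")
    return [f"LLM{c}".replace("LLM", "LLVM") for c in requested]

# A monolithic libLLVM.so install may not advertise per-target component
# libraries, which reproduces the error text from the CMake run above.
try:
    map_components(["core", "X86"], ["core"])
    error = None
except RuntimeError as exc:
    error = str(exc)
print(error)
```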



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2323) [JS] Document JavaScript release management

2018-12-03 Thread Brian Hulette (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette reassigned ARROW-2323:


Assignee: Brian Hulette

> [JS] Document JavaScript release management
> ---
>
> Key: ARROW-2323
> URL: https://issues.apache.org/jira/browse/ARROW-2323
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Assignee: Brian Hulette
>Priority: Major
> Fix For: JS-0.4.0
>
>
> The JavaScript post-vote release management process is not documented. For 
> example, there are certain NPM-related steps required to publish 
> artifacts after the release vote has taken place.





[jira] [Updated] (ARROW-2984) [JS] Refactor release verification script to share code with main source release verification script

2018-12-03 Thread Brian Hulette (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette updated ARROW-2984:
-
Fix Version/s: (was: JS-0.4.0)
   JS-0.5.0

> [JS] Refactor release verification script to share code with main source 
> release verification script
> 
>
> Key: ARROW-2984
> URL: https://issues.apache.org/jira/browse/ARROW-2984
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
> Fix For: JS-0.5.0
>
>
> There is some possible code duplication. See discussion in ARROW-2977 
> https://github.com/apache/arrow/pull/2369





[jira] [Assigned] (ARROW-3892) [JS] Remove any dependency on compromised NPM flatmap-stream package

2018-12-03 Thread Brian Hulette (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette reassigned ARROW-3892:


Assignee: Brian Hulette

> [JS] Remove any dependency on compromised NPM flatmap-stream package
> 
>
> Key: ARROW-3892
> URL: https://issues.apache.org/jira/browse/ARROW-3892
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>
> We are erroring out as the result of 
> https://github.com/dominictarr/event-stream/issues/116
> {code}
>  npm ERR! code ENOVERSIONS
>  npm ERR! No valid versions available for flatmap-stream
> {code}





[jira] [Updated] (ARROW-3892) [JS] Remove any dependency on compromised NPM flatmap-stream package

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3892:
--
Labels: pull-request-available  (was: )






[jira] [Reopened] (ARROW-3834) [Doc] Merge Python & C++ and move to top-level

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-3834:
-

> [Doc] Merge Python & C++ and move to top-level
> --
>
> Key: ARROW-3834
> URL: https://issues.apache.org/jira/browse/ARROW-3834
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Merge the C++, Python and Format documentation and move it to the top-level 
> folder.





[jira] [Commented] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column

2018-12-03 Thread Brian Hulette (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708149#comment-16708149
 ] 

Brian Hulette commented on ARROW-3667:
--

Makes sense, thanks for the context.
Maybe I'll start a discussion on the mailing list to define how we represent 
the null datatype in JSON.

> [JS] Incorrectly reads record batches with an all null column
> -
>
> Key: ARROW-3667
> URL: https://issues.apache.org/jira/browse/ARROW-3667
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: JS-0.3.1
>Reporter: Brian Hulette
>Priority: Major
> Fix For: JS-0.4.0
>
>
> The JS library seems to incorrectly read any columns that come after an 
> all-null column in IPC buffers produced by pyarrow.
> Here's a Python script that generates two Arrow buffers: one with an all-null 
> column followed by a utf-8 column, and a second with those two reversed:
> {code:python}
> import pyarrow as pa
> import pandas as pd
>
> def serialize_to_arrow(df, fd, compress=True):
>     batch = pa.RecordBatch.from_pandas(df)
>     writer = pa.RecordBatchFileWriter(fd, batch.schema)
>     writer.write_batch(batch)
>     writer.close()
>
> if __name__ == "__main__":
>     df = pd.DataFrame(data={'nulls': [None, None, None],
>                             'not nulls': ['abc', 'def', 'ghi']},
>                       columns=['nulls', 'not nulls'])
>     with open('bad.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
>     df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
>     with open('good.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
> {code}
> JS incorrectly interprets the [null, not null] case:
> {code:javascript}
> > var arrow = require('apache-arrow')
> undefined
> > var fs = require('fs')
> undefined
> > arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not 
> > nulls').get(0)
> 'abc'
> > arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
> '\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
> {code}
> Presumably this is because pyarrow is omitting some (or all) of the buffers 
> associated with the all-null column, but the JS IPC reader is still looking 
> for them, causing the buffer count to get out of sync.
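For illustration, the garbage string is consistent with decoding the utf-8 column's offsets buffer as character data. A minimal stdlib-only sketch (the buffer layout here is assumed from the Arrow format, not dumped from the actual file):

```python
import struct

# Hypothetical illustration of the symptom: the int32 offsets buffer for
# the utf-8 column ['abc', 'def', 'ghi'] holds the offsets [0, 3, 6, 9].
offsets = struct.pack("<4i", 0, 3, 6, 9)

# If the reader's buffer accounting slips (because the all-null column
# shipped fewer buffers than expected), decoding the offsets buffer as
# character data yields this kind of control-character string:
# mostly NUL bytes with 3, 6, and 9 (tab) interspersed.
garbled = offsets.decode("latin-1")
print(repr(garbled))
```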





[jira] [Resolved] (ARROW-3834) [Doc] Merge Python & C++ and move to top-level

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3834.
-
Resolution: Fixed

Issue resolved by pull request 2856
[https://github.com/apache/arrow/pull/2856]

> [Doc] Merge Python & C++ and move to top-level
> --
>
> Key: ARROW-3834
> URL: https://issues.apache.org/jira/browse/ARROW-3834
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Merge the C++, Python and Format documentation and move it to the top-level 
> folder.





[jira] [Updated] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column

2018-12-03 Thread Brian Hulette (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette updated ARROW-3667:
-
Fix Version/s: (was: JS-0.4.0)
   JS-0.5.0

> [JS] Incorrectly reads record batches with an all null column
> -
>
> Key: ARROW-3667
> URL: https://issues.apache.org/jira/browse/ARROW-3667
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: JS-0.3.1
>Reporter: Brian Hulette
>Priority: Major
> Fix For: JS-0.5.0
>
>
> The JS library seems to incorrectly read any columns that come after an 
> all-null column in IPC buffers produced by pyarrow.
> Here's a Python script that generates two Arrow buffers: one with an all-null 
> column followed by a utf-8 column, and a second with those two reversed:
> {code:python}
> import pyarrow as pa
> import pandas as pd
>
> def serialize_to_arrow(df, fd, compress=True):
>     batch = pa.RecordBatch.from_pandas(df)
>     writer = pa.RecordBatchFileWriter(fd, batch.schema)
>     writer.write_batch(batch)
>     writer.close()
>
> if __name__ == "__main__":
>     df = pd.DataFrame(data={'nulls': [None, None, None],
>                             'not nulls': ['abc', 'def', 'ghi']},
>                       columns=['nulls', 'not nulls'])
>     with open('bad.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
>     df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
>     with open('good.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
> {code}
> JS incorrectly interprets the [null, not null] case:
> {code:javascript}
> > var arrow = require('apache-arrow')
> undefined
> > var fs = require('fs')
> undefined
> > arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not 
> > nulls').get(0)
> 'abc'
> > arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
> '\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
> {code}
> Presumably this is because pyarrow is omitting some (or all) of the buffers 
> associated with the all-null column, but the JS IPC reader is still looking 
> for them, causing the buffer count to get out of sync.





[jira] [Updated] (ARROW-951) [JS] Fix generated API documentation

2018-12-03 Thread Brian Hulette (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette updated ARROW-951:

Fix Version/s: (was: JS-0.4.0)
   JS-0.5.0

> [JS] Fix generated API documentation
> 
>
> Key: ARROW-951
> URL: https://issues.apache.org/jira/browse/ARROW-951
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Minor
>  Labels: documentation
> Fix For: JS-0.5.0
>
>
> The current generated API documentation doesn't respect the project's 
> namespaces, it simply lists all exported objects. We should see if we can 
> make typedoc display the project's structure (even if it means re-structuring 
> the code a bit), or find another approach for doc generation.





[jira] [Updated] (ARROW-3337) [JS] IPC writer doesn't serialize the dictionary of nested Vectors

2018-12-03 Thread Brian Hulette (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette updated ARROW-3337:
-
Fix Version/s: (was: JS-0.4.0)
   JS-0.5.0

> [JS] IPC writer doesn't serialize the dictionary of nested Vectors
> --
>
> Key: ARROW-3337
> URL: https://issues.apache.org/jira/browse/ARROW-3337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: JS-0.3.1
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
> Fix For: JS-0.5.0
>
>
> The JS writer only serializes dictionaries for [top-level 
> children|https://github.com/apache/arrow/blob/ee9b1ba426e2f1f117cde8d8f4ba6fbe3be5674c/js/src/ipc/writer/binary.ts#L40]
>  of a Table. This is wrong, and an oversight on my part. The fix here is to 
> put the actual Dictionary vectors in the `schema.dictionaries` map instead of 
> in the dictionary fields, as I understand the C++ implementation does.
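The recursive walk described above can be sketched as follows. This is a hedged Python model with a made-up field layout ({{dictionary_id}}, {{children}}), not the real JS writer types:

```python
# Sketch of the fix: walk every field recursively (not only the table's
# top-level children) and collect dictionary-encoded fields into a
# schema-level dictionaries map. The dict-based field layout is a
# simplified stand-in for illustration.
def collect_dictionaries(fields, found=None):
    found = {} if found is None else found
    for field in fields:
        if field.get("dictionary_id") is not None:
            found[field["dictionary_id"]] = field["name"]
        collect_dictionaries(field.get("children", []), found)
    return found

schema_fields = [
    {"name": "list_col", "dictionary_id": None,
     "children": [{"name": "codes", "dictionary_id": 0, "children": []}]},
]
dictionaries = collect_dictionaries(schema_fields)
print(dictionaries)  # the nested dictionary field is found
```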





[jira] [Commented] (ARROW-2860) [Python] Null values in a single partition of Parquet dataset, results in invalid schema on read

2018-12-03 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708141#comment-16708141
 ] 

Wes McKinney commented on ARROW-2860:
-

Thanks for checking that. There are a couple of related issues that may be the 
same thing.

> [Python] Null values in a single partition of Parquet dataset, results in 
> invalid schema on read
> 
>
> Key: ARROW-2860
> URL: https://issues.apache.org/jira/browse/ARROW-2860
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Sam Oluwalana
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> from datetime import datetime, timedelta
>
> def generate_data(event_type, event_id, offset=0):
>     """Generate data."""
>     now = datetime.utcnow() + timedelta(seconds=offset)
>     obj = {
>         'event_type': event_type,
>         'event_id': event_id,
>         'event_date': now.date(),
>         'foo': None,
>         'bar': u'hello',
>     }
>     if event_type == 2:
>         obj['foo'] = 1
>         obj['bar'] = u'world'
>     if event_type == 3:
>         obj['different'] = u'data'
>         obj['bar'] = u'event type 3'
>     else:
>         obj['different'] = None
>     return obj
>
> data = [
>     generate_data(1, 1, 1),
>     generate_data(1, 1, 3600 * 72),
>     generate_data(2, 1, 1),
>     generate_data(2, 1, 3600 * 72),
>     generate_data(3, 1, 1),
>     generate_data(3, 1, 3600 * 72),
> ]
> df = pd.DataFrame.from_records(data, index='event_id')
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/tmp/events',
>                     partition_cols=['event_type', 'event_date'])
> dataset = pq.ParquetDataset('/tmp/events')
> table = dataset.read()
> print(table.num_rows)
> {code}
> Expected output:
> {code:python}
> 6
> {code}
> Actual:
> {code:python}
> python example_failure.py
> Traceback (most recent call last):
>   File "example_failure.py", line 43, in <module>
> dataset = pq.ParquetDataset('/tmp/events')
>   File 
> "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
>  line 745, in __init__
> self.validate_schemas()
>   File 
> "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
>  line 775, in validate_schemas
> dataset_schema))
> ValueError: Schema in partition[event_type=2, event_date=0] 
> /tmp/events/event_type=3/event_date=2018-07-16 
> 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
> bar: string
> different: string
> foo: double
> event_id: int64
> metadata
> 
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
> "columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
> "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
> "field_name": "different", "name": "different", "numpy_type": "object", 
> "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": 
> "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
> "field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
> "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
> null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> vs
> bar: string
> different: null
> foo: double
> event_id: int64
> metadata
> 
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
> "columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
> "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
> "field_name": "different", "name": "different", "numpy_type": "object", 
> "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": 
> "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
> "field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
> "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
> null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> {code}
> Apparently what is happening is that pyarrow is interpreting the schema from 
> each of the partitions individually and the partitions for `event_type=3 / 
> event_date=*`  both have values for the column `different` whereas the other 
> columns do not. The discrepancy causes the `None` values of the other 
> partitions to be labeled as `pandas_type` `empty` instead of `unicode`.
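The mechanism can be sketched as follows; {{infer_type}} here is a hypothetical stand-in for pyarrow's per-partition type inference, not its actual implementation:

```python
# Each partition infers a schema from its own values, and a partition
# whose column is entirely None carries no evidence of the real type,
# so per-partition schemas disagree. infer_type is illustrative only.
def infer_type(values):
    observed = {type(v).__name__ for v in values if v is not None}
    return observed.pop() if len(observed) == 1 else "null"

partition_with_values = infer_type(["data", "event type 3"])  # 'different' populated
partition_all_none = infer_type([None, None])                 # 'different' all None
print(partition_with_values, partition_all_none)  # the two schemas disagree
```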





[jira] [Resolved] (ARROW-2909) [JS] Add convenience function for creating a table from a list of vectors

2018-12-03 Thread Brian Hulette (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette resolved ARROW-2909.
--
Resolution: Fixed

Issue resolved by pull request 2322
[https://github.com/apache/arrow/pull/2322]

> [JS] Add convenience function for creating a table from a list of vectors
> -
>
> Key: ARROW-2909
> URL: https://issues.apache.org/jira/browse/ARROW-2909
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Similar to ARROW-2766, but requires users to first turn their arrays into 
> vectors, so we don't have to deduce the type.





[jira] [Commented] (ARROW-2860) [Python] Null values in a single partition of Parquet dataset, results in invalid schema on read

2018-12-03 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708137#comment-16708137
 ] 

Tanya Schlusser commented on ARROW-2860:


I think this was resolved with https://issues.apache.org/jira/browse/ARROW-2891

pull request 2302
[https://github.com/apache/arrow/pull/2302]

When I run {{example_failure.py}} it does not fail and returns the expected 
result.






[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3933:

Labels: parquet  (was: )

> [Python] Segfault reading Parquet files from GNOMAD
> ---
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
>Reporter: David Konerding
>Priority: Minor
>  Labels: parquet
> Fix For: 0.12.0
>
>
> I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM 
> (AWS). The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp 
> gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>  .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
> unsigned long*) () from 
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I tested fastparquet; it reads the file just fine.





[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3933:

Fix Version/s: 0.12.0






[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3933:

Summary: [Python] Segfault reading Parquet files from GNOMAD  (was: pyarrow 
segfault reading Parquet files from GNOMAD)






[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3933:

Component/s: Python






[jira] [Resolved] (ARROW-3914) [C++/Python/Packaging] Docker-compose setup for Alpine linux

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3914.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 3059
[https://github.com/apache/arrow/pull/3059]

> [C++/Python/Packaging] Docker-compose setup for Alpine linux
> 
>
> Key: ARROW-3914
> URL: https://issues.apache.org/jira/browse/ARROW-3914
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Packaging, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-3934) [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3934:
--
Labels: pull-request-available  (was: )

> [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off
> --
>
> Key: ARROW-3934
> URL: https://issues.apache.org/jira/browse/ARROW-3934
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Currently the precompiled tests are compiled in any case, even if 
> ARROW_GANDIVA_BUILD_TESTS=off.





[jira] [Created] (ARROW-3934) [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off

2018-12-03 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3934:
-

 Summary: [Gandiva] Don't compile precompiled tests if 
ARROW_GANDIVA_BUILD_TESTS=off
 Key: ARROW-3934
 URL: https://issues.apache.org/jira/browse/ARROW-3934
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz
Assignee: Philipp Moritz
 Fix For: 0.12.0


Currently the precompiled tests are compiled in any case, even if 
ARROW_GANDIVA_BUILD_TESTS=off.





[jira] [Created] (ARROW-3933) pyarrow segfault reading Parquet files from GNOMAD

2018-12-03 Thread David Konerding (JIRA)
David Konerding created ARROW-3933:
--

 Summary: pyarrow segfault reading Parquet files from GNOMAD
 Key: ARROW-3933
 URL: https://issues.apache.org/jira/browse/ARROW-3933
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
 Environment: Ubuntu 18.04 or Mac OS X
Reporter: David Konerding


I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM (AWS). 
The error also occurs out of the box on Mac OS X.

$ sudo snap install --classic google-cloud-sdk
$ gsutil cp 
gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
 .
$ conda install pyarrow
$ python test.py
Segmentation fault (core dumped)

test.py:

import pyarrow.parquet as pq
path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
pq.read_table(path)

gdb output:

Thread 3 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdf199700 (LWP 13703)]
0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
unsigned long*) () from 
/home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11

I tested fastparquet; it reads the file just fine.





[jira] [Commented] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps

2018-12-03 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708059#comment-16708059
 ] 

Wes McKinney commented on ARROW-3907:
-

ETL can be a messy business. If you have ideas about improving the APIs for 
schema coercion / casting, I'd be interested to discuss more.

> [Python] from_pandas errors when schemas are used with lower resolution 
> timestamps
> --
>
> Key: ARROW-3907
> URL: https://issues.apache.org/jira/browse/ARROW-3907
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: David Lee
>Priority: Major
> Fix For: 0.11.1
>
>
> When passing a schema object to from_pandas, a resolution error occurs if 
> the schema uses a lower-resolution timestamp. Do we also need to add the 
> "coerce_timestamps" and "allow_truncated_timestamps" parameters found in 
> write_table() to from_pandas()?
> Error:
> pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would 
> lose data: 1532015191753713000', 'Conversion failed for column modified with 
> type datetime64[ns]')
> Code:
>  
> {code:java}
> processed_schema = pa.schema([
> pa.field('Id', pa.string()),
> pa.field('modified', pa.timestamp('ms')),
> pa.field('records', pa.int32())
> ])
> pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
> {code}
>  





[jira] [Resolved] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3874.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 3072
[https://github.com/apache/arrow/pull/3072]

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake, pull-request-available
> Fix For: 0.12.0
>
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}





[jira] [Assigned] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3874:
---

Assignee: Suvayu Ali

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Assignee: Suvayu Ali
>Priority: Major
>  Labels: cmake, pull-request-available
> Fix For: 0.12.0
>
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}





[jira] [Assigned] (ARROW-3906) [C++] Break builder.cc into multiple compilation units

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3906:
---

Assignee: Antoine Pitrou

> [C++] Break builder.cc into multiple compilation units
> --
>
> Key: ARROW-3906
> URL: https://issues.apache.org/jira/browse/ARROW-3906
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> To improve readability I suggest splitting {{builder.cc}} into independent 
> compilation units. Concrete builder classes are generally independent of each 
> other. The only concern is whether inlining some of the base class 
> implementations is important for performance.
> This would also make incremental compilation faster when changing one of the 
> concrete classes.





[jira] [Resolved] (ARROW-3906) [C++] Break builder.cc into multiple compilation units

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3906.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 3076
[https://github.com/apache/arrow/pull/3076]

> [C++] Break builder.cc into multiple compilation units
> --
>
> Key: ARROW-3906
> URL: https://issues.apache.org/jira/browse/ARROW-3906
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> To improve readability I suggest splitting {{builder.cc}} into independent 
> compilation units. Concrete builder classes are generally independent of each 
> other. The only concern is whether inlining some of the base class 
> implementations is important for performance.
> This would also make incremental compilation faster when changing one of the 
> concrete classes.





[jira] [Resolved] (ARROW-3884) [Python] Add LLVM6 to manylinux1 base image

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3884.
-
Resolution: Fixed

Issue resolved by pull request 3079
[https://github.com/apache/arrow/pull/3079]

> [Python] Add LLVM6 to manylinux1 base image
> ---
>
> Key: ARROW-3884
> URL: https://issues.apache.org/jira/browse/ARROW-3884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is necessary to be able to build and bundle libgandiva with the 0.12 
> release.
> This (epic!) build definition in Apache Kudu may be useful for building only 
> the pieces that we need for linking the Gandiva libraries, which may help 
> keep the image size minimal:
> https://github.com/apache/kudu/blob/master/thirdparty/build-definitions.sh#L175





[jira] [Updated] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3874:
--
Labels: cmake pull-request-available  (was: cmake)

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake, pull-request-available
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}





[jira] [Resolved] (ARROW-3199) [Plasma] Check for EAGAIN in recvmsg and sendmsg

2018-12-03 Thread Philipp Moritz (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz resolved ARROW-3199.
---
Resolution: Fixed

Issue resolved by pull request 2551
[https://github.com/apache/arrow/pull/2551]

> [Plasma] Check for EAGAIN in recvmsg and sendmsg
> 
>
> Key: ARROW-3199
> URL: https://issues.apache.org/jira/browse/ARROW-3199
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> It turns out that 
> [https://github.com/apache/arrow/blob/673125fd416cbd2e5c2cb9cb6a4c925adecdaf2c/cpp/src/plasma/fling.cc#L63]
>  and probably also 
> [https://github.com/apache/arrow/blob/673125fd416cbd2e5c2cb9cb6a4c925adecdaf2c/cpp/src/plasma/fling.cc#L49]
>  can block and give an EAGAIN error.
> This was discovered during stress tests by https://github.com/stephanie-wang/
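The fix amounts to retrying the call when the kernel reports EAGAIN instead of treating it as fatal. The actual change is in the C++ fling.cc code; a minimal Python sketch of the same retry pattern on a socket send looks like this:

```python
import errno
import socket

def send_all(sock, data):
    """Keep sending until all bytes are written, retrying on EAGAIN/EWOULDBLOCK."""
    view = memoryview(data)
    while view:
        try:
            sent = sock.send(view)
            view = view[sent:]
        except OSError as exc:
            if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                # Transient: the socket buffer was full; retry the call.
                # Real code would poll for writability instead of spinning.
                continue
            raise
```

The same structure applies to recvmsg/sendmsg: only genuinely fatal errno values abort the transfer.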





[jira] [Resolved] (ARROW-2759) Export notification socket of Plasma

2018-12-03 Thread Philipp Moritz (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz resolved ARROW-2759.
---
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 3008
[https://github.com/apache/arrow/pull/3008]

> Export notification socket of Plasma
> 
>
> Key: ARROW-2759
> URL: https://issues.apache.org/jira/browse/ARROW-2759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++), Python
>Reporter: Siyuan Zhuang
>Assignee: Siyuan Zhuang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently, I am implementing an async interface for Ray. The implementation 
> needs some kind of message polling method like `get_next_notification`.
>  Unfortunately, I find `get_next_notification` in 
> `[https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx]` 
> blocking, which is an impediment to implementing async utilities. Also, it's 
> hard to check the status of the socket (it could be closed or broken). So I 
> suggest exporting the notification socket so that there will be more flexibility.
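With the socket exposed, non-blocking polling becomes straightforward. A sketch using the standard selectors module (a plain socket stands in here for the Plasma notification socket, which would be used the same way):

```python
import selectors
import socket

def poll_once(sock, timeout=0.1):
    """Return pending notification bytes, or None if nothing arrived in time."""
    sel = selectors.DefaultSelector()
    try:
        sel.register(sock, selectors.EVENT_READ)
        # select() returns an empty list on timeout, so this never blocks
        # longer than `timeout` seconds.
        if sel.select(timeout):
            return sock.recv(4096)
        return None
    finally:
        sel.close()
```

An async framework would register the same file descriptor with its event loop instead of calling select() directly.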





[jira] [Updated] (ARROW-3884) [Python] Add LLVM6 to manylinux1 base image

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3884:
--
Labels: pull-request-available  (was: )

> [Python] Add LLVM6 to manylinux1 base image
> ---
>
> Key: ARROW-3884
> URL: https://issues.apache.org/jira/browse/ARROW-3884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> This is necessary to be able to build and bundle libgandiva with the 0.12 
> release.
> This (epic!) build definition in Apache Kudu may be useful for building only 
> the pieces that we need for linking the Gandiva libraries, which may help 
> keep the image size minimal:
> https://github.com/apache/kudu/blob/master/thirdparty/build-definitions.sh#L175





[jira] [Created] (ARROW-3932) [Python/Documentation] Include Benchmarks.md in Sphinx docs

2018-12-03 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-3932:
--

 Summary: [Python/Documentation] Include Benchmarks.md in Sphinx 
docs
 Key: ARROW-3932
 URL: https://issues.apache.org/jira/browse/ARROW-3932
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn


https://github.com/apache/arrow/pull/2856#issuecomment-443711136





[jira] [Resolved] (ARROW-3842) [R] RecordBatchStreamWriter api

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3842.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 3043
[https://github.com/apache/arrow/pull/3043]

> [R] RecordBatchStreamWriter api
> ---
>
> Key: ARROW-3842
> URL: https://issues.apache.org/jira/browse/ARROW-3842
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Romain François
>Assignee: Romain François
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To support the "Writing and Reading Streams" section of the vignette, perhaps 
> we should rely more on the RecordBatchStreamWriter class and less on the 
> `write_record_batch` function. 
> We should be able to write code resembling the python api : 
> {code:r}
> batch <- ... 
> sink <- buffer_output_stream()
> writer <- record_batch_stream_writer(sink, batch$schema())
> writer$write_batch()
> writer$close()
> sink$getvalue()
> {code}
> Most of the code is there, but we need to add 
> - RecordBatchStreamWriter$write_batch() : write a record batch to the stream. 
> We already have RecordBatchStreamWriter$WriteRecordBatch
> - RecordBatchStreamWriter$close() : not sure why it is lower case close() in 
> python but upper case in C++. We already have RecordBatchWriter$Close()
> - BufferOutputStream$getvalue() : we already have BufferOutputStream$Finish()
> Currently the constructor for a BufferOutputStream is buffer_output_stream(), 
> perhaps we can align with python and make it BufferOutputStream, that would 
> not clash with the `arrow::BufferOutputStream` class because of the 
> namespacing. 





[jira] [Commented] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals

2018-12-03 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707796#comment-16707796
 ] 

Wes McKinney commented on ARROW-3586:
-

You might want to do that in a different conda environment.

> [Python] Segmentation fault when converting empty table to pandas with 
> categoricals
> ---
>
> Key: ARROW-3586
> URL: https://issues.apache.org/jira/browse/ARROW-3586
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0, 0.11.0
> Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas 
> 0.23.4
> - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4
>Reporter: Andreas
>Priority: Major
> Fix For: 0.12.0
>
>
> {code:java}
> import pyarrow as pa
> table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], 
> names=['col'])
> table.to_pandas(categories=['col']){code}
> This produces a segmentation fault for certain types (e.g., int\{32,64}) while 
> it works for others (e.g., string, binary).





[jira] [Commented] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals

2018-12-03 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707795#comment-16707795
 ] 

Wes McKinney commented on ARROW-3586:
-

You can pip install the 0.11 or 0.11.1 wheel and check: {{pip install 
pyarrow==0.11.0}}

> [Python] Segmentation fault when converting empty table to pandas with 
> categoricals
> ---
>
> Key: ARROW-3586
> URL: https://issues.apache.org/jira/browse/ARROW-3586
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0, 0.11.0
> Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas 
> 0.23.4
> - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4
>Reporter: Andreas
>Priority: Major
> Fix For: 0.12.0
>
>
> {code:java}
> import pyarrow as pa
> table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], 
> names=['col'])
> table.to_pandas(categories=['col']){code}
> This produces a segmentation fault for certain types (e.g., int\{32,64}) while 
> it works for others (e.g., string, binary).





[jira] [Commented] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals

2018-12-03 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707703#comment-16707703
 ] 

Francois Saint-Jacques commented on ARROW-3586:
---

Is it possible this was solved in the master branch? I can't seem to 
reproduce it locally.

```
for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]:
  print(pa.Table.from_arrays(arrays=[pa.array([], type=t)], 
names=['col']).to_pandas(categories=['col'])) 
Empty DataFrame
Columns: [col]
Index: []
Empty DataFrame
Columns: [col]
Index: []
Empty DataFrame
Columns: [col]
Index: []
Empty DataFrame
Columns: [col]
Index: []

for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]:
  print(pa.Table.from_arrays(arrays=[pa.array([1,2,3], type=t)], 
names=['col']).to_pandas(categories=['col']))
 col
0 1
1 2
2 3
 col
0 1
1 2
2 3
 col
0 1.0
1 2.0
2 3.0
 col
0 1.0
1 2.0
2 3.0

```

> [Python] Segmentation fault when converting empty table to pandas with 
> categoricals
> ---
>
> Key: ARROW-3586
> URL: https://issues.apache.org/jira/browse/ARROW-3586
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0, 0.11.0
> Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas 
> 0.23.4
> - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4
>Reporter: Andreas
>Priority: Major
> Fix For: 0.12.0
>
>
> {code:java}
> import pyarrow as pa
> table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], 
> names=['col'])
> table.to_pandas(categories=['col']){code}
> This produces a segmentation fault for certain types (e.g., int\{32,64}) while 
> it works for others (e.g., string, binary).





[jira] [Commented] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals

2018-12-03 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707706#comment-16707706
 ] 

Francois Saint-Jacques commented on ARROW-3586:
---

Note that I was using Python 3; I'm not sure if this has any impact.

> [Python] Segmentation fault when converting empty table to pandas with 
> categoricals
> ---
>
> Key: ARROW-3586
> URL: https://issues.apache.org/jira/browse/ARROW-3586
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0, 0.11.0
> Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas 
> 0.23.4
> - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4
>Reporter: Andreas
>Priority: Major
> Fix For: 0.12.0
>
>
> {code:java}
> import pyarrow as pa
> table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], 
> names=['col'])
> table.to_pandas(categories=['col']){code}
> This produces a segmentation fault for certain types (e.g., int\{32,64}) while 
> it works for others (e.g., string, binary).





[jira] [Comment Edited] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals

2018-12-03 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707703#comment-16707703
 ] 

Francois Saint-Jacques edited comment on ARROW-3586 at 12/3/18 7:49 PM:


Is it possible this was solved in the master branch? I can't seem to 
reproduce it locally.

 
{code:java}
for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]:
   print(pa.Table.from_arrays(arrays=[pa.array([], type=t)], 
names=['col']).to_pandas(categories=['col'])) 
 Empty DataFrame
 Columns: [col]
 Index: []
 Empty DataFrame
 Columns: [col]
 Index: []
 Empty DataFrame
 Columns: [col]
 Index: []
 Empty DataFrame
 Columns: [col]
 Index: []
for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]:
  print(pa.Table.from_arrays(arrays=[pa.array([1,2,3], type=t)], 
names=['col']).to_pandas(categories=['col']))
 col
 0 1
 1 2
 2 3
 col
 0 1
 1 2
 2 3
 col
 0 1.0
 1 2.0
 2 3.0
 col
 0 1.0
 1 2.0
 2 3.0
{code}
 

 


was (Author: fsaintjacques):
Is this possible this was solved in the master branch? I can't seem to 
reproduce locally.

```
for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]:
  print(pa.Table.from_arrays(arrays=[pa.array([], type=t)], 
names=['col']).to_pandas(categories=['col'])) 
Empty DataFrame
Columns: [col]
Index: []
Empty DataFrame
Columns: [col]
Index: []
Empty DataFrame
Columns: [col]
Index: []
Empty DataFrame
Columns: [col]
Index: []

for t in [pa.int32(), pa.int64(), pa.float32(), pa.float64()]:
  print(pa.Table.from_arrays(arrays=[pa.array([1,2,3], type=t)], 
names=['col']).to_pandas(categories=['col']))
 col
0 1
1 2
2 3
 col
0 1
1 2
2 3
 col
0 1.0
1 2.0
2 3.0
 col
0 1.0
1 2.0
2 3.0

```

> [Python] Segmentation fault when converting empty table to pandas with 
> categoricals
> ---
>
> Key: ARROW-3586
> URL: https://issues.apache.org/jira/browse/ARROW-3586
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0, 0.11.0
> Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas 
> 0.23.4
> - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4
>Reporter: Andreas
>Priority: Major
> Fix For: 0.12.0
>
>
> {code:java}
> import pyarrow as pa
> table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], 
> names=['col'])
> table.to_pandas(categories=['col']){code}
> This produces a segmentation fault for certain types (e.g., int\{32,64}) while 
> it works for others (e.g., string, binary).





[jira] [Updated] (ARROW-2839) [JS] Support whatwg/streams in IPC reader/writer

2018-12-03 Thread Brian Hulette (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette updated ARROW-2839:
-
Fix Version/s: (was: JS-0.4.0)
   JS-0.5.0

> [JS] Support whatwg/streams in IPC reader/writer
> 
>
> Key: ARROW-2839
> URL: https://issues.apache.org/jira/browse/ARROW-2839
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: JS-0.3.1
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
> Fix For: JS-0.5.0
>
>
> We should make it easy to stream Arrow in the browser via 
> [whatwg/streams|https://github.com/whatwg/streams]. I already have this 
> working at Graphistry, but I had to use some of the IPC internal methods. 
> Creating this issue to track back-porting that work and the few minor 
> refactors to the IPC internals that we'll need to do.





[jira] [Commented] (ARROW-3470) [C++] Row-wise conversion tutorial has fallen out of date

2018-12-03 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707664#comment-16707664
 ] 

Francois Saint-Jacques commented on ARROW-3470:
---

See the attached PR for the difference in documentation (a single embedded code 
block with comments).

> [C++] Row-wise conversion tutorial has fallen out of date
> -
>
> Key: ARROW-3470
> URL: https://issues.apache.org/jira/browse/ARROW-3470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As reported on user@ list





[jira] [Updated] (ARROW-3470) [C++] Row-wise conversion tutorial has fallen out of date

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3470:
--
Labels: pull-request-available  (was: )

> [C++] Row-wise conversion tutorial has fallen out of date
> -
>
> Key: ARROW-3470
> URL: https://issues.apache.org/jira/browse/ARROW-3470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> As reported on user@ list





[jira] [Assigned] (ARROW-3303) [C++] Enable example arrays to be written with a simplified JSON representation

2018-12-03 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-3303:
-

Assignee: Antoine Pitrou

> [C++] Enable example arrays to be written with a simplified JSON 
> representation
> ---
>
> Key: ARROW-3303
> URL: https://issues.apache.org/jira/browse/ARROW-3303
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.13.0
>
>
> In addition to making it easier to generate random data as described in 
> ARROW-2329, I think it would be useful to reduce some of the boilerplate 
> associated with writing down explicit test cases. The benefits of this will 
> be especially pronounced when writing nested arrays. 
> Example code that could be improved this way:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array-test.cc#L3271
> Rather than having a ton of hand-written assertions, we could compare with 
> the expected true dataset. Of course, this itself has to be tested 
> endogenously, but I think we can write enough tests for the JSON parser bit 
> to have confidence in tests that are written with it.
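A minimal sketch of what such a helper could look like, in Python for brevity (the name {{array_from_json}} and the (values, validity) return shape are assumptions for illustration, not Arrow's actual API):

```python
import json

def array_from_json(text):
    # Parse a JSON list into a values list plus a validity list, treating
    # JSON null as a null slot. A real implementation would dispatch on the
    # target Arrow type and build real buffers; null slots get a dummy 0.
    data = json.loads(text)
    validity = [v is not None for v in data]
    values = [0 if v is None else v for v in data]
    return values, validity
```

A test could then declare its expected array as `array_from_json("[1, null, 3]")` instead of hand-writing per-slot assertions.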





[jira] [Closed] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps

2018-12-03 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee closed ARROW-3907.

   Resolution: Not A Problem
Fix Version/s: 0.11.1

Closing for now. I'm not convinced "safe" is the best way to address timestamp 
resolution. If a schema is used, it should be clear that the intent is to 
convert pandas nanoseconds to a lower resolution. I think the same can be said 
for other types of conversions, like floats to int.
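The "safe" check under discussion can be illustrated in plain Python: a ns-to-ms downcast is refused exactly when truncation would drop sub-millisecond digits. This is a sketch of the behavior, not pyarrow's code.

```python
def downcast_ns_to_ms(ns_values, safe=True):
    # Divide nanoseconds by 1_000_000 to get milliseconds. With safe=True,
    # refuse any value whose remainder is non-zero (mirroring the
    # "would lose data" error in this issue); with safe=False, truncate.
    out = []
    for ns in ns_values:
        ms, rem = divmod(ns, 1_000_000)
        if safe and rem != 0:
            raise ValueError(
                "Casting from timestamp[ns] to timestamp[ms] "
                "would lose data: %d" % ns)
        out.append(ms)
    return out
```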

> [Python] from_pandas errors when schemas are used with lower resolution 
> timestamps
> --
>
> Key: ARROW-3907
> URL: https://issues.apache.org/jira/browse/ARROW-3907
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: David Lee
>Priority: Major
> Fix For: 0.11.1
>
>
> When passing in a schema object to from_pandas a resolution error occurs if 
> the schema uses a lower resolution timestamp. Do we need to also add 
> "coerce_timestamps" and "allow_truncated_timestamps" parameters found in 
> write_table() to from_pandas()?
> Error:
> pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would 
> lose data: 1532015191753713000', 'Conversion failed for column modified with 
> type datetime64[ns]')
> Code:
>  
> {code:java}
> processed_schema = pa.schema([
> pa.field('Id', pa.string()),
> pa.field('modified', pa.timestamp('ms')),
> pa.field('records', pa.int32())
> ])
> pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
> {code}
>  





[jira] [Commented] (ARROW-3470) [C++] Row-wise conversion tutorial has fallen out of date

2018-12-03 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707519#comment-16707519
 ] 

Francois Saint-Jacques commented on ARROW-3470:
---

I've extracted the full example into a single file and added cmake 
functionality to build it (mimicking the benchmark/test facility). I'm 
wondering if it's OK to simplify the whole documented example into a single 
code block where the text is in comments.

> [C++] Row-wise conversion tutorial has fallen out of date
> -
>
> Key: ARROW-3470
> URL: https://issues.apache.org/jira/browse/ARROW-3470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.12.0
>
>
> As reported on user@ list





[jira] [Assigned] (ARROW-3470) [C++] Row-wise conversion tutorial has fallen out of date

2018-12-03 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-3470:
-

Assignee: Francois Saint-Jacques

> [C++] Row-wise conversion tutorial has fallen out of date
> -
>
> Key: ARROW-3470
> URL: https://issues.apache.org/jira/browse/ARROW-3470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.12.0
>
>
> As reported on user@ list





[jira] [Updated] (ARROW-3931) Make possible to build regardless of LANG

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3931:
--
Labels: pull-request-available  (was: )

> Make possible to build regardless of LANG
> -
>
> Key: ARROW-3931
> URL: https://issues.apache.org/jira/browse/ARROW-3931
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.0
>Reporter: Kousuke Saruta
>Priority: Minor
>  Labels: pull-request-available
>
> When building the C++ libs, CompilerInfo.cmake checks the version of the 
> compiler to be used by string-matching the output of {{gcc -v}} (or 
> {{clang -v}}).
> When LANG is not an English locale, the build fails because the string match 
> fails.
> The following is the case of ja_JP.UTF-8 (Japanese).
> {code}
> CMake Error at cmake_modules/CompilerInfo.cmake:92 (message):
>   Unknown compiler.  Version info:
>   組み込み spec を使用しています。
>   COLLECT_GCC=/usr/bin/c++
>   COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
>   ターゲット: x86_64-redhat-linux
>   configure 設定: ../configure --prefix=/usr --mandir=/usr/share/man
>   --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla
>   --enable-bootstrap --enable-shared --enable-threads=posix
>   --enable-checking=release --with-system-zlib --enable-__cxa_atexit
>   --disable-libunwind-exceptions --enable-gnu-unique-object
> 

[jira] [Created] (ARROW-3931) Make possible to build regardless of LANG

2018-12-03 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created ARROW-3931:
-

 Summary: Make possible to build regardless of LANG
 Key: ARROW-3931
 URL: https://issues.apache.org/jira/browse/ARROW-3931
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.12.0
Reporter: Kousuke Saruta


When building the C++ libs, CompilerInfo.cmake checks the version of the 
compiler to be used by string-matching the output of {{gcc -v}} (or 
{{clang -v}}).
When LANG is not an English locale, the build fails because the string match 
fails.
The following is the case of ja_JP.UTF-8 (Japanese).

{code}
CMake Error at cmake_modules/CompilerInfo.cmake:92 (message):
  Unknown compiler.  Version info:
  組み込み spec を使用しています。
  COLLECT_GCC=/usr/bin/c++
  COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
  ターゲット: x86_64-redhat-linux
  configure 設定: ../configure --prefix=/usr --mandir=/usr/share/man
  --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla
  --enable-bootstrap --enable-shared --enable-threads=posix
  --enable-checking=release --with-system-zlib --enable-__cxa_atexit
  --disable-libunwind-exceptions --enable-gnu-unique-object
  --enable-linker-build-id --with-linker-hash-style=gnu
  --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto

[jira] [Updated] (ARROW-3906) [C++] Break builder.cc into multiple compilation units

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3906:
--
Labels: pull-request-available  (was: )

> [C++] Break builder.cc into multiple compilation units
> --
>
> Key: ARROW-3906
> URL: https://issues.apache.org/jira/browse/ARROW-3906
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> To improve readability I suggest splitting {{builder.cc}} into independent 
> compilation units. Concrete builder classes are generally independent of each 
> other. The only concern is whether inlining some of the base class 
> implementations is important for performance.
> This would also make incremental compilation faster when changing one of the 
> concrete classes.





[jira] [Resolved] (ARROW-3853) [C++] Implement string to timestamp cast

2018-12-03 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3853.
-
   Resolution: Fixed
Fix Version/s: (was: 0.13.0)
   0.12.0

Issue resolved by pull request 3044
[https://github.com/apache/arrow/pull/3044]

> [C++] Implement string to timestamp cast
> 
>
> Key: ARROW-3853
> URL: https://issues.apache.org/jira/browse/ARROW-3853
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Companion work to ARROW-3738





[jira] [Updated] (ARROW-3930) [C++] Random test data generation is slow

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3930:
--
Labels: pull-request-available  (was: )

> [C++] Random test data generation is slow
> -
>
> Key: ARROW-3930
> URL: https://issues.apache.org/jira/browse/ARROW-3930
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> It seems a non-negligible amount of time in the test suite is spent in the 
> Mersenne Twister random engine.





[jira] [Created] (ARROW-3930) [C++] Random test data generation is slow

2018-12-03 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3930:
-

 Summary: [C++] Random test data generation is slow
 Key: ARROW-3930
 URL: https://issues.apache.org/jira/browse/ARROW-3930
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.11.1
Reporter: Antoine Pitrou


It seems a non-negligible amount of time in the test suite is spent in the 
Mersenne Twister random engine.
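As an illustration of why the engine choice matters, a Marsaglia xorshift-style generator does far less work per draw than MT19937's 624-word state machine and is usually adequate for random *test* data. This is a sketch in Python of the technique, not the proposed C++ change.

```python
def xorshift64(seed=88172645463325252):
    # xorshift64: three shift/xor steps per draw. Fast and simple; fine
    # for generating test data, unsuitable for statistics or cryptography.
    x = seed & 0xFFFFFFFFFFFFFFFF
    while True:
        x ^= (x << 13) & 0xFFFFFFFFFFFFFFFF
        x ^= x >> 7
        x ^= (x << 17) & 0xFFFFFFFFFFFFFFFF
        yield x
```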





[jira] [Updated] (ARROW-3929) [Go] improve memory usage of CSV reader to improve runtime performances

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3929:
--
Labels: pull-request-available  (was: )

> [Go] improve memory usage of CSV reader to improve runtime performances
> ---
>
> Key: ARROW-3929
> URL: https://issues.apache.org/jira/browse/ARROW-3929
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Sebastien Binet
>Assignee: Sebastien Binet
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-12-03 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707086#comment-16707086
 ] 

Suvayu Ali commented on ARROW-3874:
---

Done: [https://github.com/apache/arrow/pull/3072]

Your question about {{jni.h}} gave me enough hints to find the correct missing 
package :), and now the build progresses until it fails with:

{code}
Scanning dependencies of target csv-chunker-test
CMakeFiles/json-integration-test.dir/json-integration-test.cc.o:json-integration-test.cc:function
 boost::system::error_category::std_category::equivalent(std::error_code 
const&, int) const:
error: undefined reference to 'boost::system::detail::generic_category_ncx()'
{code}

This is strange because I have {{boost-system-1.66.0-14.fc29.x86_64}} installed 
on my system. But I guess that's only a test, and the libraries themselves were 
built successfully.

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT              FuzzMutate          Object                 Support
> Analysis         InitializePasses.h  ObjectYAML             TableGen
> AsmParser        IR                  Option                 Target
> BinaryFormat     IRReader            PassAnalysisSupport.h  Testing
> Bitcode          LineEditor          Passes                 ToolDrivers
> CodeGen          LinkAllIR.h         Pass.h                 Transforms
> Config           LinkAllPasses.h     PassInfo.h             WindowsManifest
> DebugInfo        Linker              PassRegistry.h         WindowsResource
> Demangle         LTO                 PassSupport.h          XRay
> ExecutionEngine  MC                  ProfileData
> {code}





[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-12-03 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707008#comment-16707008
 ] 

Pindikura Ravindra commented on ARROW-3874:
---

The LLVM-related change looks good. Would you like to raise a PR?

For the Java issue, can you please check whether you have a {{jni.h}} file in 
the JDK install directory?

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT              FuzzMutate          Object                 Support
> Analysis         InitializePasses.h  ObjectYAML             TableGen
> AsmParser        IR                  Option                 Target
> BinaryFormat     IRReader            PassAnalysisSupport.h  Testing
> Bitcode          LineEditor          Passes                 ToolDrivers
> CodeGen          LinkAllIR.h         Pass.h                 Transforms
> Config           LinkAllPasses.h     PassInfo.h             WindowsManifest
> DebugInfo        Linker              PassRegistry.h         WindowsResource
> Demangle         LTO                 PassSupport.h          XRay
> ExecutionEngine  MC                  ProfileData
> {code}





[jira] [Created] (ARROW-3929) [Go] improve memory usage of CSV reader to improve runtime performances

2018-12-03 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-3929:
--

 Summary: [Go] improve memory usage of CSV reader to improve 
runtime performances
 Key: ARROW-3929
 URL: https://issues.apache.org/jira/browse/ARROW-3929
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Sebastien Binet
Assignee: Sebastien Binet








[jira] [Updated] (ARROW-3681) [Go] add benchmarks for CSV reader

2018-12-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3681:
--
Labels: pull-request-available  (was: )

> [Go] add benchmarks for CSV reader
> --
>
> Key: ARROW-3681
> URL: https://issues.apache.org/jira/browse/ARROW-3681
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Sebastien Binet
>Assignee: Sebastien Binet
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (ARROW-3916) [Python] Support caller-provided filesystem in `ParquetWriter` constructor

2018-12-03 Thread Mackenzie (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706856#comment-16706856
 ] 

Mackenzie commented on ARROW-3916:
--

Yep! Here it is: https://github.com/apache/arrow/pull/3070

> [Python] Support caller-provided filesystem in `ParquetWriter` constructor
> --
>
> Key: ARROW-3916
> URL: https://issues.apache.org/jira/browse/ARROW-3916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Mackenzie
>Priority: Major
>  Labels: parquet
> Fix For: 0.13.0
>
>
> Currently, to write files incrementally to S3, the following pattern appears 
> necessary:
> {code:python}
> def write_dfs_to_s3(dfs, fname):
>     first_df = dfs[0]
>     table = pa.Table.from_pandas(first_df, preserve_index=False)
>     fs = s3fs.S3FileSystem()
>     fh = fs.open(fname, 'wb')
>     with pq.ParquetWriter(fh, table.schema) as writer:
>         # set file handle on writer so the writer manages closing it
>         # when it is itself closed
>         writer.file_handle = fh
>         writer.write_table(table=table)
>         for df in dfs[1:]:
>             table = pa.Table.from_pandas(df, preserve_index=False)
>             writer.write_table(table=table)
> {code}
> This works as expected, but is quite roundabout. It would be much easier if 
> `ParquetWriter` supported `filesystem` as a keyword argument in its 
> constructor, in which case `_get_fs_from_path` would be overridden by the 
> usual pattern of using the kwarg after ensuring it is a proper file system 
> with `_ensure_filesystem`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3681) [Go] add benchmarks for CSV reader

2018-12-03 Thread Sebastien Binet (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastien Binet reassigned ARROW-3681:
--

Assignee: Sebastien Binet

> [Go] add benchmarks for CSV reader
> --
>
> Key: ARROW-3681
> URL: https://issues.apache.org/jira/browse/ARROW-3681
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Sebastien Binet
>Assignee: Sebastien Binet
>Priority: Major
>



