[jira] [Commented] (ARROW-2115) [JS] Test arrow data production in integration test

2018-05-12 Thread Paul Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473357#comment-16473357
 ] 

Paul Taylor commented on ARROW-2115:


https://github.com/apache/arrow/pull/2035

> [JS] Test arrow data production in integration test
> ---
>
> Key: ARROW-2115
> URL: https://issues.apache.org/jira/browse/ARROW-2115
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Major
>
> Currently the integration tests only treat the JS implementation as a 
> consumer, but we also need to test its ability to produce Arrow data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-987) [JS] Implement JSON writer for Integration tests

2018-05-12 Thread Paul Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473356#comment-16473356
 ] 

Paul Taylor commented on ARROW-987:
---

[~wesmckinn] I believe this is done.

> [JS] Implement JSON writer for Integration tests
> 
>
> Key: ARROW-987
> URL: https://issues.apache.org/jira/browse/ARROW-987
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Major
> Fix For: JS-0.3.1
>
>
> Rather than storing generated binary files in the repo, we could just run the 
> integration tests on the JS implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-987) [JS] Implement JSON writer for Integration tests

2018-05-12 Thread Paul Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor resolved ARROW-987.
---
   Resolution: Fixed
Fix Version/s: (was: JS-0.4.0)
   JS-0.3.1

> [JS] Implement JSON writer for Integration tests
> 
>
> Key: ARROW-987
> URL: https://issues.apache.org/jira/browse/ARROW-987
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Major
> Fix For: JS-0.3.1
>
>
> Rather than storing generated binary files in the repo, we could just run the 
> integration tests on the JS implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2116) [JS] Implement IPC writer

2018-05-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2116:
--
Labels: pull-request-available  (was: )

> [JS] Implement IPC writer
> -
>
> Key: ARROW-2116
> URL: https://issues.apache.org/jira/browse/ARROW-2116
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2356) [JS] JSON reader fails on FixedSizeBinary data buffer

2018-05-12 Thread Paul Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor resolved ARROW-2356.

   Resolution: Fixed
Fix Version/s: (was: JS-0.4.0)
   JS-0.3.1

> [JS] JSON reader fails on FixedSizeBinary data buffer
> -
>
> Key: ARROW-2356
> URL: https://issues.apache.org/jira/browse/ARROW-2356
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: JS-0.3.1
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.3.1
>
>
> The JSON reader doesn't ingest the FixedSizeBinary data buffer correctly, and 
> we hadn't noticed because the JS integration test runner was accidentally 
> exiting with code 0 on failures.
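
A likely reason such a bug can go unnoticed: if the integration harness launches 
the JS validator as a subprocess and ignores its return code, the overall run 
still exits 0. A minimal Python sketch of propagating the child's exit status; 
the command line and the run_js_validate helper are illustrative, not the 
actual harness:

{code:python}
import subprocess
import sys

def run_js_validate(json_path, arrow_path):
    # check=True raises CalledProcessError on a non-zero exit, so a failing
    # JS validation can no longer be silently swallowed by the harness.
    subprocess.run(
        ["node", "bin/integration.js", "-j", json_path, "-a", arrow_path],
        check=True,
    )

if __name__ == "__main__":
    try:
        run_js_validate(sys.argv[1], sys.argv[2])
    except subprocess.CalledProcessError as exc:
        sys.exit(exc.returncode)  # propagate the failure to CI
{code}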



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2428) [Python] Support ExtensionArrays in to_pandas conversion

2018-05-12 Thread Alex Hagerman (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473253#comment-16473253
 ] 

Alex Hagerman commented on ARROW-2428:
--

[~xhochy] I was reading through the meta issue and trying to understand what we 
need to make sure to pass. Do you think this has settled enough to begin work? 
It appears pandas will expect a class defining the type; I'm guessing the 
objects in the Arrow column will be instances of that user type? Do we expect 
Arrow columns to meet all the requirements of ExtensionArray?

 

I was specifically looking at this to understand what options have to be passed 
and what the ExtensionArray requires.

https://github.com/pandas-dev/pandas/pull/19174/files#diff-e448fe09dbe8aed468d89a4c90e65cff

> [Python] Support ExtensionArrays in to_pandas conversion
> 
>
> Key: ARROW-2428
> URL: https://issues.apache.org/jira/browse/ARROW-2428
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0
>
>
> With the next release of Pandas, it will be possible to define custom column 
> types that back a {{pandas.Series}}. Thus we will not be able to cover all 
> possible column types in the {{to_pandas}} conversion by default as we won't 
> be aware of all extension arrays.
>
> To enable users to create {{ExtensionArray}} instances from Arrow columns in 
> the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}} 
> call where they can overload the default conversion routines with the ones 
> that produce their {{ExtensionArray}} instances.
>
> This should avoid additional copies in the case where we would nowadays first 
> convert the Arrow column into a default Pandas column (probably of object 
> type) and the user would afterwards convert it to a more efficient 
> {{ExtensionArray}}. This hook will be especially useful when you build 
> {{ExtensionArrays}} where the storage is backed by Arrow.
>
> The meta-issue that tracks the implementation inside of Pandas is: 
> https://github.com/pandas-dev/pandas/issues/19696
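
To make the proposed hook concrete, here is a rough Python sketch of the call 
shape being discussed. The {{extension_columns}} keyword, the 
{{period_converter}} helper, and the {{PeriodArray}} class are all hypothetical 
placeholders, not part of any released pyarrow or pandas API:

{code:python}
import pyarrow as pa

# Hypothetical stand-in for a pandas ExtensionArray subclass whose storage is
# backed by Arrow memory; real code would derive from
# pandas.api.extensions.ExtensionArray.
class PeriodArray:
    def __init__(self, arrow_column):
        self._data = arrow_column

def period_converter(arrow_column):
    # Build the ExtensionArray directly from the Arrow column, avoiding an
    # intermediate object-dtype pandas column and the extra copy it implies.
    return PeriodArray(arrow_column)

table = pa.Table.from_pydict({"period": pa.array([1, 2, 3], type=pa.int64())})

# Sketch of the proposed call shape: callers register per-column converters
# that override the default to_pandas conversion.
# df = table.to_pandas(extension_columns={"period": period_converter})
{code}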



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1644) [Python] Read and write nested Parquet data with a mix of struct and list nesting levels

2018-05-12 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473191#comment-16473191
 ] 

Joshua Storck commented on ARROW-1644:
--

The reading half of this issue is addressed by this: 
https://github.com/apache/parquet-cpp/pull/462. Perhaps we should split this 
into two separate issues?

> [Python] Read and write nested Parquet data with a mix of struct and list 
> nesting levels
> 
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 0.10.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> When I tried to load it with a nightly build of pyarrow from Oct 4, 2017, I 
> got the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> My impression is that once 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load nested Parquet files in pyarrow. 
> Any insight about this? 
> Thanks.
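
For reference, the Spark schema above corresponds to a 
list<struct<show_title_id, duration>> Arrow column. A minimal pyarrow sketch 
that constructs a table with that shape and attempts the round trip (file name 
and sample values are illustrative):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

items_type = pa.list_(pa.struct([
    ("show_title_id", pa.int32()),
    ("duration", pa.float64()),
]))

table = pa.Table.from_arrays(
    [
        pa.array([1, 2], type=pa.int64()),
        pa.array(["US", "CA"], type=pa.string()),
        pa.array([[{"show_title_id": 10, "duration": 1.5}], []], type=items_type),
    ],
    names=["profile_id", "country_iso_code", "items"],
)

pq.write_table(table, "nested.parquet")
# On 0.8.0-era builds this read raised ArrowNotImplementedError as shown above.
pq.read_table("nested.parquet")
{code}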



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1599) [Python] Unable to read Parquet files with list inside struct

2018-05-12 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473189#comment-16473189
 ] 

Joshua Storck commented on ARROW-1599:
--

This PR should address this: https://github.com/apache/parquet-cpp/pull/462. 
[~JKung], could you possibly test out that version or provide a sample file 
that you are trying to read so that I can add it to the unit tests?

> [Python] Unable to read Parquet files with list inside struct
> -
>
> Key: ARROW-1599
> URL: https://issues.apache.org/jira/browse/ARROW-1599
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Ubuntu
>Reporter: Jovann Kung
>Priority: Major
> Fix For: 0.10.0
>
>
> Is PyArrow currently unable to read in Parquet files with a vector as a 
> column? For example, the schema of such a file is below:
> {code}
> mbc: FLOAT
> deltae: FLOAT
> labels: FLOAT
> features.type: INT32 INT_8
> features.size: INT32
> features.indices.list.element: INT32
> features.values.list.element: DOUBLE
> {code}
> Using either pq.read_table() or pq.ParquetDataset('/path/to/parquet').read() 
> yields the following error: ArrowNotImplementedError: Currently only nesting 
> with Lists is supported.
> From the error I assume that this may be implemented in a future release?
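
For reference, the flattened schema above describes a struct column whose 
fields include two list children (the layout Spark ML uses for sparse vectors). 
A minimal pyarrow sketch of building that type, with illustrative values:

{code:python}
import pyarrow as pa

features_type = pa.struct([
    ("type", pa.int8()),
    ("size", pa.int32()),
    ("indices", pa.list_(pa.int32())),
    ("values", pa.list_(pa.float64())),
])

# One sparse vector: size 3, non-zero entries at indices 0 and 2.
features = pa.array(
    [{"type": 0, "size": 3, "indices": [0, 2], "values": [1.0, 2.0]}],
    type=features_type,
)
{code}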



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2500) [Java] IPC Writers/readers are not always setting validity bits correctly

2018-05-12 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-2500:
--

Assignee: Bo Meng

> [Java] IPC Writers/readers are not always setting validity bits correctly
> -
>
> Key: ARROW-2500
> URL: https://issues.apache.org/jira/browse/ARROW-2500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Emilio Lahr-Vivaz
>Assignee: Bo Meng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> When writing multiple batches to a Stream/File Writer, the first validity bit 
> can get garbled between writing and reading. I couldn't pinpoint the exact 
> issue, but I was able to re-create it with a fairly simple unit test.
> in TestArrowStream.java:
> {code:java}
>   @Test
>   public void testReadWriteMultipleBatches() throws IOException {
> ByteArrayOutputStream os = new ByteArrayOutputStream();
> try (IntVector vector = new IntVector("foo", allocator);) {
>   Schema schema = new 
> Schema(Collections.singletonList(vector.getField()), null);
>   try (VectorSchemaRoot root = new VectorSchemaRoot(schema, 
> Collections.singletonList((FieldVector) vector), vector.getValueCount());
>ArrowStreamWriter writer = new ArrowStreamWriter(root, new 
> MapDictionaryProvider(), Channels.newChannel(os));) {
> writer.start();
> vector.setNull(0);
> vector.setSafe(1, 1);
> vector.setSafe(2, 2);
> vector.setNull(3);
> vector.setSafe(4, 1);
> vector.setValueCount(5);
> root.setRowCount(5);
> writer.writeBatch();
> vector.setNull(0);
> vector.setSafe(1, 1);
> vector.setSafe(2, 2);
> vector.setValueCount(3);
> root.setRowCount(3);
> writer.writeBatch();
>   }
> }
> ByteArrayInputStream in = new ByteArrayInputStream(os.toByteArray());
> try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator);) {
>   IntVector read = (IntVector) 
> reader.getVectorSchemaRoot().getFieldVectors().get(0);
>   reader.loadNextBatch();
>   assertEquals(read.getValueCount(), 5);
>   assertNull(read.getObject(0));
>   assertEquals(read.getObject(1), Integer.valueOf(1));
>   assertEquals(read.getObject(2), Integer.valueOf(2));
>   assertNull(read.getObject(3));
>   assertEquals(read.getObject(4), Integer.valueOf(1));
>   reader.loadNextBatch();
>   assertEquals(read.getValueCount(), 3);
>   assertNull(read.getObject(0));
>   assertEquals(read.getObject(1), Integer.valueOf(1));
>   assertEquals(read.getObject(2), Integer.valueOf(2));
> }
>   }
> {code}
> in TestArrowFile.java:
> {code}
>  @Test
>   public void testReadWriteMultipleBatches() throws IOException {
> File file = new File("target/mytest_nulls_multibatch.arrow");
> try (IntVector vector = new IntVector("foo", allocator);) {
>   Schema schema = new 
> Schema(Collections.singletonList(vector.getField()), null);
>   try (FileOutputStream fileOutputStream = new FileOutputStream(file);
>VectorSchemaRoot root = new VectorSchemaRoot(schema, 
> Collections.singletonList((FieldVector) vector), vector.getValueCount());
>ArrowFileWriter writer = new ArrowFileWriter(root, new 
> MapDictionaryProvider(), fileOutputStream.getChannel());) {
> writer.start();
> vector.setNull(0);
> vector.setSafe(1, 1);
> vector.setSafe(2, 2);
> vector.setNull(3);
> vector.setSafe(4, 1);
> vector.setValueCount(5);
> root.setRowCount(5);
> writer.writeBatch();
> vector.setNull(0);
> vector.setSafe(1, 1);
> vector.setSafe(2, 2);
> vector.setValueCount(3);
> root.setRowCount(3);
> writer.writeBatch();
>   }
> }
> try (FileInputStream fileInputStream = new FileInputStream(file);
>  ArrowFileReader reader = new 
> ArrowFileReader(fileInputStream.getChannel(), allocator);) {
>   IntVector read = (IntVector) 
> reader.getVectorSchemaRoot().getFieldVectors().get(0);
>   reader.loadNextBatch();
>   assertEquals(read.getValueCount(), 5);
>   assertNull(read.getObject(0));
>   assertEquals(read.getObject(1), Integer.valueOf(1));
>   assertEquals(read.getObject(2), Integer.valueOf(2));
>   assertNull(read.getObject(3));
>   assertEquals(read.getObject(4), Integer.valueOf(1));
>   reader.loadNextBatch();
>   assertEquals(read.getValueCount(), 3);
>   assertNull(read.getObject(0));
>   assertEquals(read.getObject(1), Integer.valueOf(1));
>  

[jira] [Resolved] (ARROW-2567) [C++/Python] Unit is ignored on comparison of TimestampArrays

2018-05-12 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-2567.

Resolution: Fixed

Issue resolved by pull request 2025
[https://github.com/apache/arrow/pull/2025]

> [C++/Python] Unit is ignored on comparison of TimestampArrays
> -
>
> Key: ARROW-2567
> URL: https://issues.apache.org/jira/browse/ARROW-2567
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.9.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Just ran into this:
> {code:java}
> ipdb> p py_array
> 
> [
> Timestamp('1970-01-01 00:00:00'),
> Timestamp('1970-01-01 00:00:00.1'),
> Timestamp('1970-01-01 00:00:00.2'),
> Timestamp('1970-01-01 00:00:00.3'),
> ipdb> p jvm_array
> 
> [
> Timestamp('1970-01-01 00:00:00'),
> Timestamp('1970-01-01 00:00:01'),
> Timestamp('1970-01-01 00:00:02'),
> Timestamp('1970-01-01 00:00:03'),
> ipdb> py_array.equals(jvm_array)
> True{code}
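
For reference, a minimal pyarrow sketch of the comparison in question (values 
are illustrative): two arrays built from the same integer ticks but with 
different timestamp units should not compare equal once the unit is taken into 
account.

{code:python}
import pyarrow as pa

ticks = [0, 1, 2, 3]
seconds = pa.array(ticks, type=pa.timestamp("s"))
millis = pa.array(ticks, type=pa.timestamp("ms"))

# Before the fix the unit was ignored and this reported True even though
# 1 second != 1 millisecond; with ARROW-2567 resolved it should be False.
print(seconds.equals(millis))
{code}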



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)