[jira] [Created] (ARROW-7278) [C++][Gandiva] Implement Boyer-Moore string search algorithm for functions doing string matching

2019-11-28 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-7278:
-

 Summary: [C++][Gandiva] Implement Boyer-Moore string search 
algorithm for functions doing string matching
 Key: ARROW-7278
 URL: https://issues.apache.org/jira/browse/ARROW-7278
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Projjal Chanda
Assignee: Projjal Chanda


Discussed in https://github.com/apache/arrow/pull/5902#discussion_r351159392
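For reference, a minimal Python sketch of the Boyer-Moore-Horspool variant (the 
bad-character rule only); the eventual Gandiva implementation would live in C++, 
and all names below are illustrative:

{code}
def build_shift_table(pattern: bytes) -> list:
    # Bad-character rule: by default a mismatch lets us skip the whole pattern;
    # bytes that occur in the pattern allow a smaller skip.
    m = len(pattern)
    table = [m] * 256
    for i in range(m - 1):
        table[pattern[i]] = m - 1 - i
    return table


def horspool_search(text: bytes, pattern: bytes) -> int:
    # Return the index of the first occurrence of pattern in text, or -1.
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    shift = build_shift_table(pattern)
    i = 0
    while i + m <= n:
        if text[i:i + m] == pattern:
            return i
        # Skip ahead based on the last byte of the current window.
        i += shift[text[i + m - 1]]
    return -1


assert horspool_search(b"is_substr needle in haystack", b"needle") == 10
assert horspool_search(b"is_substr needle in haystack", b"missing") == -1
{code}

The shift table is built once per pattern, so functions that apply the same 
pattern to every row (e.g. a substring or LIKE-style match) avoid re-comparing 
most of the haystack.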



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7277) [Document] Add discussion about vector lifecycle

2019-11-28 Thread Liya Fan (Jira)
Liya Fan created ARROW-7277:
---

 Summary: [Document] Add discussion about vector lifecycle
 Key: ARROW-7277
 URL: https://issues.apache.org/jira/browse/ARROW-7277
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


As discussed in 
https://issues.apache.org/jira/browse/ARROW-7254?focusedCommentId=16983284=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16983284,
 we need a discussion about the lifecycle of a vector.

Each vector has a lifecycle, and different operations should be performed in 
particular phases of that lifecycle. Violating this can produce unexpected 
results and confuse Arrow users, so we want to add a new section to the prose 
documentation that makes the lifecycle clear and explicit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7276) [Ruby] Add support for building Arrow::ListArray from [[...]]

2019-11-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7276:
--
Labels: pull-request-available  (was: )

> [Ruby] Add support for building Arrow::ListArray from [[...]]
> -
>
> Key: ARROW-7276
> URL: https://issues.apache.org/jira/browse/ARROW-7276
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7276) [Ruby] Add support for building Arrow::ListArray from [[...]]

2019-11-28 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7276:
---

 Summary: [Ruby] Add support for building Arrow::ListArray from 
[[...]]
 Key: ARROW-7276
 URL: https://issues.apache.org/jira/browse/ARROW-7276
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7275) [Ruby] Add support for Arrow::ListDataType.new(data_type)

2019-11-28 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7275:
---

 Summary: [Ruby] Add support for Arrow::ListDataType.new(data_type)
 Key: ARROW-7275
 URL: https://issues.apache.org/jira/browse/ARROW-7275
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7275) [Ruby] Add support for Arrow::ListDataType.new(data_type)

2019-11-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7275:
--
Labels: pull-request-available  (was: )

> [Ruby] Add support for Arrow::ListDataType.new(data_type)
> -
>
> Key: ARROW-7275
> URL: https://issues.apache.org/jira/browse/ARROW-7275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2019-11-28 Thread Andy Thomason (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984568#comment-16984568
 ] 

Andy Thomason commented on ARROW-5949:
--

I've implemented this in two of our internal I/O libraries at work and should 
be able to help out if I get the time. I've sent a test generator to Andy, 
which should help. We have a huge repository of Arrow files to test it on.

> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.
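For context, a minimal pyarrow sketch of what dictionary encoding looks like on 
the Arrow side (integer indices into a dictionary of distinct values); the Rust 
API would differ, this is only meant to illustrate the layout:

{code}
import pyarrow as pa

# A dictionary-encoded array stores small integer indices plus a dictionary
# of the distinct values, which is what a Rust DictionaryArray would model.
arr = pa.array(["apple", "banana", "apple", "apple"]).dictionary_encode()

assert arr.indices.to_pylist() == [0, 1, 0, 0]
assert arr.dictionary.to_pylist() == ["apple", "banana"]
{code}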



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7274) [C++] Add Result APIs to Decimal class

2019-11-28 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7274:
-

 Summary: [C++] Add Result APIs to Decimal class
 Key: ARROW-7274
 URL: https://issues.apache.org/jira/browse/ARROW-7274
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7248) [Rust] Automatically Regenerate IPC messages from Flatbuffers

2019-11-28 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-7248:
--
Summary: [Rust] Automatically Regenerate IPC messages from Flatbuffers  
(was: Automatically Regenerate IPC messages from Flatbuffers)

> [Rust] Automatically Regenerate IPC messages from Flatbuffers
> -
>
> Key: ARROW-7248
> URL: https://issues.apache.org/jira/browse/ARROW-7248
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Martin Grund
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> It would be great if there were an automatic way to regenerate the code from 
> the Flatbuffer input files. This would make following mainline development 
> easier.
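A rough sketch of the kind of regeneration step being asked for, assuming flatc 
with Rust support is on the PATH; the output directory here is only a guess at 
where the generated module would live:

{code}
import glob
import subprocess

# Regenerate the Rust bindings from the checked-in Arrow IPC flatbuffer schemas.
for fbs in sorted(glob.glob("format/*.fbs")):
    subprocess.run(
        ["flatc", "--rust", "-o", "rust/arrow/src/ipc/gen/", fbs],
        check=True,
    )
{code}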



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6891) [Rust] [Parquet] Add Utf8 support to ArrowReader

2019-11-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6891:
--
Labels: pull-request-available  (was: )

> [Rust] [Parquet] Add Utf8 support to ArrowReader 
> -
>
> Key: ARROW-6891
> URL: https://issues.apache.org/jira/browse/ARROW-6891
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Andy Grove
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Add Utf8 support to ArrowReader



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7273) [Python] Non-nullable null field is allowed / crashes when writing to parquet

2019-11-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7273:
-
Labels: parquet  (was: )

> [Python] Non-nullable null field is allowed / crashes when writing to parquet
> -
>
> Key: ARROW-7273
> URL: https://issues.apache.org/jira/browse/ARROW-7273
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet
>
> It seems to be possible to create a "non-nullable null field". While this 
> does not make any sense (so already a reason to disallow this, I think), this 
> can also lead to crashes in further operations, such as writing to parquet:
> {code}
> In [18]: table = pa.table([pa.array([None, None], pa.null())], 
> schema=pa.schema([pa.field('a', pa.null(), nullable=False)]))
> In [19]: table
> Out[19]:
> pyarrow.Table
> a: null not null
> In [20]: pq.write_table(table, "test_null.parquet")
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F1128 14:08:30.267439 27560 column_writer.cc:837]  Check failed: (nullptr) != 
> (values)
> *** Check failure stack trace: ***
> Aborted (core dumped)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7273) [Python] Non-nullable null field is allowed / crashes when writing to parquet

2019-11-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7273:


 Summary: [Python] Non-nullable null field is allowed / crashes 
when writing to parquet
 Key: ARROW-7273
 URL: https://issues.apache.org/jira/browse/ARROW-7273
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche


It seems to be possible to create a "non-nullable null field". While this does 
not make any sense (so already a reason to disallow this, I think), this can 
also lead to crashes in further operations, such as writing to parquet:

{code}
In [18]: table = pa.table([pa.array([None, None], pa.null())], 
schema=pa.schema([pa.field('a', pa.null(), nullable=False)]))

In [19]: table
Out[19]:
pyarrow.Table
a: null not null

In [20]: pq.write_table(table, "test_null.parquet")
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1128 14:08:30.267439 27560 column_writer.cc:837]  Check failed: (nullptr) != 
(values)
*** Check failure stack trace: ***
Aborted (core dumped)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7209) [Python] tests with pandas master are failing now that __from_arrow__ support landed in pandas

2019-11-28 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-7209.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5867
[https://github.com/apache/arrow/pull/5867]

> [Python] tests with pandas master are failing now that __from_arrow__ support 
> landed in pandas
> -
>
> Key: ARROW-7209
> URL: https://issues.apache.org/jira/browse/ARROW-7209
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> I implemented the pandas <-> arrow roundtrip for pandas' integer and string 
> dtypes in https://github.com/pandas-dev/pandas/pull/29483, which is now merged. 
> But our tests were assuming this did not yet work in pandas, and thus need to 
> be updated.
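A minimal sketch of the roundtrip in question, assuming a pandas build whose 
nullable dtypes define __from_arrow__ and a pyarrow with the types_mapper hook; 
the names are only illustrative:

{code}
import pandas as pd
import pyarrow as pa

# Nullable integer data using pandas' extension dtype.
df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})

# pandas -> arrow keeps the missing value as a proper null.
table = pa.table(df)

# arrow -> pandas can restore the extension dtype via __from_arrow__,
# requested here explicitly through a types_mapper.
roundtripped = table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
assert roundtripped["a"].dtype == "Int64"
{code}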



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7059) [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x

2019-11-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7059:
-
Description: 
Reading Parquet files with a large number of columns still seems to be very slow 
in 0.15.1 compared to 0.14.1. I'm using the same test as in ARROW-6876, except I 
set {{use_threads=False}} to make for an apples-to-apples comparison with 
respect to # of CPUs.


{code}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
print(pa.__version__)
%time res = pq.read_table("test_wide.parquet", use_threads=False)
{code}

*In 0.14.1 with use_threads=False:*

{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}

*In 0.15.1 with use_threads=False:*

{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}

  was:
Reading Parquet files with a large number of columns still seems to be very slow 
in 0.15.1 compared to 0.14.1. I'm using the same test as in ARROW-6876, except I 
set {{use_threads=False}} to make for an apples-to-apples comparison with 
respect to # of CPUs.

{{import numpy as np}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}
{{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)})}}
{{pq.write_table(table, "test_wide.parquet")}}
{{res = pq.read_table("test_wide.parquet")}}
{{print(pa.__version__)}}
use_threads=False
{{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}

*In 0.14.1 with use_threads=False:*

{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}

*In 0.15.1 with use_threads=False:*

{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}


> [Python] Reading parquet file with many columns is much slower in 0.15.x 
> versus 0.14.x
> --
>
> Key: ARROW-7059
> URL: https://issues.apache.org/jira/browse/ARROW-7059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Linux OS with RHEL 7.7 distribution
> blkcqas037:~$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):32
> On-line CPU(s) list:   0-31
> Thread(s) per core:2
> Core(s) per socket:8
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 79
> Model name:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>Reporter: Eric Kisslinger
>Priority: Major
>  Labels: parquet, performance
> Fix For: 1.0.0
>
> Attachments: image-2019-11-06-08-18-42-783.png, 
> image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, 
> image-2019-11-06-08-25-05-885.png, image-2019-11-06-09-23-54-372.png, 
> image-2019-11-06-13-16-05-102.png
>
>
> Reading Parquet files with a large number of columns still seems to be very 
> slow in 0.15.1 compared to 0.14.1. I'm using the same test as in ARROW-6876, 
> except I set {{use_threads=False}} to make for an apples-to-apples comparison 
> with respect to # of CPUs.
> {code}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)})
> pq.write_table(table, "test_wide.parquet")
> res = pq.read_table("test_wide.parquet")
> print(pa.__version__)
> %time res = pq.read_table("test_wide.parquet", use_threads=False)
> {code}
> *In 0.14.1 with use_threads=False:*
> {{0.14.1}}
> {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
> {{Wall time: 525 ms}}
> *In 0.15.1 with use_threads=False:*
> {{0.15.1}}
> {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
> {{Wall time: 9.93 s}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7059) [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x

2019-11-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7059:
-
Description: 
Reading Parquet files with a large number of columns still seems to be very slow 
in 0.15.1 compared to 0.14.1. I'm using the same test as in ARROW-6876, except I 
set {{use_threads=False}} to make for an apples-to-apples comparison with 
respect to # of CPUs.


{code}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
print(pa.__version__)
%time res = pq.read_table("test_wide.parquet", use_threads=False)
{code}

*In 0.14.1 with use_threads=False:*

{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}

*In 0.15.1 with use_threads=False:*

{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}

  was:
Reading Parquet files with a large number of columns still seems to be very slow 
in 0.15.1 compared to 0.14.1. I'm using the same test as in ARROW-6876, except I 
set {{use_threads=False}} to make for an apples-to-apples comparison with 
respect to # of CPUs.


{code}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
print(pa.__version__)
%time res = pq.read_table("test_wide.parquet", use_threads=False)
{code}

*In 0.14.1 with use_threads=False:*

{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}

*In 0.15.1 with use_threads=False:*

{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}


> [Python] Reading parquet file with many columns is much slower in 0.15.x 
> versus 0.14.x
> --
>
> Key: ARROW-7059
> URL: https://issues.apache.org/jira/browse/ARROW-7059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Linux OS with RHEL 7.7 distribution
> blkcqas037:~$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):32
> On-line CPU(s) list:   0-31
> Thread(s) per core:2
> Core(s) per socket:8
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 79
> Model name:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>Reporter: Eric Kisslinger
>Priority: Major
>  Labels: parquet, performance
> Fix For: 1.0.0
>
> Attachments: image-2019-11-06-08-18-42-783.png, 
> image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, 
> image-2019-11-06-08-25-05-885.png, image-2019-11-06-09-23-54-372.png, 
> image-2019-11-06-13-16-05-102.png
>
>
> Reading Parquet files with a large number of columns still seems to be very 
> slow in 0.15.1 compared to 0.14.1. I'm using the same test as in ARROW-6876, 
> except I set {{use_threads=False}} to make for an apples-to-apples comparison 
> with respect to # of CPUs.
> {code}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)})
> pq.write_table(table, "test_wide.parquet")
> res = pq.read_table("test_wide.parquet")
> print(pa.__version__)
> %time res = pq.read_table("test_wide.parquet", use_threads=False)
> {code}
> *In 0.14.1 with use_threads=False:*
> {{0.14.1}}
> {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
> {{Wall time: 525 ms}}
> *In 0.15.1 with use_threads=False:*
> {{0.15.1}}
> {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
> {{Wall time: 9.93 s}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7059) [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x

2019-11-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7059:
-
Description: 
Reading Parquet files with a large number of columns still seems to be very slow 
in 0.15.1 compared to 0.14.1. I'm using the same test as in ARROW-6876, except I 
set {{use_threads=False}} to make for an apples-to-apples comparison with 
respect to # of CPUs.

{{import numpy as np}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}
{{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)})}}
{{pq.write_table(table, "test_wide.parquet")}}
{{res = pq.read_table("test_wide.parquet")}}
{{print(pa.__version__)}}
use_threads=False
{{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}

*In 0.14.1 with use_threads=False:*

{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}

*In 0.15.1 with use_threads=False:*

{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}

  was:
Reading Parquet files with a large number of columns still seems to be very slow 
in 0.15.1 compared to 0.14.1. I'm using the same test as in 
https://issues.apache.org/jira/browse/ARROW-6876, except I set 
{{use_threads=False}} to make for an apples-to-apples comparison with respect 
to # of CPUs.

{{import numpy as np}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}
{{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)})}}
{{pq.write_table(table, "test_wide.parquet")}}
{{res = pq.read_table("test_wide.parquet")}}
{{print(pa.__version__)}}
use_threads=False
{{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}

*In 0.14.1 with use_threads=False:*

{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}

*In 0.15.1 with use_threads=False:*

{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}


> [Python] Reading parquet file with many columns is much slower in 0.15.x 
> versus 0.14.x
> --
>
> Key: ARROW-7059
> URL: https://issues.apache.org/jira/browse/ARROW-7059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Linux OS with RHEL 7.7 distribution
> blkcqas037:~$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):32
> On-line CPU(s) list:   0-31
> Thread(s) per core:2
> Core(s) per socket:8
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 79
> Model name:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>Reporter: Eric Kisslinger
>Priority: Major
>  Labels: parquet, performance
> Fix For: 1.0.0
>
> Attachments: image-2019-11-06-08-18-42-783.png, 
> image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, 
> image-2019-11-06-08-25-05-885.png, image-2019-11-06-09-23-54-372.png, 
> image-2019-11-06-13-16-05-102.png
>
>
> Reading Parquet files with a large number of columns still seems to be very 
> slow in 0.15.1 compared to 0.14.1. I'm using the same test as in ARROW-6876, 
> except I set {{use_threads=False}} to make for an apples-to-apples comparison 
> with respect to # of CPUs.
> {{import numpy as np}}
> {{import pyarrow as pa}}
> {{import pyarrow.parquet as pq}}
> {{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in 
> range(1)})}}
> {{pq.write_table(table, "test_wide.parquet")}}
> {{res = pq.read_table("test_wide.parquet")}}
> {{print(pa.__version__)}}
> use_threads=False
> {{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}
> *In 0.14.1 with use_threads=False:*
> {{0.14.1}}
> {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
> {{Wall time: 525 ms}}
> *In 0.15.1 with use_threads=False:*
> {{0.15.1}}
> {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
> {{Wall time: 9.93 s}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) [Python] Reading parquet file with many columns becomes slow for 0.15.0

2019-11-28 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984268#comment-16984268
 ] 

Joris Van den Bossche commented on ARROW-6876:
--

The open issue about this is ARROW-7059

> [Python] Reading parquet file with many columns becomes slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7059) [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x

2019-11-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7059:
-
Labels: parquet performance  (was: performance)

> [Python] Reading parquet file with many columns is much slower in 0.15.x 
> versus 0.14.x
> --
>
> Key: ARROW-7059
> URL: https://issues.apache.org/jira/browse/ARROW-7059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Linux OS with RHEL 7.7 distribution
> blkcqas037:~$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):32
> On-line CPU(s) list:   0-31
> Thread(s) per core:2
> Core(s) per socket:8
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 79
> Model name:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>Reporter: Eric Kisslinger
>Priority: Major
>  Labels: parquet, performance
> Fix For: 1.0.0
>
> Attachments: image-2019-11-06-08-18-42-783.png, 
> image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, 
> image-2019-11-06-08-25-05-885.png, image-2019-11-06-09-23-54-372.png, 
> image-2019-11-06-13-16-05-102.png
>
>
> Reading Parquet files with a large number of columns still seems to be very 
> slow in 0.15.1 compared to 0.14.1. I'm using the same test as in 
> https://issues.apache.org/jira/browse/ARROW-6876, except I set 
> {{use_threads=False}} to make for an apples-to-apples comparison with respect 
> to # of CPUs.
> {{import numpy as np}}
> {{import pyarrow as pa}}
> {{import pyarrow.parquet as pq}}
> {{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in 
> range(1)})}}
> {{pq.write_table(table, "test_wide.parquet")}}
> {{res = pq.read_table("test_wide.parquet")}}
> {{print(pa.__version__)}}
> use_threads=False
> {{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}
> *In 0.14.1 with use_threads=False:*
> {{0.14.1}}
> {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
> {{Wall time: 525 ms}}
> *In 0.15.1 with use_threads=False:*
> {{0.15.1}}
> {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
> {{Wall time: 9.93 s}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7230) [C++] Use vendored std::optional instead of boost::optional in Gandiva

2019-11-28 Thread Pindikura Ravindra (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra reassigned ARROW-7230:
-

Assignee: Projjal Chanda

> [C++] Use vendored std::optional instead of boost::optional in Gandiva
> --
>
> Key: ARROW-7230
> URL: https://issues.apache.org/jira/browse/ARROW-7230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Gandiva
>Reporter: Wes McKinney
>Assignee: Projjal Chanda
>Priority: Major
>
> This may help with overall codebase consistency



--
This message was sent by Atlassian Jira
(v8.3.4#803005)