[jira] [Created] (ARROW-7278) [C++][Gandiva] Implement Boyer-Moore string search algorithm for functions doing string matching
Projjal Chanda created ARROW-7278: - Summary: [C++][Gandiva] Implement Boyer-Moore string search algorithm for functions doing string matching Key: ARROW-7278 URL: https://issues.apache.org/jira/browse/ARROW-7278 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Projjal Chanda Assignee: Projjal Chanda Discussed in https://github.com/apache/arrow/pull/5902#discussion_r351159392 -- This message was sent by Atlassian Jira (v8.3.4#803005)
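For context on the linked discussion: the gain over a naive scan comes from the bad-character skip table. Below is an illustrative plain-Python sketch of the Boyer-Moore-Horspool variant (a common simplification of full Boyer-Moore), not Gandiva's actual C++ implementation; the function name `horspool_find` is hypothetical.

```python
def horspool_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.

    Boyer-Moore-Horspool: align the pattern with a window of text and,
    on a mismatch, shift the window by the skip distance of the text
    character under the pattern's last position, skipping comparisons
    a naive scan would perform.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Bad-character table: for each char in pattern (except the last),
    # how far the window may shift when that char ends the window.
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        # Characters absent from the pattern allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

A Gandiva version would operate on raw utf8 buffers in C++/LLVM-generated code; the sketch only shows the algorithmic skeleton functions like `locate`/`strpos` could reuse.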
[jira] [Created] (ARROW-7277) [Document] Add discussion about vector lifecycle
Liya Fan created ARROW-7277: --- Summary: [Document] Add discussion about vector lifecycle Key: ARROW-7277 URL: https://issues.apache.org/jira/browse/ARROW-7277 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan As discussed in https://issues.apache.org/jira/browse/ARROW-7254?focusedCommentId=16983284=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16983284, we need a discussion about the lifecycle of a vector. Each vector has a lifecycle, and different operations should be performed in particular phases of the lifecycle. If we violate this, some unexpected results may be produced. This may cause some confusion for Arrow users. So we want to add a new section to the prose document, to make it clear and explicit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7276) [Ruby] Add support for building Arrow::ListArray from [[...]]
[ https://issues.apache.org/jira/browse/ARROW-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7276: -- Labels: pull-request-available (was: ) > [Ruby] Add support for building Arrow::ListArray from [[...]] > - > > Key: ARROW-7276 > URL: https://issues.apache.org/jira/browse/ARROW-7276 > Project: Apache Arrow > Issue Type: Improvement > Components: Ruby >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7276) [Ruby] Add support for building Arrow::ListArray from [[...]]
Kouhei Sutou created ARROW-7276: --- Summary: [Ruby] Add support for building Arrow::ListArray from [[...]] Key: ARROW-7276 URL: https://issues.apache.org/jira/browse/ARROW-7276 Project: Apache Arrow Issue Type: Improvement Components: Ruby Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
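Building a list array from nested `[[...]]` input boils down to flattening the values and recording row boundaries. As a minimal Python sketch of the Arrow variable-size list layout (offsets buffer + child values + validity), not the Ruby `Arrow::ListArray` builder API itself; `build_list_layout` is a hypothetical helper name:

```python
def build_list_layout(rows):
    """Flatten a list-of-lists into Arrow-style list-array buffers.

    Returns (offsets, values, validity): row i spans
    values[offsets[i]:offsets[i+1]], so offsets has len(rows) + 1
    entries. A None row contributes no values and is marked invalid
    (a real array stores validity as a bitmap, not a bool list).
    """
    offsets = [0]
    values = []
    validity = []
    for row in rows:
        if row is None:
            validity.append(False)
        else:
            validity.append(True)
            values.extend(row)
        # Every row, null or not, closes with the current value count.
        offsets.append(len(values))
    return offsets, values, validity
```

This mirrors what a builder accepting `[[1, 2], [], nil, [3]]` would have to produce internally.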
[jira] [Created] (ARROW-7275) [Ruby] Add support for Arrow::ListDataType.new(data_type)
Kouhei Sutou created ARROW-7275: --- Summary: [Ruby] Add support for Arrow::ListDataType.new(data_type) Key: ARROW-7275 URL: https://issues.apache.org/jira/browse/ARROW-7275 Project: Apache Arrow Issue Type: Improvement Components: Ruby Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7275) [Ruby] Add support for Arrow::ListDataType.new(data_type)
[ https://issues.apache.org/jira/browse/ARROW-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7275: -- Labels: pull-request-available (was: ) > [Ruby] Add support for Arrow::ListDataType.new(data_type) > - > > Key: ARROW-7275 > URL: https://issues.apache.org/jira/browse/ARROW-7275 > Project: Apache Arrow > Issue Type: Improvement > Components: Ruby >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984568#comment-16984568 ] Andy Thomason commented on ARROW-5949: -- I've implemented this in two of our internal I/O libraries at work and should be able to help out if I get the time. I've sent a test generator to Andy which should help. We have a huge repository of Arrow files to test it on. > [Rust] Implement DictionaryArray > > > Key: ARROW-5949 > URL: https://issues.apache.org/jira/browse/ARROW-5949 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: David Atienza >Priority: Major > > I am pretty new to the codebase, but I have seen that DictionaryArray is not > implemented in the Rust implementation. > I went through the list of issues and I could not see any work on this. Is > there any blocker? > > The specification is a bit > [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] > or even > [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding], > so I am not sure how to implement it myself. -- This message was sent by Atlassian Jira (v8.3.4#803005)
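For anyone picking this issue up: dictionary encoding stores each distinct value once in a values array and represents the data itself as an array of indices into it. A minimal Python sketch of the encoding step (the Rust `DictionaryArray` would hold these as an Arrow index array plus a dictionary values array; `dictionary_encode` is a hypothetical name, not the Rust API):

```python
def dictionary_encode(values):
    """Dictionary-encode a sequence.

    Returns (indices, dictionary): dictionary holds each distinct value
    once, in first-seen order, and indices[i] is the dictionary
    position of values[i]. This is the logical layout a DictionaryArray
    materializes as two Arrow arrays.
    """
    dictionary = []
    positions = {}  # value -> its index in the dictionary
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return indices, dictionary
```

The win is for low-cardinality columns: long repeated strings collapse to small integer indices, which is why string-heavy Arrow files benefit most.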
[jira] [Created] (ARROW-7274) [C++] Add Result APIs to Decimal class
Antoine Pitrou created ARROW-7274: - Summary: [C++] Add Result APIs to Decimal class Key: ARROW-7274 URL: https://issues.apache.org/jira/browse/ARROW-7274 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7248) [Rust] Automatically Regenerate IPC messages from Flatbuffers
[ https://issues.apache.org/jira/browse/ARROW-7248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-7248: -- Summary: [Rust] Automatically Regenerate IPC messages from Flatbuffers (was: Automatically Regenerate IPC messages from Flatbuffers) > [Rust] Automatically Regenerate IPC messages from Flatbuffers > - > > Key: ARROW-7248 > URL: https://issues.apache.org/jira/browse/ARROW-7248 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Martin Grund >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > It would be great if there was an automatic way to regenerate the code for > the Flatbuffer input files. This makes following the mainline development > easier. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6891) [Rust] [Parquet] Add Utf8 support to ArrowReader
[ https://issues.apache.org/jira/browse/ARROW-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6891: -- Labels: pull-request-available (was: ) > [Rust] [Parquet] Add Utf8 support to ArrowReader > - > > Key: ARROW-6891 > URL: https://issues.apache.org/jira/browse/ARROW-6891 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Affects Versions: 1.0.0 >Reporter: Andy Grove >Assignee: Renjie Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Add Utf8 support to ArrowReader -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7273) [Python] Non-nullable null field is allowed / crashes when writing to parquet
[ https://issues.apache.org/jira/browse/ARROW-7273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7273: - Labels: parquet (was: ) > [Python] Non-nullable null field is allowed / crashes when writing to parquet > - > > Key: ARROW-7273 > URL: https://issues.apache.org/jira/browse/ARROW-7273 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: parquet > > It seems to be possible to create a "non-nullable null field". While this > does not make any sense (so already a reason to disallow this, I think), this > can also lead to crashes in further operations, such as writing to parquet: > {code} > In [18]: table = pa.table([pa.array([None, None], pa.null())], > schema=pa.schema([pa.field('a', pa.null(), nullable=False)])) > In [19]: table > Out[19]: > pyarrow.Table > a: null not null > In [20]: pq.write_table(table, "test_null.parquet") > WARNING: Logging before InitGoogleLogging() is written to STDERR > F1128 14:08:30.267439 27560 column_writer.cc:837] Check failed: (nullptr) != > (values) > *** Check failure stack trace: *** > Aborted (core dumped) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7273) [Python] Non-nullable null field is allowed / crashes when writing to parquet
Joris Van den Bossche created ARROW-7273: Summary: [Python] Non-nullable null field is allowed / crashes when writing to parquet Key: ARROW-7273 URL: https://issues.apache.org/jira/browse/ARROW-7273 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche It seems to be possible to create a "non-nullable null field". While this does not make any sense (so already a reason to disallow this, I think), this can also lead to crashes in further operations, such as writing to parquet: {code} In [18]: table = pa.table([pa.array([None, None], pa.null())], schema=pa.schema([pa.field('a', pa.null(), nullable=False)])) In [19]: table Out[19]: pyarrow.Table a: null not null In [20]: pq.write_table(table, "test_null.parquet") WARNING: Logging before InitGoogleLogging() is written to STDERR F1128 14:08:30.267439 27560 column_writer.cc:837] Check failed: (nullptr) != (values) *** Check failure stack trace: *** Aborted (core dumped) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7209) [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas
[ https://issues.apache.org/jira/browse/ARROW-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-7209. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5867 [https://github.com/apache/arrow/pull/5867] > [Python] tests with pandas master are failing now __from_arrow__ support > landed in pandas > - > > Key: ARROW-7209 > URL: https://issues.apache.org/jira/browse/ARROW-7209 > Project: Apache Arrow > Issue Type: Test > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 6h 10m > Remaining Estimate: 0h > > I implemented pandas <-> arrow roundtrip for pandas' integer+string dtype in > https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But our > tests were assuming this did not yet work in pandas, and thus need to be > updated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7059) [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x
[ https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7059: - Description: Reading Parquet files with a large number of columns still seems to be very slow in 0.15.1 compared to 0.14.1. I am using the same test used in ARROW-6876 except I set {{use_threads=False}} to make for an apples-to-apples comparison with respect to # of CPUs. {code} import numpy as np import pyarrow as pa import pyarrow.parquet as pq table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)}) pq.write_table(table, "test_wide.parquet") res = pq.read_table("test_wide.parquet") print(pa.__version__) %time res = pq.read_table("test_wide.parquet", use_threads=False) {code} *In 0.14.1 with use_threads=False:* {{0.14.1}} {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}} {{Wall time: 525 ms}} *In 0.15.1 with use_threads=False:* {{0.15.1}} {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}} {{Wall time: 9.93 s}} was: Reading Parquet files with a large number of columns still seems to be very slow in 0.15.1 compared to 0.14.1. I am using the same test used in ARROW-6876 except I set {{use_threads=False}} to make for an apples-to-apples comparison with respect to # of CPUs. 
{{import numpy as np}} {{import pyarrow as pa}} {{import pyarrow.parquet as pq}} {{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)})}} {{pq.write_table(table, "test_wide.parquet")}} {{res = pq.read_table("test_wide.parquet")}} {{print(pa.__version__)}} use_threads=False {{%time res = pq.read_table("test_wide.parquet", use_threads=False)}} *In 0.14.1 with use_threads=False:* {{0.14.1}} {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}} {{Wall time: 525 ms}} *In 0.15.1 with use_threads=False:* {{0.15.1}} {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}} {{Wall time: 9.93 s}} > [Python] Reading parquet file with many columns is much slower in 0.15.x > versus 0.14.x > -- > > Key: ARROW-7059 > URL: https://issues.apache.org/jira/browse/ARROW-7059 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 > Environment: Linux OS with RHEL 7.7 distribution > blkcqas037:~$ lscpu > Architecture: x86_64 > CPU op-mode(s):32-bit, 64-bit > Byte Order:Little Endian > CPU(s):32 > On-line CPU(s) list: 0-31 > Thread(s) per core:2 > Core(s) per socket:8 > Socket(s): 2 > NUMA node(s): 2 > Vendor ID: GenuineIntel > CPU family:6 > Model: 79 > Model name:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz >Reporter: Eric Kisslinger >Priority: Major > Labels: parquet, performance > Fix For: 1.0.0 > > Attachments: image-2019-11-06-08-18-42-783.png, > image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, > image-2019-11-06-08-25-05-885.png, image-2019-11-06-09-23-54-372.png, > image-2019-11-06-13-16-05-102.png > > > Reading Parquet files with a large number of columns still seems to be very > slow in 0.15.1 compared to 0.14.1. I am using the same test used in ARROW-6876 > except I set {{use_threads=False}} to make for an apples-to-apples comparison > with respect to # of CPUs. 
> {code} > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(1)}) > pq.write_table(table, "test_wide.parquet") > res = pq.read_table("test_wide.parquet") > print(pa.__version__) > %time res = pq.read_table("test_wide.parquet", use_threads=False) > {code} > *In 0.14.1 with use_threads=False:* > {{0.14.1}} > {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}} > {{Wall time: 525 ms}} > *In 0.15.1 with use_threads=False:* > {{0.15.1}} > {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}} > {{Wall time: 9.93 s}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6876) [Python] Reading parquet file with many columns becomes slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984268#comment-16984268 ] Joris Van den Bossche commented on ARROW-6876: -- The open issue about this is ARROW-7059 > [Python] Reading parquet file with many columns becomes slow for 0.15.0 > --- > > Key: ARROW-6876 > URL: https://issues.apache.org/jira/browse/ARROW-6876 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: python3.7 >Reporter: Bob >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.15.1 > > Attachments: image-2019-10-14-18-10-42-850.png, > image-2019-10-14-18-12-07-652.png > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Hi, > > I just noticed that reading a parquet file becomes really slow after I > upgraded to 0.15.0 when using pandas. > > Example: > *With 0.14.1* > In [4]: %timeit df = pd.read_parquet(path) > 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) > *With 0.15.0* > In [5]: %timeit df = pd.read_parquet(path) > 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) > > The file is about 15MB in size. I am testing on the same machine using the > same version of python and pandas. > > Have you received similar complaints? What could be the issue here? > > Thanks a lot. > > > Edit1: > Some profiling I did: > 0.14.1: > !image-2019-10-14-18-12-07-652.png! > > 0.15.0: > !image-2019-10-14-18-10-42-850.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7059) [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x
[ https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7059: - Labels: parquet performance (was: performance) > [Python] Reading parquet file with many columns is much slower in 0.15.x > versus 0.14.x > -- > > Key: ARROW-7059 > URL: https://issues.apache.org/jira/browse/ARROW-7059 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 > Environment: Linux OS with RHEL 7.7 distribution > blkcqas037:~$ lscpu > Architecture: x86_64 > CPU op-mode(s):32-bit, 64-bit > Byte Order:Little Endian > CPU(s):32 > On-line CPU(s) list: 0-31 > Thread(s) per core:2 > Core(s) per socket:8 > Socket(s): 2 > NUMA node(s): 2 > Vendor ID: GenuineIntel > CPU family:6 > Model: 79 > Model name:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz >Reporter: Eric Kisslinger >Priority: Major > Labels: parquet, performance > Fix For: 1.0.0 > > Attachments: image-2019-11-06-08-18-42-783.png, > image-2019-11-06-08-19-11-662.png, image-2019-11-06-08-23-18-897.png, > image-2019-11-06-08-25-05-885.png, image-2019-11-06-09-23-54-372.png, > image-2019-11-06-13-16-05-102.png > > > Reading Parquet files with a large number of columns still seems to be very > slow in 0.15.1 compared to 0.14.1. I am using the same test used in > https://issues.apache.org/jira/browse/ARROW-6876 except I set > {{use_threads=False}} to make for an apples-to-apples comparison with respect > to # of CPUs. 
> {{import numpy as np}} > {{import pyarrow as pa}} > {{import pyarrow.parquet as pq}} > {{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in > range(1)})}} > {{pq.write_table(table, "test_wide.parquet")}} > {{res = pq.read_table("test_wide.parquet")}} > {{print(pa.__version__)}} > use_threads=False > {{%time res = pq.read_table("test_wide.parquet", use_threads=False)}} > *In 0.14.1 with use_threads=False:* > {{0.14.1}} > {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}} > {{Wall time: 525 ms}} > *In 0.15.1 with use_threads=False:* > {{0.15.1}} > {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}} > {{Wall time: 9.93 s}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7230) [C++] Use vendored std::optional instead of boost::optional in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra reassigned ARROW-7230: - Assignee: Projjal Chanda > [C++] Use vendored std::optional instead of boost::optional in Gandiva > -- > > Key: ARROW-7230 > URL: https://issues.apache.org/jira/browse/ARROW-7230 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Gandiva >Reporter: Wes McKinney >Assignee: Projjal Chanda >Priority: Major > > This may help with overall codebase consistency -- This message was sent by Atlassian Jira (v8.3.4#803005)