[jira] [Updated] (ARROW-7472) [Java] Fix some incorrect behavior in UnionListWriter
[ https://issues.apache.org/jira/browse/ARROW-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7472: -- Labels: pull-request-available (was: ) > [Java] Fix some incorrect behavior in UnionListWriter > - > > Key: ARROW-7472 > URL: https://issues.apache.org/jira/browse/ARROW-7472 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > > Currently the {{UnionListWriter/UnionFixedSizeListWriter}} {{getField/close}} > APIs seem incorrect. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7472) [Java] Fix some incorrect behavior in UnionListWriter
Ji Liu created ARROW-7472: - Summary: [Java] Fix some incorrect behavior in UnionListWriter Key: ARROW-7472 URL: https://issues.apache.org/jira/browse/ARROW-7472 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently the {{UnionListWriter/UnionFixedSizeListWriter}} {{getField/close}} APIs seem incorrect.
[jira] [Resolved] (ARROW-7386) [C#] Array offset does not work properly
[ https://issues.apache.org/jira/browse/ARROW-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-7386. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6029 [https://github.com/apache/arrow/pull/6029] > [C#] Array offset does not work properly > > > Key: ARROW-7386 > URL: https://issues.apache.org/jira/browse/ARROW-7386 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Reporter: Takashi Hashida >Assignee: Takashi Hashida >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > "Array.Values" always starts from first index of "ValueBuffer". > It should start from "Offset".
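The behavior described in ARROW-7386 can be illustrated with a small language-neutral sketch. This is not the C# implementation or the fix from pull request 6029; the class `SlicedArray` and its method names are hypothetical and exist only to show the difference between a view that starts at index 0 of the value buffer and one that starts at the array's offset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SlicedArray:
    """Hypothetical toy array that views part of a shared value buffer."""

    value_buffer: List[int]  # shared backing storage
    offset: int              # index of this array's first element in the buffer
    length: int              # number of elements visible through this array

    def values_ignoring_offset(self) -> List[int]:
        # Reported behavior: the view always starts at index 0 of the buffer.
        return self.value_buffer[: self.length]

    def values(self) -> List[int]:
        # Expected behavior: the view starts at `offset`.
        return self.value_buffer[self.offset : self.offset + self.length]


buffer = [10, 20, 30, 40, 50]
sliced = SlicedArray(value_buffer=buffer, offset=2, length=2)

print(sliced.values_ignoring_offset())  # [10, 20]
print(sliced.values())                  # [30, 40]
```

With an offset of 2 and a length of 2, only the second accessor returns the elements the slice is supposed to expose.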
[jira] [Resolved] (ARROW-7420) [C++] Migrate tensor related APIs to Result-returning version
[ https://issues.apache.org/jira/browse/ARROW-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-7420. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6070 [https://github.com/apache/arrow/pull/6070] > [C++] Migrate tensor related APIs to Result-returning version > - > > Key: ARROW-7420 > URL: https://issues.apache.org/jira/browse/ARROW-7420 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Trivial > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h >
[jira] [Commented] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002987#comment-17002987 ]

Wes McKinney commented on ARROW-7305:
-------------------------------------

I found a different but also concerning problem on macOS relating to ARROW-6994 and have just put up this patch https://github.com/apache/arrow/pull/6100

> [Python] High memory usage writing pyarrow.Table with large strings to parquet
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-7305
>                 URL: https://issues.apache.org/jira/browse/ARROW-7305
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Mac OSX
>            Reporter: Bogdan Klichuk
>            Priority: Major
>              Labels: parquet
>         Attachments: 50mb.csv.gz
>
> My case of datasets stored is specific. I have large strings (1-100MB each).
> Let's take for example a single row.
> 43mb.csv is a 1-row CSV with 10 columns. One column a 43mb string.
> When I read this csv with pandas and then dump to parquet, my script consumes 10x of the 43mb.
> With increasing amount of such rows memory footprint overhead diminishes, but I want to focus on this specific case.
> Here's the footprint after running using memory profiler:
> {code:java}
> Line #    Mem usage    Increment   Line Contents
>
>      4     48.9 MiB     48.9 MiB   @profile
>      5                             def test():
>      6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
>      7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
> {code}
> Is this typical for parquet in case of big strings?
[jira] [Updated] (ARROW-6994) [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable
[ https://issues.apache.org/jira/browse/ARROW-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6994: -- Labels: pull-request-available (was: ) > [C++] Research jemalloc memory page reclamation configuration on macOS when > background_thread option is unavailable > --- > > Key: ARROW-6994 > URL: https://issues.apache.org/jira/browse/ARROW-6994 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > In ARROW-6977, this was disabled on macOS, but this will potentially have > negative performance and memory implications that were intended to have been > fixed in ARROW-6910
[jira] [Created] (ARROW-7471) [Python] Cython flake8 failures
Wes McKinney created ARROW-7471:
-----------------------------------

             Summary: [Python] Cython flake8 failures
                 Key: ARROW-7471
                 URL: https://issues.apache.org/jira/browse/ARROW-7471
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Wes McKinney
             Fix For: 1.0.0

Observed on master

{code}
0
pyarrow/_dataset.pyx:216:80: E501 line too long (84 > 79 characters)
pyarrow/_dataset.pyx:232:80: E501 line too long (96 > 79 characters)
pyarrow/_dataset.pyx:239:80: E501 line too long (83 > 79 characters)
pyarrow/_dataset.pyx:317:35: E251 unexpected spaces around keyword / parameter equals
pyarrow/_dataset.pyx:317:37: E251 unexpected spaces around keyword / parameter equals
pyarrow/includes/libarrow_dataset.pxd:261:38: E126 continuation line over-indented for hanging indent
pyarrow/includes/libarrow_dataset.pxd:293:80: E501 line too long (87 > 79 characters)
pyarrow/includes/libarrow_dataset.pxd:313:80: E501 line too long (85 > 79 characters)
8
{code}

It seems the Cython flake8 checks aren't being performed in CI
[jira] [Commented] (ARROW-6994) [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable
[ https://issues.apache.org/jira/browse/ARROW-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002984#comment-17002984 ]

Wes McKinney commented on ARROW-6994:
-------------------------------------

I tried my test script from ARROW-7305 (requires the file attached there) on macOS and essentially macOS leaks memory. Really bad

{code}
$ python arrow7305.py
Starting RSS: 67297280
Read CSV RSS: 179224576
Wrote Parquet RSS: 645177344
Waited 1 second RSS: 645177344
Read CSV RSS: 703504384
Wrote Parquet RSS: 707674112
Waited 1 second RSS: 707674112
Read CSV RSS: 760279040
Wrote Parquet RSS: 764452864
Waited 1 second RSS: 764452864
Read CSV RSS: 826171392
Wrote Parquet RSS: 826232832
Waited 1 second RSS: 826232832
Read CSV RSS: 887971840
Wrote Parquet RSS: 892026880
Waited 1 second RSS: 892026880
Read CSV RSS: 942624768
Wrote Parquet RSS: 942669824
Waited 1 second RSS: 942669824
Read CSV RSS: 993071104
Wrote Parquet RSS: 993071104
Waited 1 second RSS: 993071104
Read CSV RSS: 1043841024
Wrote Parquet RSS: 1046388736
Waited 1 second RSS: 1046388736
Read CSV RSS: 1096822784
Wrote Parquet RSS: 1096822784
Waited 1 second RSS: 1096822784
Read CSV RSS: 1147224064
Wrote Parquet RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
Waited 1 second RSS: 1147301888
{code}

I'm seeing about a fix

> [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6994
>                 URL: https://issues.apache.org/jira/browse/ARROW-6994
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
> In ARROW-6977, this was disabled on macOS, but this will potentially have negative performance and memory implications that were intended to have been fixed in ARROW-6910
[jira] [Assigned] (ARROW-6994) [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable
[ https://issues.apache.org/jira/browse/ARROW-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6994: --- Assignee: Wes McKinney > [C++] Research jemalloc memory page reclamation configuration on macOS when > background_thread option is unavailable > --- > > Key: ARROW-6994 > URL: https://issues.apache.org/jira/browse/ARROW-6994 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > In ARROW-6977, this was disabled on macOS, but this will potentially have > negative performance and memory implications that were intended to have been > fixed in ARROW-6910
[jira] [Commented] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002975#comment-17002975 ]

Wes McKinney commented on ARROW-7305:
-------------------------------------

See the following script https://gist.github.com/wesm/193f644d10b5aee8c258b8f4f81c5161

Here is the output for me (off of master branch, I assume 0.15.1 is the same)

{code}
$ python arrow7305.py
Starting RSS: 102367232
Read CSV RSS: 154279936
Wrote Parquet RSS: 522485760
Waited 1 second RSS: 161763328
Read CSV RSS: 164732928
Wrote Parquet RSS: 528371712
Waited 1 second RSS: 226361344
Read CSV RSS: 167698432
Wrote Parquet RSS: 528502784
Waited 1 second RSS: 226492416
Read CSV RSS: 172175360
Wrote Parquet RSS: 532971520
Waited 1 second RSS: 230961152
Read CSV RSS: 172093440
Wrote Parquet RSS: 532889600
Waited 1 second RSS: 230879232
Read CSV RSS: 230940672
Wrote Parquet RSS: 532992000
Waited 1 second RSS: 230981632
Read CSV RSS: 232812544
Wrote Parquet RSS: 534822912
Waited 1 second RSS: 232812544
Read CSV RSS: 235274240
Wrote Parquet RSS: 537608192
Waited 1 second RSS: 235577344
Read CSV RSS: 236883968
Wrote Parquet RSS: 531349504
Waited 1 second RSS: 229318656
Read CSV RSS: 231157760
Wrote Parquet RSS: 533168128
Waited 1 second RSS: 231157760
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
{code}

Here is the output from 0.14.1

{code}
$ python arrow7305.py
Starting RSS: 74477568
Read CSV RSS: 126550016
Wrote Parquet RSS: 129470464
Waited 1 second RSS: 129470464
Read CSV RSS: 132321280
Wrote Parquet RSS: 135151616
Waited 1 second RSS: 135151616
Read CSV RSS: 135155712
Wrote Parquet RSS: 133169152
Waited 1 second RSS: 133169152
Read CSV RSS: 135159808
Wrote Parquet RSS: 133230592
Waited 1 second RSS: 133230592
Read CSV RSS: 135217152
Wrote Parquet RSS: 135217152
Waited 1 second RSS: 135217152
Read CSV RSS: 139567104
Wrote Parquet RSS: 139567104
Waited 1 second RSS: 139567104
Read CSV RSS: 141398016
Wrote Parquet RSS: 133378048
Waited 1 second RSS: 133378048
Read CSV RSS: 137068544
Wrote Parquet RSS: 133234688
Waited 1 second RSS: 133234688
Read CSV RSS: 135221248
Wrote Parquet RSS: 135221248
Waited 1 second RSS: 135221248
Read CSV RSS: 139567104
Wrote Parquet RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
{code}

I've only begun to investigate but these changes have to do with the jemalloc version upgrade and the changes that we made to configuration options. I don't know what is causing the ~30-40MB difference in the baseline memory usage, though (could be differences in aggregate shared library sizes).

We changed memory page management to be performed in the background which means that memory is not released to the OS immediately as it was before but rather on a short time delay as you can see. The basic idea is that requesting memory from the operating system is expensive, and so jemalloc is being a bit greedy about holding on to memory for a short period of time because applications that use a lot of memory often continue using a lot of memory, and this will result in improved performance.

An alternative to our current configuration would be to disable the background_thread option and set the decay_ms to 0. This would likely yield worse performance in some applications.

We're having to strike a delicate balance between having a piece of software that performs well in real world scenarios while also offering predictable resource utilization. It is hard to satisfy everyone.

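To make the delayed release described above easier to see, the RSS log can be converted into MiB with per-step deltas. The following is a minimal sketch, not the script from the gist: the helper names `parse_rss_log` and `to_mib` are hypothetical, and the sample input is only the first read/write/wait cycle of the master-branch output quoted in this comment.

```python
import re
from typing import List, Tuple

# First cycle of the master-branch log quoted in the comment.
LOG = """\
Starting RSS: 102367232
Read CSV RSS: 154279936
Wrote Parquet RSS: 522485760
Waited 1 second RSS: 161763328
"""

# Each line has the form "<step> RSS: <bytes>".
LINE_RE = re.compile(r"^(?P<step>.+?) RSS: (?P<rss>\d+)$")


def parse_rss_log(text: str) -> List[Tuple[str, int]]:
    """Return (step, rss_in_bytes) pairs, skipping lines that do not match."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append((match.group("step"), int(match.group("rss"))))
    return entries


def to_mib(num_bytes: int) -> float:
    return num_bytes / (1024 * 1024)


entries = parse_rss_log(LOG)

previous = None
for step, rss in entries:
    delta = 0 if previous is None else rss - previous
    print(f"{step:<16} {to_mib(rss):8.1f} MiB  ({to_mib(delta):+8.1f} MiB)")
    previous = rss

# Bytes returned to the OS between the Parquet write and the one-second wait.
released = entries[2][1] - entries[3][1]
print(f"released after wait: {to_mib(released):.1f} MiB")
```

Running the same parser over the 0.14.1 log would show the RSS unchanged between "Wrote Parquet" and "Waited 1 second", since RSS never spikes during the write in that version.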
> [Python] High memory usage writing pyarrow.Table with large strings to parquet > -- > > Key: ARROW-7305 > URL: https://issues.apache.org/jira/browse/ARROW-7305 > Project: Apache Arrow > Issue Type: Task > Components: Python >Affects Versions: 0.15.1 > Environment: Mac OSX >Reporter: Bogdan Klichuk >Priority: Major > Labels: parquet > Attachments: 50mb.csv.gz > > > My case of datasets stored is specific. I have large strings (1-100MB each). > Let's take for example a single row. > 43mb.csv is a 1-row CSV with 10 columns. One column a 43mb string. > When I read this csv with pandas and then dump to parquet, my script consumes > 10x of the 43mb. > With increasing amount of such rows memory footprint overhead diminishes, but >
[jira] [Commented] (ARROW-6718) [Rust] packed_simd requires nightly
[ https://issues.apache.org/jira/browse/ARROW-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002842#comment-17002842 ]

Paddy Horan commented on ARROW-6718:
------------------------------------

I would love to get back on stable. I added a feature to disable explicit SIMD to try to make progress toward this goal, although the main thing we need is specialization on stable. If we can get the same level of performance then I'm all for removing packed_simd.

At the time we adopted it, the author was trying to get it adopted into std. Since then he has stopped driving this forward until other features land. I'll take a look in the next few days to compare performance, etc.

> [Rust] packed_simd requires nightly
> -----------------------------------
>
>                 Key: ARROW-6718
>                 URL: https://issues.apache.org/jira/browse/ARROW-6718
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Rust
>            Reporter: Andy Grove
>            Priority: Major
>
> See [https://github.com/rust-lang/rfcs/pull/2366] for more info on stabilization of this crate.
>
> {code:java}
> error[E0554]: `#![feature]` may not be used on the stable release channel
>    --> /home/andy/.cargo/registry/src/github.com-1ecc6299db9ec823/packed_simd-0.3.3/src/lib.rs:202:1
>     |
> 202 | / #![feature(
> 203 | |     repr_simd,
> 204 | |     const_fn,
> 205 | |     platform_intrinsics,
> ...   |
> 215 | |     custom_inner_attributes
> 216 | | )]
>     | |__^
> {code}
[jira] [Resolved] (ARROW-7466) [CI][Java] Fix gandiva-jar-osx nightly build failure
[ https://issues.apache.org/jira/browse/ARROW-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen Kumar resolved ARROW-7466. -- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6099 [https://github.com/apache/arrow/pull/6099] > [CI][Java] Fix gandiva-jar-osx nightly build failure > > > Key: ARROW-7466 > URL: https://issues.apache.org/jira/browse/ARROW-7466 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > The gandiva-jar-osx nightly build has been failing for the past few days. From > [https://github.com/google/error-prone/issues/1441] the issue seems to be that > error-prone version 2.3.3, which is currently used, is incompatible with the Java 13 that is > being used in the nightly build. Updating it to 2.3.4 should fix this.
[jira] [Updated] (ARROW-7470) [JS] Fix typos
[ https://issues.apache.org/jira/browse/ARROW-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7470: -- Labels: pull-request-available (was: ) > [JS] Fix typos > -- > > Key: ARROW-7470 > URL: https://issues.apache.org/jira/browse/ARROW-7470 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Affects Versions: 1.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > Labels: pull-request-available > > Fix typos in files under js directory
[jira] [Created] (ARROW-7470) [JS] Fix typos
Kazuaki Ishizaki created ARROW-7470: --- Summary: [JS] Fix typos Key: ARROW-7470 URL: https://issues.apache.org/jira/browse/ARROW-7470 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Affects Versions: 1.0.0 Reporter: Kazuaki Ishizaki Fix typos in files under js directory