[jira] [Commented] (ARROW-4716) [Benchmarking] Make machine detection script cross-platform
[ https://issues.apache.org/jira/browse/ARROW-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816962#comment-16816962 ]

Tanya Schlusser commented on ARROW-4716:
----------------------------------------

Linux first is a great choice. Python is my favorite language and I am happy to do this. The existing shell script was only because it was easy and quick. I am very, very sorry for dropping the ball the past couple of months. My mom passed away and I have been a total wreck (and moved back home from her house and got a job), but I still want to contribute and hope you will accept me back now. You all rock very much!

> [Benchmarking] Make machine detection script cross-platform
> ------------------------------------------------------------
>
>                 Key: ARROW-4716
>                 URL: https://issues.apache.org/jira/browse/ARROW-4716
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Benchmarking
>            Reporter: Antoine Pitrou
>            Priority: Major
>
> The machine detection script ({{make_machine_json.sh}}) currently looks like
> it will only work properly on macOS. Ideally it should work more or less
> correctly on all of macOS, Linux and Windows (some values may remain
> undetected on some platforms).
> This probably entails:
> - switching to Python rather than bash
> - using something like [psutil|https://psutil.readthedocs.io/en/latest/] to
> grab useful machine information
> - calling {{nvidia-smi}} to query GPU characteristics (for example
> "nvidia-smi -q -i 0 -x")

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
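The Python rewrite suggested in the issue could start from the standard library alone; this is only a sketch of the idea, not the project's actual script. psutil would add memory and cache details, and `nvidia-smi -q -x` output would fill in the GPU field where available. Field names here are illustrative guesses.

```python
import json
import multiprocessing
import platform

def make_machine_json():
    """Collect basic machine facts; fields stay None when undetectable."""
    info = {
        "os_name": platform.system(),        # 'Linux', 'Darwin', or 'Windows'
        "os_version": platform.release(),
        "architecture": platform.machine(),
        "cpu_model": platform.processor() or None,  # often empty on Linux
        "cpu_core_count": multiprocessing.cpu_count(),
        # Would be parsed from `nvidia-smi -q -x` when the tool is present:
        "gpu_info": None,
    }
    return json.dumps(info, indent=2)

print(make_machine_json())
```

Because every value comes from `platform` or `multiprocessing`, the same code runs unchanged on macOS, Linux, and Windows, with undetectable values simply left null as the issue allows.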
[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785957#comment-16785957 ]

Tanya Schlusser commented on ARROW-3543:
----------------------------------------

<3 Thank you Olaf!

> [R] Time zone adjustment issue when reading Feather file written by Python
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-3543
>                 URL: https://issues.apache.org/jira/browse/ARROW-3543
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Olaf
>            Priority: Critical
>             Fix For: 0.13.0
>
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this:
>
> {code:python}
> import pandas as pd
> import feather
> import numpy as np
>
> df = pd.DataFrame(
>     {'string_time_utc': [pd.to_datetime('2018-02-01 14:00:00.531'),
>                          pd.to_datetime('2018-02-01 14:01:00.456'),
>                          pd.to_datetime('2018-03-05 14:01:02.200')]}
> )
> df['timestamp_est'] = (pd.to_datetime(df.string_time_utc)
>                          .dt.tz_localize('UTC')
>                          .dt.tz_convert('US/Eastern')
>                          .dt.tz_localize(None))
> df
> Out[17]:
>           string_time_utc           timestamp_est
> 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
> 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
>
> Here I create the corresponding `EST` timestamp of my original timestamps
> (in `UTC` time). Now saving the dataframe to `csv` or to `feather` will
> generate two completely different results.
>
> {code:python}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
>
> Switching to R. Using the good old `csv` gives me something a bit annoying,
> but expected. R thinks my timezone is `UTC` by default, and wrongly attaches
> this timezone to `timestamp_est`. No big deal, I can always use `with_tz`
> or, even better, import as character and process as timestamp while in R.
>
> {code:r}
> > dataframe <- read_csv('P://testing.csv')
> Parsed with column specification:
> cols(
>   X1 = col_integer(),
>   string_time_utc = col_datetime(format = ""),
>   timestamp_est = col_datetime(format = "")
> )
> Warning message:
> Missing column names filled in: 'X1' [1]
>
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 4
>      X1         string_time_utc           timestamp_est
> 1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530
> 2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
>   mytimezone
> 1        UTC
> 2        UTC
> 3        UTC
> {code}
>
> {code:r}
> # Now look at what happens with feather:
> > dataframe <- read_feather('P://testing.feather')
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 3
>           string_time_utc           timestamp_est mytimezone
> 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531         ""
> 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456         ""
> 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200         ""
> {code}
>
> My timestamps have been converted!!! Pure insanity.
> Am I missing something here?
> Thanks!!
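The core of the confusion above can be shown in pandas alone, without Feather or R involved. This is not the Arrow/Feather fix, just an illustration of why stripping the zone with `tz_localize(None)` loses information that no reader can recover, and of the defensive alternative of keeping the column tz-aware:

```python
import pandas as pd

# Naive Eastern wall-clock times carry no zone info, so any reader
# (R, Feather, anything) has to guess what zone they are in.
utc = pd.to_datetime(["2018-02-01 14:00:00.531",
                      "2018-03-05 14:01:02.200"]).tz_localize("UTC")
naive_est = utc.tz_convert("US/Eastern").tz_localize(None)  # zone discarded

# The 5-hour EST offset is real but recorded nowhere in the naive column.
offsets = utc.tz_localize(None) - naive_est  # both entries: 5 hours

# Defensive alternative: keep the column tz-aware until the last moment,
# so the zone travels with the data instead of living in the reader's head.
aware = utc.tz_convert("US/Eastern")
print(aware.tz)
```

Both sample dates fall before the 2018 DST switch (March 11), so the offset is a uniform five hours; with tz-aware data that bookkeeping is pandas' problem, not the user's.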
[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785529#comment-16785529 ]

Tanya Schlusser commented on ARROW-3543:
----------------------------------------

Hi [~Olafsson], I am still looking at this. My mom passed away last week and I have been listless and distracted since then. I am sorry for the inconvenience.

> [R] Time zone adjustment issue when reading Feather file written by Python
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-3543
>                 URL: https://issues.apache.org/jira/browse/ARROW-3543
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Olaf
>            Priority: Critical
>             Fix For: 0.13.0
[jira] [Commented] (ARROW-4716) [Benchmarking] Make machine detection script cross-platform
[ https://issues.apache.org/jira/browse/ARROW-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780524#comment-16780524 ]

Tanya Schlusser commented on ARROW-4716:
----------------------------------------

(y)

> [Benchmarking] Make machine detection script cross-platform
> ------------------------------------------------------------
>
>                 Key: ARROW-4716
>                 URL: https://issues.apache.org/jira/browse/ARROW-4716
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Benchmarking
>            Reporter: Antoine Pitrou
>            Priority: Major
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanya Schlusser updated ARROW-4313:
-----------------------------------
    Attachment: benchmark-data-model.png

> Define general benchmark database schema
> -----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: benchmark-data-model.erdplus, benchmark-data-model.erdplus, benchmark-data-model.png, benchmark-data-model.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit
> heterogeneity of hardware and programming languages:
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
>
> See discussion on the mailing list:
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
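The attribute list in the issue maps naturally onto a machine table plus a run table. The sketch below is a guessed illustration only: the table and column names are mine, not the project's actual data model (which lives in the attached `benchmark-data-model` files).

```python
import sqlite3

# Hypothetical two-table layout covering the attributes listed in the issue.
ddl = """
CREATE TABLE machine (
    machine_id             INTEGER PRIMARY KEY,
    name                   TEXT UNIQUE,  -- machine unique name ("user id")
    cpu_model              TEXT,
    cpu_frequency_hz       INTEGER,      -- actual clock, in case of overclocking
    l1_cache_bytes         INTEGER,
    l2_cache_bytes         INTEGER,
    l3_cache_bytes         INTEGER,
    cpu_throttling_enabled INTEGER,      -- NULL if not easily determined
    ram_bytes              INTEGER,
    gpu_model              TEXT          -- NULL when no GPU
);
CREATE TABLE benchmark_run (
    run_id         INTEGER PRIMARY KEY,
    machine_id     INTEGER REFERENCES machine(machine_id),
    run_timestamp  TEXT,
    git_commit     TEXT,                 -- commit hash of the codebase
    benchmark_name TEXT,
    languages      TEXT,                 -- e.g. 'C++,Python'
    elapsed_s      REAL,
    mean_s         REAL,                 -- NULL when unavailable
    stddev_s       REAL                  -- NULL when unavailable
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
conn.execute("INSERT INTO machine (name, cpu_model) VALUES (?, ?)",
             ("test-box", "Example CPU"))
rows = conn.execute("SELECT name FROM machine").fetchall()
print(rows)  # → [('test-box',)]
```

Splitting machine facts from run results keeps hardware described once per machine while allowing any number of heterogeneous runs to reference it.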
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanya Schlusser updated ARROW-4313:
-----------------------------------
    Attachment: (was: benchmark-data-model.png)

> Define general benchmark database schema
> -----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763145#comment-16763145 ]

Tanya Schlusser commented on ARROW-4313:
----------------------------------------

Thank you Antoine! I missed this last comment. "Actual frequency" is a good name, and I used it.
* I did not understand the conversations about little- and big-endian, so I did not add fields related to that to the database.
* I was surprised during testing by the behavior of NULLs in the database, so some things don't yet work the way I'd like (the example script fails in one place).

Thank you everyone for so much feedback. I have uploaded new files for the current data model and am happy to change things according to feedback. If you don't like something, it can be fixed :)

> Define general benchmark database schema
> -----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanya Schlusser updated ARROW-4313:
-----------------------------------
    Attachment: (was: benchmark-data-model.erdplus)

> Define general benchmark database schema
> -----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanya Schlusser updated ARROW-4313:
-----------------------------------
    Attachment: benchmark-data-model.erdplus

> Define general benchmark database schema
> -----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: benchmark-data-model.erdplus, benchmark-data-model.erdplus, benchmark-data-model.png, benchmark-data-model.png
[jira] [Updated] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README
[ https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanya Schlusser updated ARROW-4425:
-----------------------------------
    Description:
It would be nice to link to the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] -Confluence page-(*EDIT) in the Sphinx docs directly from the main project [README|https://github.com/apache/arrow/blob/master/README.md] (in the already existing "Getting involved" section) because it's a bit hard to find right now.

"Contributing" page: [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
Main project README: [https://github.com/apache/arrow/blob/master/README.md]

EDIT: Moving the "Contributing" wiki to a page in the actual [Arrow Sphinx docs (location in repo)|https://github.com/apache/arrow/tree/master/docs] would also make it easier to find and modify. An additional task, ARROW-4427, was added to do this.

  was:
It would be nice to link to the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] -Confluence page-(*EDIT) in the static docs directly from the main project [README|https://github.com/apache/arrow/blob/master/README.md] (in the already existing "Getting involved" section) because it's a bit hard to find right now.

"Contributing" page: [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
Main project README: [https://github.com/apache/arrow/blob/master/README.md]

EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site (location in repo)|https://github.com/apache/arrow/tree/master/site] would also make it easier to find and modify. An additional task, ARROW-4427, was added to do this.

> Add link to 'Contributing' page in the top-level Arrow README
> --------------------------------------------------------------
>
>                 Key: ARROW-4425
>                 URL: https://issues.apache.org/jira/browse/ARROW-4425
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Documentation
>            Reporter: Tanya Schlusser
>            Priority: Major
[jira] [Created] (ARROW-4429) Add git rebase tips to the 'Contributing' page in the developer docs
Tanya Schlusser created ARROW-4429:
--------------------------------------

             Summary: Add git rebase tips to the 'Contributing' page in the developer docs
                 Key: ARROW-4429
                 URL: https://issues.apache.org/jira/browse/ARROW-4429
             Project: Apache Arrow
          Issue Type: Task
          Components: Documentation
            Reporter: Tanya Schlusser

A recent discussion on the mailing list (link below) asked how contributors should handle rebasing. It would be helpful if the tips made it into the developer documentation somehow. I suggest the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] page (currently a wiki, but hopefully eventually part of the Sphinx docs; see ARROW-4427).

Here is the relevant thread:
[https://lists.apache.org/thread.html/c74d8027184550b8d9041e3f2414b517ffb76ccbc1d5aa4563d364b6@%3Cdev.arrow.apache.org%3E]
[jira] [Updated] (ARROW-4427) Move Confluence Wiki pages to the Sphinx docs
[ https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanya Schlusser updated ARROW-4427:
-----------------------------------
    Summary: Move Confluence Wiki pages to the Sphinx docs  (was: Move "Contributing to Apache Arrow" page to the static docs)

> Move Confluence Wiki pages to the Sphinx docs
> ----------------------------------------------
>
>                 Key: ARROW-4427
>                 URL: https://issues.apache.org/jira/browse/ARROW-4427
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Documentation
>            Reporter: Tanya Schlusser
>            Priority: Major
>
> It's hard to find and modify the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] and other developers' wiki pages in Confluence. If these were moved inside the project web page, that would make it easier.
>
> There are 5 steps to this:
> # Create a new directory inside `arrow/docs/source` to house the wiki pages. (It will look like the [cpp|https://github.com/apache/arrow/tree/master/docs/source/cpp] or [python|https://github.com/apache/arrow/tree/master/docs/source/python] directories.)
> # Copy the wiki page contents to new `*.rst` pages inside this new directory.
> # Add an `index.rst` that links to them all with enough description to help navigation.
> # Modify the Sphinx index page [`arrow/docs/source/index.rst`|https://github.com/apache/arrow/blob/master/docs/source/index.rst] to have an entry that points to the new index page made in step 3.
> # Modify the static site page [`arrow/site/_includes/header.html`|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33] to point to the newly created page instead of the wiki page.
[jira] [Comment Edited] (ARROW-4427) Move Confluence Wiki pages to the Sphinx docs
[ https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756190#comment-16756190 ]

Tanya Schlusser edited comment on ARROW-4427 at 1/30/19 3:06 PM:
-----------------------------------------------------------------

Hoo boy. A big task! Modified the description and title per the discussion above.

was (Author: tanya):
Hoo boy. And all of their child wiki pages.

> Move Confluence Wiki pages to the Sphinx docs
> ----------------------------------------------
>
>                 Key: ARROW-4427
>                 URL: https://issues.apache.org/jira/browse/ARROW-4427
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Documentation
>            Reporter: Tanya Schlusser
>            Priority: Major
[jira] [Updated] (ARROW-4427) Move "Contributing to Apache Arrow" page to the static docs
[ https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanya Schlusser updated ARROW-4427:
-----------------------------------
    Description:
It's hard to find and modify the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] and other developers' wiki pages in Confluence. If these were moved inside the project web page, that would make it easier.

There are 5 steps to this:
# Create a new directory inside `arrow/docs/source` to house the wiki pages. (It will look like the [cpp|https://github.com/apache/arrow/tree/master/docs/source/cpp] or [python|https://github.com/apache/arrow/tree/master/docs/source/python] directories.)
# Copy the wiki page contents to new `*.rst` pages inside this new directory.
# Add an `index.rst` that links to them all with enough description to help navigation.
# Modify the Sphinx index page [`arrow/docs/source/index.rst`|https://github.com/apache/arrow/blob/master/docs/source/index.rst] to have an entry that points to the new index page made in step 3.
# Modify the static site page [`arrow/site/_includes/header.html`|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33] to point to the newly created page instead of the wiki page.

  was:
It's hard to find and modify the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] wiki page in Confluence. If it were moved inside the static web page, that would make it easier.

There are two steps to this:
# Copy the wiki page contents to a new web page at the top "site" level (under `arrow/site/`, just like the [committers page|https://github.com/apache/arrow/blob/master/site/committers.html]), maybe named "contributing.html" or something.
# Modify the [navigation section in arrow/site/_includes/header.html|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33] to point to the newly created page instead of the wiki page.

The affected pages are all part of the Jekyll components, so there is no need to build the Sphinx part of the docs to check your work.

> Move "Contributing to Apache Arrow" page to the static docs
> -------------------------------------------------------------
>
>                 Key: ARROW-4427
>                 URL: https://issues.apache.org/jira/browse/ARROW-4427
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Documentation
>            Reporter: Tanya Schlusser
>            Priority: Major
[jira] [Commented] (ARROW-4427) Move "Contributing to Apache Arrow" page to the static docs
[ https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756190#comment-16756190 ]

Tanya Schlusser commented on ARROW-4427:
----------------------------------------

Hoo boy. And all of their child wiki pages.

> Move "Contributing to Apache Arrow" page to the static docs
> -------------------------------------------------------------
>
>                 Key: ARROW-4427
>                 URL: https://issues.apache.org/jira/browse/ARROW-4427
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Documentation
>            Reporter: Tanya Schlusser
>            Priority: Major
>
> It's hard to find and modify the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] wiki page in Confluence. If it were moved inside the static web page, that would make it easier.
>
> There are two steps to this:
> # Copy the wiki page contents to a new web page at the top "site" level (under `arrow/site/`, just like the [committers page|https://github.com/apache/arrow/blob/master/site/committers.html]), maybe named "contributing.html" or something.
> # Modify the [navigation section in arrow/site/_includes/header.html|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33] to point to the newly created page instead of the wiki page.
>
> The affected pages are all part of the Jekyll components, so there is no need to build the Sphinx part of the docs to check your work.
[jira] [Commented] (ARROW-4427) Move "Contributing to Apache Arrow" page to the static docs
[ https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756188#comment-16756188 ]

Tanya Schlusser commented on ARROW-4427:
----------------------------------------

OK. Am I understanding that a number of the wiki pages should be moved (anything not directly related to Jira)? So:
* [Contributing to Apache Arrow|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow?src=contextnavpagetreemode]
* [Guide for Committers and Project Maintainers|https://cwiki.apache.org/confluence/display/ARROW/Guide+for+Committers+and+Project+Maintainers]
* [HDFS Filesystem Support|https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support]
* [How to Verify Release Candidates|https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates]
* [Product Requirements|https://cwiki.apache.org/confluence/display/ARROW/Product+requirements] (possibly not this one, as it's empty)
* [Release Management Guide|https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide]

What do you think of another directory, then, in `arrow/docs/source` where all of the listed pages reside... say `arrow/docs/source/dev` or something?

> Move "Contributing to Apache Arrow" page to the static docs
> -------------------------------------------------------------
>
>                 Key: ARROW-4427
>                 URL: https://issues.apache.org/jira/browse/ARROW-4427
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Documentation
>            Reporter: Tanya Schlusser
>            Priority: Major
[jira] [Updated] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README
[ https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4425: --- Description: It would be nice to link to the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] -Confluence- page in the Sphinx docs directly from the main project [README|https://github.com/apache/arrow/blob/master/README.md] (in the already existing "Getting involved" section) because it's a bit hard to find right now. "contributing" page: [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] main project README: [https://github.com/apache/arrow/blob/master/README.md] EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site (location in repo)|https://github.com/apache/arrow/tree/master/site] would also make it easier to find and modify. An additional task was added to do this. was: It would be nice to link to the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] -Confluence- page in the Sphinx docs directly from the main project [README|https://github.com/apache/arrow/blob/master/README.md] (in the already existing "Getting involved" section) because it's a bit hard to find right now. "contributing" page: [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] main project README: [https://github.com/apache/arrow/blob/master/README.md] EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site (location in repo)|https://github.com/apache/arrow/tree/master/site] would also make it easier to find and modify. A sub-task was added to do this. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README
[ https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4425: --- Description: It would be nice to link to the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] -Confluence page-(*EDIT) in the static docs directly from the main project [README|https://github.com/apache/arrow/blob/master/README.md] (in the already existing "Getting involved" section) because it's a bit hard to find right now. "contributing" page: [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] main project README: [https://github.com/apache/arrow/blob/master/README.md] EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site (location in repo)|https://github.com/apache/arrow/tree/master/site] would also make it easier to find and modify. An additional task, ARROW-4427 was added to do this. was: It would be nice to link to the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] -Confluence- page in the Sphinx docs directly from the main project [README|https://github.com/apache/arrow/blob/master/README.md] (in the already existing "Getting involved" section) because it's a bit hard to find right now. "contributing" page: [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] main project README: [https://github.com/apache/arrow/blob/master/README.md] EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site (location in repo)|https://github.com/apache/arrow/tree/master/site] would also make it easier to find and modify. An additional task was added to do this. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4427) Move "Contributing to Apache Arrow" page to the static docs
Tanya Schlusser created ARROW-4427: -- Summary: Move "Contributing to Apache Arrow" page to the static docs Key: ARROW-4427 URL: https://issues.apache.org/jira/browse/ARROW-4427 Project: Apache Arrow Issue Type: Task Components: Documentation Reporter: Tanya Schlusser It's hard to find and modify the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] wiki page in Confluence. If it were moved to inside the static web page, that would make it easier. There are two steps to this: # Copy the wiki page contents to a new web page at the top "site" level (under arrow/site/ just like the [committers page|https://github.com/apache/arrow/blob/master/site/committers.html]) Maybe named "contributing.html" or something. # Modify the [navigation section in arrow/site/_includes/header.html|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33] to point to the newly created page instead of the wiki page. The affected pages are all part of the Jekyll components, so there isn't a need to build the Sphinx part of the docs to check your work. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README
[ https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4425: --- Description: It would be nice to link to the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] -Confluence- page in the Sphinx docs directly from the main project [README|https://github.com/apache/arrow/blob/master/README.md] (in the already existing "Getting involved" section) because it's a bit hard to find right now. "contributing" page: [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] main project README: [https://github.com/apache/arrow/blob/master/README.md] EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site (location in repo)|https://github.com/apache/arrow/tree/master/site] would also make it easier to find and modify. A sub-task was added to do this. was: It would be nice to add a link to the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] Confluence page directly from the main project [README|https://github.com/apache/arrow/blob/master/README.md] (in the already existing "Getting involved" section) because it's a bit hard to find right now. 
"contributing" page: [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] main project README: [https://github.com/apache/arrow/blob/master/README.md]
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README
[ https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756158#comment-16756158 ] Tanya Schlusser commented on ARROW-4425: Fair statement. Confluence is really hard for me to navigate. Updating and adding a sub-task.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README
Tanya Schlusser created ARROW-4425: -- Summary: Add link to 'Contributing' page in the top-level Arrow README Key: ARROW-4425 URL: https://issues.apache.org/jira/browse/ARROW-4425 Project: Apache Arrow Issue Type: Task Components: Documentation Reporter: Tanya Schlusser It would be nice to add a link to the ["Contributing to Apache Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] Confluence page directly from the main project [README|https://github.com/apache/arrow/blob/master/README.md] (in the already existing "Getting involved" section) because it's a bit hard to find right now. "contributing" page: [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] main project README: [https://github.com/apache/arrow/blob/master/README.md] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755570#comment-16755570 ] Tanya Schlusser commented on ARROW-4313: I think part of this was to allow anybody to contribute benchmarks from their own machine. And while dedicated benchmarking machines like the ones you will set up will have all parameters set for optimal benchmarking, benchmarks run on other machines may give different results. Collecting details about the machine that might explain those differences (in case someone cares to explore the dataset) is part of the goal of the data model. One concern, of course, is that people get wildly different results than a benchmark says, and may say "Oh boo–the representative person from the company made fake results that I can't replicate on my machine" ... and with details about a system, performance differences can maybe be traced back to differences in setup, because they were recorded. Not all fields need to be filled out all the time. My priorities are: # Identifying which fields are flat-out wrong # Differentiating between necessary columns and extraneous ones that can be left null To me, it is not a big deal to have an extra column dangling around that almost nobody uses. No harm. (Unless it's mislabeled or otherwise wrong; that's what I'm interested in getting out of the discussion here.) 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
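The fact-and-dimension layout discussed in this thread can be sketched as a small relational schema. The sketch below uses Python's built-in SQLite for illustration only; the table names (`machine`, `benchmark`, `benchmark_result`) follow the thread, but the reduced column list and the concrete types are assumptions, not the final Arrow schema. Note the nullable columns, matching the point above that not all fields need to be filled out all the time.

```python
import sqlite3

# A minimal, hypothetical subset of the proposed benchmark schema.
# Column names echo the thread; everything else is illustrative.
SCHEMA = """
CREATE TABLE machine (
    machine_id   INTEGER PRIMARY KEY,
    machine_name TEXT NOT NULL UNIQUE,   -- the "user id" of the reporter
    ram_bytes    INTEGER                 -- nullable: may go undetected
);
CREATE TABLE benchmark (
    benchmark_id   INTEGER PRIMARY KEY,
    benchmark_name TEXT NOT NULL
);
CREATE TABLE benchmark_result (
    result_id     INTEGER PRIMARY KEY,
    benchmark_id  INTEGER NOT NULL REFERENCES benchmark (benchmark_id),
    machine_id    INTEGER NOT NULL REFERENCES machine (machine_id),
    run_timestamp TEXT NOT NULL,         -- timestamp of the benchmark run
    git_hash      TEXT NOT NULL,         -- commit hash of the codebase
    value         REAL                   -- mean time; NULL when unavailable
);
"""

def make_db(path=":memory:"):
    """Create the sketch schema and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

A joined view over these tables (as mentioned later in the thread) would then expose one flat row per result for analysis.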
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755509#comment-16755509 ] Tanya Schlusser commented on ARROW-4313: [~aregm] I do not know. I am depending on the other people commenting here to make sure the hardware tables make sense because honestly I don't ever pay attention to hardware because my use cases never stress my system. At one point Wes suggested it. I am glad there is a debate.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313: --- Attachment: benchmark-data-model.erdplus
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313: --- Attachment: (was: benchmark-data-model.png)
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755504#comment-16755504 ] Tanya Schlusser commented on ARROW-4313: Thank you very much for everyone's detailed feedback. I absolutely need guidance with the Machine / CPU / GPU specs. I have updated the [^benchmark-data-model.png] and the [^benchmark-data-model.erdplus], and added all of the recommended columns. *Summary of changes:* * All the dimension tables have been renamed to exclude the `_dim`. (It was to distinguish dimension vs. fact tables.) * `cpu` ** Added a `cpu_thread_count`. ** Changed `cpu.speed_Hz` to two columns: `frequency_max_Hz` and `frequency_min_Hz` and also added a column `machine.overclock_frequency_Hz` to the `machine` table to allow for overclocking like Wes mentioned in the beginning. * `os` ** Added both `os.architecture_name` and `os.architecture_bits`, the latter forced to be in {32, 64}, and pulled from the architecture name (maybe it will become just a computed column in the joined view...). I think it's a good idea. * `project` ** Added a `project.project_name` (oversight before) * `benchmark_language` ** Split out `language` to `language_name` and `language_version` because maybe people will want to compare between them (e.g. Python 2.7, 3.5+) * `environment` ** Removed foreign key for `machine_id` — that should be in the benchmark report separately. Many machines will have the same environment. * `benchmark` ** Added foreign key for `benchmark_language_id`—a benchmark with the same name may exist for different languages. ** Added foreign key for `project_id`—moved it from table `benchmark_result` * `benchmark_result` ** Added foreign key for `machine_id` (was removed from `environment`) ** Deleted foreign key for `project_id`, placing it in `benchmark` (as stated above) *Questions* * `cpu` and `gpu` dimension ** Is it a mistake to make `cpu.cpu_model_name` unique? 
I mean, are the L1/L2/L3 cache levels, core counts, or any other attribute ever different for the same CPU model string? ** The same for GPU. ** I have commented the columns to say that `cpu_thread_count` corresponds to `sysctl -n hw.logicalcpu` and `cpu_core_count` corresponds to `sysctl -n hw.physicalcpu`; corrections gratefully accepted. ** Would it be less confusing to make the column names the exact same strings as correspond to their value from `sysctl`, e.g. change `cpu.cpu_model_name` to `cpu.cpu_brand_string` to correspond to the output of `sysctl -n machdep.cpu.brand_string`? ** On that note, is CPU RAM the same thing as `sysctl -n machdep.cpu.cache.size`? * `environment` ** I'm worried I'm doing something inelegant with the dependency list. It will hold everything – Conda / virtualenv; versions of Numpy; all permutations of the various dependencies in what in ASV is the dependency matrix.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
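The `sysctl` mapping discussed above is macOS-specific; a cross-platform collector would normally reach for psutil. As a standard-library-only sketch of filling a few of the proposed `cpu` columns (the physical-core count is deliberately left unset, since only a third-party library such as psutil reports it portably):

```python
import os
import platform

def detect_cpu():
    """Collect a few of the proposed `cpu` columns portably.

    os.cpu_count() reports *logical* CPUs, i.e. what `sysctl -n
    hw.logicalcpu` prints on macOS. The physical core count
    (`sysctl -n hw.physicalcpu`) has no stdlib equivalent, so it is
    left as None; psutil.cpu_count(logical=False) would fill it in.
    """
    return {
        "cpu_thread_count": os.cpu_count(),               # logical CPUs
        "cpu_core_count": None,                           # needs psutil
        "cpu_brand_string": platform.processor() or None, # often empty on Linux
    }
```

That `platform.processor()` is frequently an empty string on Linux is one reason psutil (or parsing /proc/cpuinfo) is the more reliable route for a real machine-detection script.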
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313: --- Attachment: (was: benchmark-data-model.erdplus)
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313: --- Attachment: benchmark-data-model.png
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4354) Explore Codespeed feasibility and ease of customization
[ https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753511#comment-16753511 ] Tanya Schlusser commented on ARROW-4354: I have attached a drawing of the codespeed data model, [^codespeed-data-model.png]. * Codespeed provides a data model and a web UI. * ASV provides a benchmark framework, a file-directory-based data model, and a static frontend. Both have a lot in common, and both would require revision to work with the additional machine specifications (GPU; CPU cache sizes) and multiple benchmark languages we are interested in. There is no benefit to using ASV from a web service perspective because with a database we will need either an API or a web interface, or both—either would require some front-end work; I'm ambivalent. Anyway, we can think about that later. Once we have decided on a data model and spun up the database, it may be nice to enable interaction with the database via HTTP. I am interested in exploring [postgraphile|https://www.graphile.org/postgraphile/] for an API; it literally parses the Postgres public schema and presents it as a GraphQL interface with no additional work, and has an existing build ready for [AWS lambda|https://github.com/graphile/postgraphile-lambda-example]. Then, all the data manipulation could be directly in the database. There is a {{--cors}} command-line option to enable use on a static site if we go that route. Codespeed also provides (I think) a JSON REST API so, again, the only question is whether we care to have a separate repo for a dynamic webpage. 
> Explore Codespeed feasibility and ease of customization > --- > > Key: ARROW-4354 > URL: https://issues.apache.org/jira/browse/ARROW-4354 > Project: Apache Arrow > Issue Type: Task > Components: Developer Tools >Reporter: Areg Melik-Adamyan >Assignee: Tanya Schlusser >Priority: Major > Labels: performance > Attachments: codespeed-data-model.png > > > @Tanya Schlusser can you please explore this option and report out? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4354) Explore Codespeed feasibility and ease of customization
[ https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4354: --- Attachment: codespeed-data-model.png
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753301#comment-16753301 ] Tanya Schlusser commented on ARROW-4313: I've attached a diagram [benchmark-data-model.png|https://issues.apache.org/jira/secure/attachment/12956481/benchmark-data-model.png] and a corresponding {{.erdplus}} file [benchmark-data-model.erdplus|https://issues.apache.org/jira/secure/attachment/12956482/benchmark-data-model.erdplus] (JSON--viewable and editable by getting a free account on [erdplus.com|https://erdplus.com/#/]) with a draft data model for everyone's consideration. I tried to incorporate elements of both the codespeed and the ASV projects. Happy to modify per feedback—or leave this to a more experienced person if I'm becoming the slow link. Of course there will be a view with all of the relevant information joined. > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. 
a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
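The attribute list above maps naturally onto a small relational schema. As a rough sketch only (all table and column names here are illustrative assumptions, not a proposal from the thread), one possible normalization keeps machines and benchmarks in their own tables, with one row per run:

```python
import sqlite3

# Illustrative sketch of a benchmark database covering the attributes
# listed above. Names are hypothetical, not the project's agreed schema.
SCHEMA = """
CREATE TABLE machine (
    machine_id       INTEGER PRIMARY KEY,
    name             TEXT UNIQUE NOT NULL,  -- unique machine name (the "user id")
    cpu_model        TEXT,
    cpu_frequency_hz INTEGER,               -- recorded in case of overclocking
    l1_cache_bytes   INTEGER,
    l2_cache_bytes   INTEGER,
    l3_cache_bytes   INTEGER,
    cpu_throttling   INTEGER,               -- NULL when not easily determined
    ram_bytes        INTEGER,
    gpu_model        TEXT                   -- NULL when no GPU
);
CREATE TABLE benchmark (
    benchmark_id INTEGER PRIMARY KEY,
    name         TEXT UNIQUE NOT NULL,
    languages    TEXT                       -- e.g. 'C++,Python'
);
CREATE TABLE benchmark_run (
    run_id        INTEGER PRIMARY KEY,
    benchmark_id  INTEGER NOT NULL REFERENCES benchmark,
    machine_id    INTEGER NOT NULL REFERENCES machine,
    run_timestamp TEXT NOT NULL,
    git_hash      TEXT NOT NULL,            -- commit hash of the codebase
    time_s        REAL NOT NULL,
    mean_time_s   REAL,                     -- NULL if not reported
    stddev_time_s REAL                      -- NULL if not reported
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Populate one run end to end to show the joins line up.
conn.execute("INSERT INTO machine (name, cpu_model, ram_bytes) VALUES (?, ?, ?)",
             ("tanya-laptop", "Intel i5", 8 * 2**30))
conn.execute("INSERT INTO benchmark (name, languages) VALUES (?, ?)",
             ("parquet-read", "C++,Python"))
conn.execute("INSERT INTO benchmark_run (benchmark_id, machine_id, run_timestamp,"
             " git_hash, time_s) VALUES (1, 1, '2019-01-25T17:15:00Z', 'abc123', 0.42)")
```

The "view with all of the relevant information joined" mentioned in the thread would then just join `benchmark_run` to `machine` and `benchmark`.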
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313: --- Attachment: benchmark-data-model.png > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313: --- Attachment: benchmark-data-model.erdplus > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-4354) Explore Codespeed feasibility and ease of customization
[ https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752464#comment-16752464 ] Tanya Schlusser edited comment on ARROW-4354 at 1/25/19 5:15 PM: - Thank you for the clarifications, Wes and Areg :)! was (Author: tanya): Thank you for the clarifications, Wes and Arek :)! > Explore Codespeed feasibility and ease of customization > --- > > Key: ARROW-4354 > URL: https://issues.apache.org/jira/browse/ARROW-4354 > Project: Apache Arrow > Issue Type: Task > Components: Developer Tools >Reporter: Areg Melik-Adamyan >Assignee: Tanya Schlusser >Priority: Major > Labels: performance > > @Tanya Schlusser can you please explore this option and report out? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4354) Explore Codespeed feasibility and ease of customization
[ https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752464#comment-16752464 ] Tanya Schlusser commented on ARROW-4354: Thank you for the clarifications, Wes and Arek :)! > Explore Codespeed feasibility and ease of customization > --- > > Key: ARROW-4354 > URL: https://issues.apache.org/jira/browse/ARROW-4354 > Project: Apache Arrow > Issue Type: Task > Components: Developer Tools >Reporter: Areg Melik-Adamyan >Assignee: Tanya Schlusser >Priority: Major > Labels: performance > > @Tanya Schlusser can you please explore this option and report out? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4354) Explore Codespeed feasibility and ease of customization
[ https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751203#comment-16751203 ] Tanya Schlusser commented on ARROW-4354: Codespeed is nice too. [Link to codespeed repo|https://github.com/tobami/codespeed]. However, it looks like Codespeed is licensed under LGPL v2.1 — I believe LGPL 3 is the first version that is compatible with Apache, which only means their codebase can't be in the Arrow codebase... maybe not a big deal. I agree the backend is nice, and simple is a good thing. > Explore Codespeed feasibility and ease of customization > --- > > Key: ARROW-4354 > URL: https://issues.apache.org/jira/browse/ARROW-4354 > Project: Apache Arrow > Issue Type: Task > Components: Developer Tools >Reporter: Areg Melik-Adamyan >Priority: Major > Labels: performance > > @Tanya Schlusser can you please explore this option and report out? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4354) Explore Codespeed feasibility and ease of customization
[ https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751189#comment-16751189 ] Tanya Schlusser commented on ARROW-4354: Nice! One of the contributors on this project, Antoine Pitrou (didn't `at` him but he has commented on ARROW-4313), has contributed to [Airspeed Velocity (ASV)|https://github.com/airspeed-velocity/asv], which I have been looking at too, and which informed his initial comments on [the mailing list|https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E]. Here are benchmarks for a [bunch of pandas-related projects|https://pandas.pydata.org/speed/] using Airspeed Velocity. Maybe we can take the best of both worlds, and use the database schema from Codespeed and the mostly static components of ASV. I am very impressed with the functionality of Airspeed Velocity. > Explore Codespeed feasibility and ease of customization > --- > > Key: ARROW-4354 > URL: https://issues.apache.org/jira/browse/ARROW-4354 > Project: Apache Arrow > Issue Type: Task > Components: Developer Tools >Reporter: Areg Melik-Adamyan >Priority: Major > Labels: performance > > @Tanya Schlusser can you please explore this option and report out? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748260#comment-16748260 ] Tanya Schlusser commented on ARROW-4313: Pinging [~aregm] who started the email discussion, and volunteering to help in what ways I can 👋. I said I'd mock a backend and will edit this comment with a hyperlink when a mock is up. > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3324) [Python] Users reporting memory leaks using pa.pq.ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729131#comment-16729131 ] Tanya Schlusser commented on ARROW-3324: The file [arrow_3324_leak_on_write.py|https://issues.apache.org/jira/secure/attachment/12953078/arrow_3324_leak_on_write.py] contains a modified version of the stackoverflow post in Wes's comment above, with {{memory_profiler}} to show the memory use. The memory use does increase as the code cycles through multiple calls to {{write_table}}. > [Python] Users reporting memory leaks using pa.pq.ParquetDataset > > > Key: ARROW-3324 > URL: https://issues.apache.org/jira/browse/ARROW-3324 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.12.0 > > Attachments: arrow_3324_leak_on_write.py > > Time Spent: 50m > Remaining Estimate: 0h > > See: > * https://github.com/apache/arrow/issues/2614 > * https://github.com/apache/arrow/issues/2624 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
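The pattern in the attached script (call the writer repeatedly and watch whether memory keeps climbing) can be reproduced with only the standard library. A hedged sketch, using {{tracemalloc}} rather than {{memory_profiler}} and a deliberately leaky stand-in for {{write_table}}:

```python
import tracemalloc

def growth_per_call(fn, iterations=5):
    """Return the traced-memory delta (bytes) after each call to fn."""
    tracemalloc.start()
    deltas = []
    prev, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        fn()
        current, _ = tracemalloc.get_traced_memory()
        deltas.append(current - prev)
        prev = current
    tracemalloc.stop()
    return deltas

# Stand-in for repeated write_table calls: each call retains ~10 kB,
# mimicking the growth reported above. A leak-free function would show
# deltas near zero after the first call.
_retained = []
def leaky_write():
    _retained.append(bytearray(10_000))

deltas = growth_per_call(leaky_write)
print(deltas)  # each entry should be roughly 10 kB for this leaky stand-in
```

To profile the real case one would pass a closure calling {{pq.ParquetWriter.write_table}} as {{fn}} instead of the stand-in.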
[jira] [Updated] (ARROW-3324) [Python] Users reporting memory leaks using pa.pq.ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-3324: --- Attachment: arrow_3324_leak_on_write.py > [Python] Users reporting memory leaks using pa.pq.ParquetDataset > > > Key: ARROW-3324 > URL: https://issues.apache.org/jira/browse/ARROW-3324 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.12.0 > > Attachments: arrow_3324_leak_on_write.py > > Time Spent: 40m > Remaining Estimate: 0h > > See: > * https://github.com/apache/arrow/issues/2614 > * https://github.com/apache/arrow/issues/2624 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3324) [Python] Users reporting memory leaks using pa.pq.ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728819#comment-16728819 ] Tanya Schlusser commented on ARROW-3324: I could not reproduce either of the two GitHub issues above, but could identify a leak using {{memory_profiler}} on the stackoverflow code (copied in [this script|https://github.com/apache/arrow/blob/master/python/scripts/test_leak.py]). I observed that {{FileSerializer.properties_.use_count()}} increments more than expected whenever {{FileSerializer.AppendRowGroup}} is called. The offending line is {{FileSerializer.metadata_->AppendRowGroup()}}. I believe that the count should only go up once per new row group, instead of once per column plus once per row group. I think the root cause is that in {{RowGroupMetaDataBuilder::RowGroupMetaDataBuilderImpl.Finish}}, the vector of {{column_builders_}} ought to be reset and cleared each time before it is repopulated. I hope to submit a pull request for this even though it may not address all of the issues stated here. Since the GitHub issues were about memory leaks on "read", and the fix is related only to "write", this observation certainly doesn't address everything in this JIRA issue. Even after the fix I'll post, my memory_profiler code still shows an increase in memory use upon additional calls to {{pq.ParquetWriter.write_table}}, which I think is OK because the row group is incrementing with each write too. So I may be wrong or have still missed something. Regardless, I hope these notes are useful to someone. 
> [Python] Users reporting memory leaks using pa.pq.ParquetDataset > > > Key: ARROW-3324 > URL: https://issues.apache.org/jira/browse/ARROW-3324 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > See: > * https://github.com/apache/arrow/issues/2614 > * https://github.com/apache/arrow/issues/2624 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4050) core dump on reading parquet file
[ https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16727509#comment-16727509 ] Tanya Schlusser commented on ARROW-4050: Hi [~cav71], maybe I can be useful. You're right that arrow (cpp) builds a library called {{libarrow_python}} which exposes the parts of arrow that the Python library will use. That is the first step, run with {{cmake}} inside the directory {{arrow/cpp/build}}. But to make the Python library there must also be a second step, run inside {{arrow/python}}: the pyarrow library uses Cython (I am learning Cython -- this [rectangle example|https://cython.readthedocs.io/en/latest/src/userguide/wrapping_CPlusPlus.html] was helpful) to wrap all of these exposed objects in Python for the end user. h6. details / example: The [pyarrow.__init__.py|https://github.com/apache/arrow/blob/master/python/pyarrow/__init__.py] imports a ton of stuff from {{pyarrow.lib}}. But there is no {{pyarrow/lib.py}} file in the source code. Instead, there are * {{pyarrow/lib.pxd}} (corresponds to a C++ header file) * {{pyarrow/lib.pyx}} (corresponds to a C++ source file) which must be compiled using Cython. Running {{setup.py build_ext --inplace}} uses Cython to # auto-generate C++ code ({{pyarrow/lib.cpp}}, {{pyarrow/lib_api.h}}) # compile it to a shared object (on my laptop, {{pyarrow/lib.cpython-36m-darwin.so}}). That shared object is the {{pyarrow.lib}} imported in {{pyarrow/__init__.py}}. I hope it is useful! P.S. 
The [script linked above|https://issues.apache.org/jira/secure/attachment/12952061/working_python37_build_on_osx.sh] successfully built the code on my laptop > core dump on reading parquet file > - > > Key: ARROW-4050 > URL: https://issues.apache.org/jira/browse/ARROW-4050 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Antonio Cavallo >Priority: Blocker > Labels: pull-request-available > Attachments: bug.parquet, working_python37_build_on_osx.sh > > Time Spent: 20m > Remaining Estimate: 0h > > Hi, > I've a crash when doing this: > {{import pyarrow.parquet as pq}} > {{pq.read_table('bug.parquet')}} > [^bug.parquet] > (this is the same generated by > arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip()) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
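As a small illustration of that last step (not the Arrow build itself): Python recognizes a compiled extension module by its platform-specific, ABI-tagged filename suffix, which is why {{pyarrow.lib}} imports cleanly even though no {{lib.py}} exists.

```python
import importlib.machinery

# The suffixes Python will accept for a compiled extension module; on the
# macOS laptop described above the built file was lib.cpython-36m-darwin.so,
# matching one of these (the exact list varies by platform and version).
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print(suffixes)
```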
[jira] [Commented] (ARROW-3020) [Python] Addition of option to allow empty Parquet row groups
[ https://issues.apache.org/jira/browse/ARROW-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726349#comment-16726349 ] Tanya Schlusser commented on ARROW-3020: I looked into this and do not believe the Parquet code permits this at the moment, despite the comment in the OP's hyperlink saying they thought it did. pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and here's the current code (also [linked here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]).
{code:title=from "parquet/arrow/writer.cc"}
Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
    return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
    chunk_size = impl_->properties().max_row_group_length();
  }
  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
    int64_t offset = chunk * chunk_size;
    int64_t size = std::min(chunk_size, table.num_rows() - offset);
    RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
    for (int i = 0; i < table.num_columns(); i++) {
      auto chunked_data = table.column(i)->data();
      RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
                         PARQUET_IGNORE_NOT_OK(Close()));
    }
  }
  return Status::OK();
}
{code}
> [Python] Addition of option to allow empty Parquet row groups > - > > Key: ARROW-3020 > URL: https://issues.apache.org/jira/browse/ARROW-3020 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Alex Mendelson >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > While our use case is not common, I was able to find one related request from > roughly a year ago. Could this be added as a feature? 
> https://issues.apache.org/jira/browse/PARQUET-1047 > *Motivation* > We have an application where each row is associated with one of N contexts, > though a minority of contexts may have no associated rows. When encountering > the Nth context, we will wish to retrieve all the associated rows. Row groups > would provide a natural way to index the data, as the nth context could > naturally relate to the nth row group. > Unfortunately, this is not possible at the present time, as pyarrow does not > support writing empty row groups. If one writes a pyarrow.Table containing > zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final > file, and this distorts the indexing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
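The control flow quoted from {{WriteTable}} makes the limitation concrete: the row-group loop only runs while {{chunk * chunk_size < table.num_rows()}}, so a zero-row table writes no row groups at all. A Python rendering of just that chunking arithmetic (a sketch of the loop, not the C++ code; the default for {{max_row_group_length}} is illustrative):

```python
def plan_row_groups(num_rows, chunk_size, max_row_group_length=64 * 1024 * 1024):
    """Mirror the (offset, size) pairs WriteTable's loop would produce."""
    if chunk_size <= 0:
        raise ValueError("chunk size per row_group must be greater than 0")
    chunk_size = min(chunk_size, max_row_group_length)
    groups = []
    chunk = 0
    while chunk * chunk_size < num_rows:
        offset = chunk * chunk_size
        groups.append((offset, min(chunk_size, num_rows - offset)))
        chunk += 1
    return groups

print(plan_row_groups(10, 4))  # → [(0, 4), (4, 4), (8, 2)]
print(plan_row_groups(0, 4))   # → [] (no row group is ever written)
```

Supporting empty row groups would mean adding a path that calls the equivalent of {{NewRowGroup(0)}} even when the loop body never executes.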
[jira] [Comment Edited] (ARROW-3020) [Python] Addition of option to allow empty Parquet row groups
[ https://issues.apache.org/jira/browse/ARROW-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726349#comment-16726349 ] Tanya Schlusser edited comment on ARROW-3020 at 12/21/18 12:34 AM: --- I looked into this and do not believe the Parquet code permits this at the moment despite the comment in the OP's hyperlink saying they thought it did. pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and here's the current code (also [linked here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]). If {{table.num_rows()}} is zero, nothing will ever happen. {code:title=from "parquet/arrow/writer.cc"} Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) { if (chunk_size <= 0) { return Status::Invalid("chunk size per row_group must be greater than 0"); } else if (chunk_size > impl_->properties().max_row_group_length()) { chunk_size = impl_->properties().max_row_group_length(); } for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) { int64_t offset = chunk * chunk_size; int64_t size = std::min(chunk_size, table.num_rows() - offset); RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close())); for (int i = 0; i < table.num_columns(); i++) { auto chunked_data = table.column(i)->data(); RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size), PARQUET_IGNORE_NOT_OK(Close())); } } return Status::OK(); } {code} was (Author: tanya): I looked into this and do not believe the Parquet code permits this at the moment despite the comment in the OP's hyperlink saying they thought it did. pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and here's the current code (also [linked here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]). If {{table.num_rows()}} is zero nothing will ever happen. 
{code:title=from "parquet/arrow/writer.h"} Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) { if (chunk_size <= 0) { return Status::Invalid("chunk size per row_group must be greater than 0"); } else if (chunk_size > impl_->properties().max_row_group_length()) { chunk_size = impl_->properties().max_row_group_length(); } for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) { int64_t offset = chunk * chunk_size; int64_t size = std::min(chunk_size, table.num_rows() - offset); RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close())); for (int i = 0; i < table.num_columns(); i++) { auto chunked_data = table.column(i)->data(); RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size), PARQUET_IGNORE_NOT_OK(Close())); } } return Status::OK(); } {code} > [Python] Addition of option to allow empty Parquet row groups > - > > Key: ARROW-3020 > URL: https://issues.apache.org/jira/browse/ARROW-3020 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Alex Mendelson >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > While our use case is not common, I was able to find one related request from > roughly a year ago. Could this be added as a feature? > https://issues.apache.org/jira/browse/PARQUET-1047 > *Motivation* > We have an application where each row is associated with one of N contexts, > though a minority of contexts may have no associated rows. When encountering > the Nth context, we will wish to retrieve all the associated rows. Row groups > would provide a natural way to index the data, as the nth context could > naturally relate to the nth row group. > Unfortunately, this is not possible at the present time, as pyarrow does not > support writing empty row groups. If one writes a pyarrow.Table containing > zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final > file, and this distorts the indexing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-3020) [Python] Addition of option to allow empty Parquet row groups
[ https://issues.apache.org/jira/browse/ARROW-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726349#comment-16726349 ] Tanya Schlusser edited comment on ARROW-3020 at 12/21/18 12:32 AM: --- I looked into this and do not believe the Parquet code permits this at the moment despite the comment in the OP's hyperlink saying they thought it did. pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and here's the current code (also [linked here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]). If {{table.num_rows()}} is zero nothing will ever happen. {code:title=from "parquet/arrow/writer.h"} Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) { if (chunk_size <= 0) { return Status::Invalid("chunk size per row_group must be greater than 0"); } else if (chunk_size > impl_->properties().max_row_group_length()) { chunk_size = impl_->properties().max_row_group_length(); } for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) { int64_t offset = chunk * chunk_size; int64_t size = std::min(chunk_size, table.num_rows() - offset); RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close())); for (int i = 0; i < table.num_columns(); i++) { auto chunked_data = table.column(i)->data(); RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size), PARQUET_IGNORE_NOT_OK(Close())); } } return Status::OK(); } {code} was (Author: tanya): I looked into this and do not believe the Parquet code permits this at the moment despite the comment in the OP's hyperlink saying they thought it did. pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and here's the current code (also [linked here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]). 
{code:title=from "parquet/arrow/writer.h"} Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) { if (chunk_size <= 0) { return Status::Invalid("chunk size per row_group must be greater than 0"); } else if (chunk_size > impl_->properties().max_row_group_length()) { chunk_size = impl_->properties().max_row_group_length(); } for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) { int64_t offset = chunk * chunk_size; int64_t size = std::min(chunk_size, table.num_rows() - offset); RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close())); for (int i = 0; i < table.num_columns(); i++) { auto chunked_data = table.column(i)->data(); RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size), PARQUET_IGNORE_NOT_OK(Close())); } } return Status::OK(); } {code} > [Python] Addition of option to allow empty Parquet row groups > - > > Key: ARROW-3020 > URL: https://issues.apache.org/jira/browse/ARROW-3020 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Alex Mendelson >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > While our use case is not common, I was able to find one related request from > roughly a year ago. Could this be added as a feature? > https://issues.apache.org/jira/browse/PARQUET-1047 > *Motivation* > We have an application where each row is associated with one of N contexts, > though a minority of contexts may have no associated rows. When encountering > the Nth context, we will wish to retrieve all the associated rows. Row groups > would provide a natural way to index the data, as the nth context could > naturally relate to the nth row group. > Unfortunately, this is not possible at the present time, as pyarrow does not > support writing empty row groups. If one writes a pyarrow.Table containing > zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final > file, and this distorts the indexing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4056) [C++] boost-cpp toolchain packages causing crashes on Xcode > 6.4
[ https://issues.apache.org/jira/browse/ARROW-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724160#comment-16724160 ] Tanya Schlusser commented on ARROW-4056: Not sure if it's useful, but adding a link to the Anaconda docs about their toolchain https://conda.io/docs/user-guide/tasks/build-packages/compiler-tools.html#using-the-compiler-packages > [C++] boost-cpp toolchain packages causing crashes on Xcode > 6.4 > - > > Key: ARROW-4056 > URL: https://issues.apache.org/jira/browse/ARROW-4056 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > EDIT: the issue has been present for a large portion of 2018. I found this > when merging the macOS C++ builds and changed the build type to Xcode 8.3: > https://travis-ci.org/wesm/arrow/jobs/469297420#L2856 > I reported the issue into conda-forge at > https://github.com/conda-forge/boost-cpp-feedstock/issues/40 > It seems that the Ray project worked around this earlier this year: > https://github.com/ray-project/ray/pull/1688 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-4050) core dump on reading parquet file
[ https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723180#comment-16723180 ] Tanya Schlusser edited comment on ARROW-4050 at 12/17/18 5:26 PM: -- I can confirm building the environment with {{boost-cpp=1.68.0}} works. I also had to add {{make}} to successfully build {{jemalloc_ep}}, although without it the build and test still worked (the 'install' step right after it worked anyway; it was just disconcerting to have a failed step). I uploaded the entire sequence of commands that produces a successful python test in [{{working_python37_build_on_osx.sh}}|https://issues.apache.org/jira/secure/attachment/12952061/working_python37_build_on_osx.sh]. There are [numpy empty truth test deprecation warnings|https://github.com/numpy/numpy/issues/9583] but that's it. Hope it helps. was (Author: tanya): I can confirm building the environment with {{boost-cpp=1.68.0}} works. I also had to add {{make}} to successfully build {{jemalloc_ep}}, although without it the build and test still worked (the 'install' step right after it worked anyway; it was just disconcerting to have a failed step). I uploaded the entire sequence of commands that produces a successful python test in {{working_python37_build_on_osx.sh}}. There are [numpy empty truth test deprecation warnings|https://github.com/numpy/numpy/issues/9583] but that's it. Hope it helps. 
> core dump on reading parquet file > - > > Key: ARROW-4050 > URL: https://issues.apache.org/jira/browse/ARROW-4050 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Antonio Cavallo >Priority: Blocker > Attachments: bug.parquet, working_python37_build_on_osx.sh > > > Hi, > I've a crash when doing this: > {{import pyarrow.parquet as pq}} > {{pq.read_table('bug.parquet')}} > [^bug.parquet] > (this is the same generated by > arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip()) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4050) core dump on reading parquet file
[ https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723180#comment-16723180 ] Tanya Schlusser commented on ARROW-4050: I can confirm building the environment with {{boost-cpp=1.68.0}} works. I also had to add {{make}} to successfully build {{jemalloc_ep}}, although without it the build and test still worked (the 'install' step right after it worked anyway; it was just disconcerting to have a failed step). I uploaded the entire sequence of commands that produces a successful python test in {{working_python37_build_on_osx.sh}}. There are [numpy empty truth test deprecation warnings|https://github.com/numpy/numpy/issues/9583] but that's it. Hope it helps. > core dump on reading parquet file > - > > Key: ARROW-4050 > URL: https://issues.apache.org/jira/browse/ARROW-4050 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Antonio Cavallo >Priority: Blocker > Attachments: bug.parquet, working_python37_build_on_osx.sh > > > Hi, > I've a crash when doing this: > {{import pyarrow.parquet as pq}} > {{pq.read_table('bug.parquet')}} > [^bug.parquet] > (this is the same generated by > arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip()) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4050) core dump on reading parquet file
[ https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4050: --- Attachment: working_python37_build_on_osx.sh > core dump on reading parquet file > - > > Key: ARROW-4050 > URL: https://issues.apache.org/jira/browse/ARROW-4050 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Antonio Cavallo >Priority: Blocker > Attachments: bug.parquet, working_python37_build_on_osx.sh > > > Hi, > I've a crash when doing this: > {{import pyarrow.parquet as pq}} > {{pq.read_table('bug.parquet')}} > [^bug.parquet] > (this is the same generated by > arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip()) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4050) core dump on reading parquet file
[ https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722929#comment-16722929 ] Tanya Schlusser commented on ARROW-4050: Hello [~cav71], I may be able to help – I'm new enough that I just went through the pain of setting up my environment too, and better yet, my system sounds like yours: I have a mac with Xcode 10.1. I did the things you said: followed the documentation in a new Conda environment, and indeed have a segfault in the python parquet tests. However, I can switch to a separate Conda environment and build and test just fine. I am currently going through my env to see what is different between the two and will report back when I figure out the relevant difference. I had trouble setting up too, but was too shy to speak up about it, and clearly the documentation could be improved – at least for us mac users! If you want to try and figure it out too, the thing I did to get stuff working was read through the Python dockerfile and the scripts in arrow/dev and arrow/ci. The problem is I tried 1000 things and didn't pay attention to what worked, or I'd answer this with more useful information. > core dump on reading parquet file > - > > Key: ARROW-4050 > URL: https://issues.apache.org/jira/browse/ARROW-4050 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Antonio Cavallo >Priority: Blocker > Attachments: bug.parquet > > > Hi, > I've a crash when doing this: > {{import pyarrow.parquet as pq}} > {{pq.read_table('bug.parquet')}} > [^bug.parquet] > (this is the same generated by > arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip()) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4039) Update link to 'development.rst' page from Python README.md
Tanya Schlusser created ARROW-4039: -- Summary: Update link to 'development.rst' page from Python README.md Key: ARROW-4039 URL: https://issues.apache.org/jira/browse/ARROW-4039 Project: Apache Arrow Issue Type: Task Components: Documentation, Python Reporter: Tanya Schlusser When the Sphinx docs were restructured, the link in the [README|https://github.com/apache/arrow/blob/master/python/README.md] changed from [https://github.com/apache/arrow/blob/master/python/doc/source/development.rst] to [https://github.com/apache/arrow/blob/master/docs/source/python/development.rst] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3230) [Python] Missing comparisons on ChunkedArray, Table
[ https://issues.apache.org/jira/browse/ARROW-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722174#comment-16722174 ] Tanya Schlusser commented on ARROW-3230: Woo, this looks like my level! Thank you [~kszucs], I will try it. > [Python] Missing comparisons on ChunkedArray, Table > --- > > Key: ARROW-3230 > URL: https://issues.apache.org/jira/browse/ARROW-3230 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.10.0 >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.13.0 > > > Table and ChunkedArray equality are not implemented, meaning they fall back > on identity. Instead they should invoke equals(), as on Column. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
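The fallback described in this issue is ordinary Python semantics: a class that defines no {{__eq__}} compares by identity, even when it offers an explicit equals() method. A minimal stand-in (hypothetical names, not the pyarrow API) illustrates the gap the issue asks to close:

```python
class FakeTable:
    """Stand-in for a container type that lacks __eq__ (hypothetical)."""
    def __init__(self, data):
        self.data = data

    def equals(self, other):
        # Explicit value comparison, analogous to Column.equals() in the issue.
        return self.data == other.data

a = FakeTable([1, 2, 3])
b = FakeTable([1, 2, 3])
print(a == b)       # False: without __eq__, '==' falls back to identity
print(a.equals(b))  # True: explicit value comparison
```

Wiring __eq__ to equals() would make the two agree, which is what the ticket proposes for Table and ChunkedArray.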
[jira] [Assigned] (ARROW-3230) [Python] Missing comparisons on ChunkedArray, Table
[ https://issues.apache.org/jira/browse/ARROW-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser reassigned ARROW-3230: -- Assignee: Tanya Schlusser > [Python] Missing comparisons on ChunkedArray, Table > --- > > Key: ARROW-3230 > URL: https://issues.apache.org/jira/browse/ARROW-3230 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.10.0 >Reporter: Antoine Pitrou >Assignee: Tanya Schlusser >Priority: Major > Fix For: 0.13.0 > > > Table and ChunkedArray equality are not implemented, meaning they fall back > on identity. Instead they should invoke equals(), as on Column. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719488#comment-16719488 ] Tanya Schlusser commented on ARROW-3543: I can confirm this bug is still present. Most recent commit in my pull is c0ac97f126c98fb29e81d6544adfea9d4ab74aff For others, the R libraries needed to re-run Olaf's code (in addition to arrow) are readr, dplyr, and lubridate. I will mess around but won't be hurt if a stronger R coder takes this before I finish. > [R] Time zone adjustment issue when reading Feather file written by Python > -- > > Key: ARROW-3543 > URL: https://issues.apache.org/jira/browse/ARROW-3543 > Project: Apache Arrow > Issue Type: Bug >Reporter: Olaf >Priority: Critical > Fix For: 0.12.0 > > > Hello the dream team, > Pasting from [https://github.com/wesm/feather/issues/351] > Thanks for this wonderful package. I was playing with feather and some > timestamps and I noticed some dangerous behavior. Maybe it is a bug. > Consider this > > {code:java} > import pandas as pd > import feather > import numpy as np > df = pd.DataFrame( > {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), > pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 > 14:01:02.200')]} > ) > df['timestamp_est'] = > pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None) > df > Out[17]: > string_time_utc timestamp_est > 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531 > 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 > 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 > {code} > Here I create the corresponding `EST` timestamp of my original timestamps (in > `UTC` time). > Now saving the dataframe to `csv` or to `feather` will generate two > completely different results. > > {code:java} > df.to_csv('P://testing.csv') > df.to_feather('P://testing.feather') > {code} > Switching to R. 
> Using the good old `csv` gives me something a bit annoying, but expected. R > thinks my timezone is `UTC` by default, and wrongly attached this timezone to > `timestamp_est`. No big deal, I can always use `with_tz` or even better: > import as character and process as timestamp while in R. > > {code:java} > > dataframe <- read_csv('P://testing.csv') > Parsed with column specification: > cols( > X1 = col_integer(), > string_time_utc = col_datetime(format = ""), > timestamp_est = col_datetime(format = "") > ) > Warning message: > Missing column names filled in: 'X1' [1] > > > > dataframe %>% mutate(mytimezone = tz(timestamp_est)) > A tibble: 3 x 4 > X1 string_time_utc timestamp_est > > 1 0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 > 2 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 > 3 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 > mytimezone > > 1 UTC > 2 UTC > 3 UTC {code} > {code:java} > #Now look at what happens with feather: > > > dataframe <- read_feather('P://testing.feather') > > > > dataframe %>% mutate(mytimezone = tz(timestamp_est)) > A tibble: 3 x 3 > string_time_utc timestamp_est mytimezone > > 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 "" > 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 "" > 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 "" {code} > My timestamps have been converted!!! pure insanity. > Am I missing something here? > Thanks!! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
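The hazard in Olaf's example (converting to Eastern time and then stripping the zone) can be reproduced with the standard library alone. This is an illustration of naive-timestamp semantics, not pyarrow or feather code; it requires Python 3.9+ for zoneinfo:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

utc = datetime(2018, 2, 1, 14, 0, 0, tzinfo=timezone.utc)
eastern = utc.astimezone(ZoneInfo("US/Eastern"))
naive = eastern.replace(tzinfo=None)  # analogue of .dt.tz_localize(None)

# The naive value keeps Eastern wall-clock time but records no zone, so a
# reader that assumes a different zone (as R assumes UTC here) shifts it again.
print(naive)  # 2018-02-01 09:00:00
```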
[jira] [Commented] (ARROW-3866) [Python] Column metadata is not transferred to tables in pyarrow
[ https://issues.apache.org/jira/browse/ARROW-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718423#comment-16718423 ] Tanya Schlusser commented on ARROW-3866: Hello [~frutti93], would you mind if I give it a try? Since it has a "newbie" label it looks like my kind of thing :). > [Python] Column metadata is not transferred to tables in pyarrow > > > Key: ARROW-3866 > URL: https://issues.apache.org/jira/browse/ARROW-3866 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Seb Fru >Priority: Major > Labels: features, newbie > Fix For: 0.12.0 > > > Hello everyone, > transferring this from Github for Pyarrow. While working with pyarrow I > noticed that field metadata does not get carried forward when creating a > table out of several columns. Is this intended behaviour or is there a way to > add column metadata later on? The last command in my example does not return > anything. > I also could not verify whether this data would be written to parquet later > on, because I could not find a way to add field metadata directly to a table. > > {code:java} > >>>import pyarrow as pa > >>>import pyarrow.parquet as pq > >>>arr1 = pa.array([1,2]) > >>>arr2 = pa.array([3,4]) > >>>field1 = pa.field('field1', pa.int64()) > >>>field2 = pa.field('field2', pa.int64()) > >>>field1 = field1.add_metadata({'foo1': 'bar1'}) > >>>field2 = field2.add_metadata({'foo2': 'bar2'}) > >>>field1.metadata {b'foo1': b'bar1'} > >>>field2.metadata {b'foo2': b'bar2'} > >>>col1 = pa.column(field1, arr1) > >>>col2 = pa.column(field2, arr2) > >>>col1.field.metadata {b'foo1': b'bar1'} > >>>tab = pa.Table.from_arrays([col1, col2]) > >>>tab pyarrow.Table field1: int64 field2: int64 > >>>tab.column(0).field.metadata > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3792) [Python] Segmentation fault when writing empty RecordBatches to Parquet
[ https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710926#comment-16710926 ] Tanya Schlusser commented on ARROW-3792: Sweet! I'll stop on this then :) > [Python] Segmentation fault when writing empty RecordBatches to Parquet > --- > > Key: ARROW-3792 > URL: https://issues.apache.org/jira/browse/ARROW-3792 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.11.1 > Environment: Fedora 28, pyarrow installed with pip > Fedora 29, pyarrow installed from conda-forge >Reporter: Suvayu Ali >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > Attachments: minimal_bug_arrow3792.py, pq-bug.py > > > h2. Background > I am trying to convert a very sparse dataset to parquet (~3% rows in a range > are populated). The file I am working with spans upto ~63M rows. I decided to > iterate in batches of 500k rows, 127 batches in total. Each row batch is a > {{RecordBatch}}. I create 4 batches at a time, and write to a parquet file > incrementally. Something like this: > {code:python} > batches = [..] # 4 batches > tbl = pa.Table.from_batches(batches) > pqwriter.write_table(tbl, row_group_size=15000) > # same issue with pq.write_table(..) > {code} > I was getting a segmentation fault at the final step, I narrowed it down to a > specific iteration. I noticed that iteration had empty batches; specifically, > [0, 0, 2876, 14423]. 
The number of rows for each {{RecordBatch}} for the > whole dataset is below: > {code:python} > [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799, > 15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800, > 14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167, > 14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535, > 13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878, > 15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330, > 15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634, > 15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171, > 15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122, > 16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532, > 15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248, > 15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742, > 18807, 18789, 14258, 0, 0] > {code} > On excluding the empty {{RecordBatch}}-es, the segfault goes away, but > unfortunately I couldn't create a proper minimal example with synthetic data. > h2. Not quite minimal example > The data I am using is from the 1000 Genome project, which has been public > for many years, so we can be reasonably sure the data is good. The following > steps should help you replicate the issue. > # Download the data file (and index), about 330MB: > {code:bash} > $ wget > ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi} > {code} > # Install the Cython library {{pysam}}, a thin wrapper around the reference > implementation of the VCF file spec. You will need {{zlib}} headers, but > that's probably not a problem :) > {code:bash} > $ pip3 install --user pysam > {code} > # Now you can use the attached script to replicate the crash. > h2. 
Extra information > I have tried attaching gdb, the backtrace when the segfault occurs is shown > below (maybe it helps, this is how I realised empty batches could be the > reason). > {code} > (gdb) bt > #0 0x7f3e7676d670 in > parquet::TypedColumnWriter > >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray > const*) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #1 0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous > namespace)::ArrowColumnWriter::TypedWriteBatch, > arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #2 0x7f3e7673a3d4 in parquet::arrow::(anonymous > namespace)::ArrowColumnWriter::Write(arrow::Array const&) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #3 0x7f3e7673df09 in > parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr > const&, long, long) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #4 0x7f3e7673c74d in > parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr > const&, long, long) () >from > /home/user/miniconda3/lib/python3.6/site-pa
[jira] [Commented] (ARROW-3792) [Python] Segmentation fault when writing empty RecordBatches to Parquet
[ https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710914#comment-16710914 ] Tanya Schlusser commented on ARROW-3792: I can now reproduce the bug with more minimal code (see the attached file [minimal_bug_arrow3792.py|https://issues.apache.org/jira/secure/attachment/12950784/minimal_bug_arrow3792.py]) – it is a problem with the list-valued column: I think the segfault occurs when an empty batch is supposed to contain a column of list type. I'm still going to look at it more, but in case I'm slow and someone else wants to do it faster, you no longer need to download the genome dataset or {{pysam}}. > [Python] Segmentation fault when writing empty RecordBatches to Parquet > --- > > Key: ARROW-3792 > URL: https://issues.apache.org/jira/browse/ARROW-3792 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.11.1 > Environment: Fedora 28, pyarrow installed with pip > Fedora 29, pyarrow installed from conda-forge >Reporter: Suvayu Ali >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > Attachments: minimal_bug_arrow3792.py, pq-bug.py > > > h2. Background > I am trying to convert a very sparse dataset to parquet (~3% rows in a range > are populated). The file I am working with spans up to ~63M rows. I decided to > iterate in batches of 500k rows, 127 batches in total. Each row batch is a > {{RecordBatch}}. I create 4 batches at a time, and write to a parquet file > incrementally. Something like this: > {code:python} > batches = [..] # 4 batches > tbl = pa.Table.from_batches(batches) > pqwriter.write_table(tbl, row_group_size=15000) > # same issue with pq.write_table(..) > {code} > I was getting a segmentation fault at the final step, I narrowed it down to a > specific iteration. I noticed that iteration had empty batches; specifically, > [0, 0, 2876, 14423]. 
The number of rows for each {{RecordBatch}} for the > whole dataset is below: > {code:python} > [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799, > 15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800, > 14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167, > 14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535, > 13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878, > 15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330, > 15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634, > 15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171, > 15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122, > 16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532, > 15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248, > 15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742, > 18807, 18789, 14258, 0, 0] > {code} > On excluding the empty {{RecordBatch}}-es, the segfault goes away, but > unfortunately I couldn't create a proper minimal example with synthetic data. > h2. Not quite minimal example > The data I am using is from the 1000 Genome project, which has been public > for many years, so we can be reasonably sure the data is good. The following > steps should help you replicate the issue. > # Download the data file (and index), about 330MB: > {code:bash} > $ wget > ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi} > {code} > # Install the Cython library {{pysam}}, a thin wrapper around the reference > implementation of the VCF file spec. You will need {{zlib}} headers, but > that's probably not a problem :) > {code:bash} > $ pip3 install --user pysam > {code} > # Now you can use the attached script to replicate the crash. > h2. 
Extra information > I have tried attaching gdb, the backtrace when the segfault occurs is shown > below (maybe it helps, this is how I realised empty batches could be the > reason). > {code} > (gdb) bt > #0 0x7f3e7676d670 in > parquet::TypedColumnWriter > >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray > const*) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #1 0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous > namespace)::ArrowColumnWriter::TypedWriteBatch, > arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #2 0x7f3e7673a3d4 in parquet::arrow::(anonymous > namespace)::ArrowColumnWriter::Write(arrow::Array const
[jira] [Updated] (ARROW-3792) [Python] Segmentation fault when writing empty RecordBatches to Parquet
[ https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-3792: --- Attachment: minimal_bug_arrow3792.py > [Python] Segmentation fault when writing empty RecordBatches to Parquet > --- > > Key: ARROW-3792 > URL: https://issues.apache.org/jira/browse/ARROW-3792 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.11.1 > Environment: Fedora 28, pyarrow installed with pip > Fedora 29, pyarrow installed from conda-forge >Reporter: Suvayu Ali >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > Attachments: minimal_bug_arrow3792.py, pq-bug.py > > > h2. Background > I am trying to convert a very sparse dataset to parquet (~3% rows in a range > are populated). The file I am working with spans upto ~63M rows. I decided to > iterate in batches of 500k rows, 127 batches in total. Each row batch is a > {{RecordBatch}}. I create 4 batches at a time, and write to a parquet file > incrementally. Something like this: > {code:python} > batches = [..] # 4 batches > tbl = pa.Table.from_batches(batches) > pqwriter.write_table(tbl, row_group_size=15000) > # same issue with pq.write_table(..) > {code} > I was getting a segmentation fault at the final step, I narrowed it down to a > specific iteration. I noticed that iteration had empty batches; specifically, > [0, 0, 2876, 14423]. 
The number of rows for each {{RecordBatch}} for the > whole dataset is below: > {code:python} > [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799, > 15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800, > 14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167, > 14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535, > 13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878, > 15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330, > 15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634, > 15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171, > 15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122, > 16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532, > 15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248, > 15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742, > 18807, 18789, 14258, 0, 0] > {code} > On excluding the empty {{RecordBatch}}-es, the segfault goes away, but > unfortunately I couldn't create a proper minimal example with synthetic data. > h2. Not quite minimal example > The data I am using is from the 1000 Genome project, which has been public > for many years, so we can be reasonably sure the data is good. The following > steps should help you replicate the issue. > # Download the data file (and index), about 330MB: > {code:bash} > $ wget > ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi} > {code} > # Install the Cython library {{pysam}}, a thin wrapper around the reference > implementation of the VCF file spec. You will need {{zlib}} headers, but > that's probably not a problem :) > {code:bash} > $ pip3 install --user pysam > {code} > # Now you can use the attached script to replicate the crash. > h2. 
Extra information > I have tried attaching gdb, the backtrace when the segfault occurs is shown > below (maybe it helps, this is how I realised empty batches could be the > reason). > {code} > (gdb) bt > #0 0x7f3e7676d670 in > parquet::TypedColumnWriter > >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray > const*) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #1 0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous > namespace)::ArrowColumnWriter::TypedWriteBatch, > arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #2 0x7f3e7673a3d4 in parquet::arrow::(anonymous > namespace)::ArrowColumnWriter::Write(arrow::Array const&) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #3 0x7f3e7673df09 in > parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr > const&, long, long) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #4 0x7f3e7673c74d in > parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr > const&, long, long) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #5 0x0
[jira] [Commented] (ARROW-3792) [Python] Segmentation fault when writing empty RecordBatches to Parquet
[ https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710248#comment-16710248 ] Tanya Schlusser commented on ARROW-3792: I have followed Suvayu's instructions and can successfully reproduce the segfault. I am going to try working on this, thanks! > [Python] Segmentation fault when writing empty RecordBatches to Parquet > --- > > Key: ARROW-3792 > URL: https://issues.apache.org/jira/browse/ARROW-3792 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.11.1 > Environment: Fedora 28, pyarrow installed with pip > Fedora 29, pyarrow installed from conda-forge >Reporter: Suvayu Ali >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > Attachments: pq-bug.py > > > h2. Background > I am trying to convert a very sparse dataset to parquet (~3% rows in a range > are populated). The file I am working with spans upto ~63M rows. I decided to > iterate in batches of 500k rows, 127 batches in total. Each row batch is a > {{RecordBatch}}. I create 4 batches at a time, and write to a parquet file > incrementally. Something like this: > {code:python} > batches = [..] # 4 batches > tbl = pa.Table.from_batches(batches) > pqwriter.write_table(tbl, row_group_size=15000) > # same issue with pq.write_table(..) > {code} > I was getting a segmentation fault at the final step, I narrowed it down to a > specific iteration. I noticed that iteration had empty batches; specifically, > [0, 0, 2876, 14423]. 
The number of rows for each {{RecordBatch}} for the > whole dataset is below: > {code:python} > [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799, > 15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800, > 14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167, > 14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535, > 13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878, > 15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330, > 15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634, > 15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171, > 15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122, > 16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532, > 15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248, > 15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742, > 18807, 18789, 14258, 0, 0] > {code} > On excluding the empty {{RecordBatch}}-es, the segfault goes away, but > unfortunately I couldn't create a proper minimal example with synthetic data. > h2. Not quite minimal example > The data I am using is from the 1000 Genome project, which has been public > for many years, so we can be reasonably sure the data is good. The following > steps should help you replicate the issue. > # Download the data file (and index), about 330MB: > {code:bash} > $ wget > ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi} > {code} > # Install the Cython library {{pysam}}, a thin wrapper around the reference > implementation of the VCF file spec. You will need {{zlib}} headers, but > that's probably not a problem :) > {code:bash} > $ pip3 install --user pysam > {code} > # Now you can use the attached script to replicate the crash. > h2. 
Extra information > I have tried attaching gdb, the backtrace when the segfault occurs is shown > below (maybe it helps, this is how I realised empty batches could be the > reason). > {code} > (gdb) bt > #0 0x7f3e7676d670 in > parquet::TypedColumnWriter > >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray > const*) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #1 0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous > namespace)::ArrowColumnWriter::TypedWriteBatch, > arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #2 0x7f3e7673a3d4 in parquet::arrow::(anonymous > namespace)::ArrowColumnWriter::Write(arrow::Array const&) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #3 0x7f3e7673df09 in > parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr > const&, long, long) () >from > /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11 > #4 0x7f3e7673c74d in > parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr > const&, lon
[jira] [Commented] (ARROW-3629) [Python] Add write_to_dataset to Python Sphinx API listing
[ https://issues.apache.org/jira/browse/ARROW-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709088#comment-16709088 ] Tanya Schlusser commented on ARROW-3629: Pull request [#3089|https://github.com/apache/arrow/pull/3089], provided I understood this correctly and it only entails adding a single line to {color:#654982}{{python/doc/source/api.rst}}{color}. Comment: The doc build was difficult, but possibly because I'm a noob. I'm commenting rather than making a JIRA issue because I have no idea whether these are actual issues or just a newbie's lack of knowledge. Running {color:#654982}{{dev/gen_apidocs.sh}}{color} on a clean pull, with only my single line changed in {color:#654982}{{api.rst}}{color}, failed: The {color:#654982}{{iwyu}}{color} image in {color:#654982}{{dev/docker-compose.yml}}{color} failed with this path issue: - {color:#654982}{{ERROR: build path /arrow/dev/iwyu either does not exist, is not accessible, or is not a valid URL.}}{color} - I commented it out and then could continue. 
The Java docs wouldn't compile either at first: - I think because there's a {color:#654982}{{conda install}}{color} for a second version of {color:#654982}{{maven}}{color} below the {color:#654982}{{apt-get install maven}}{color} in the [Dockerfile|https://github.com/apache/arrow/blob/master/dev/gen_apidocs/Dockerfile], which puts Java 11 in the front of the {color:#654982}{{PATH}}{color} breaking the lookup for class {color:#654982}{{javax.annotation.Generated}}{color} which moves from [Java 8|https://docs.oracle.com/javase/8/docs/api/javax/annotation/Generated.html] to [Java 9|https://docs.oracle.com/javase/9/docs/api/javax/annotation/processing/Generated.html] (and here is where it landed in [Java 11|https://docs.oracle.com/en/java/javase/11/docs/api/java.compiler/javax/annotation/processing/Generated.html]) - when I deleted that line in the Dockerfile, the Java code compiled but didn't pass a test, because of a different missing dependency (that I didn't note; happy to figure it out if it's actually meaningful) - so I commented out the Java build section in {color:#654982}{{dev/gen_apidocs/create_documents.sh}}{color} The Javascript docs failed on a dependency I didn't note (happy to; just didn't want to waste time if it's my noob problem) - so I commented it out too; then the remaining doc generation worked Please disregard if it's my lack of understanding. Otherwise I am happy to investigate further/add issues :). > [Python] Add write_to_dataset to Python Sphinx API listing > -- > > Key: ARROW-3629 > URL: https://issues.apache.org/jira/browse/ARROW-3629 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2860) [Python] Null values in a single partition of Parquet dataset, results in invalid schema on read
[ https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708137#comment-16708137 ] Tanya Schlusser commented on ARROW-2860: I think this was resolved by ARROW-2891 ([https://issues.apache.org/jira/browse/ARROW-2891]) in pull request 2302 ([https://github.com/apache/arrow/pull/2302]). When I run {{example_failure.py}}, it does not fail and returns the expected result.
> [Python] Null values in a single partition of Parquet dataset, results in invalid schema on read
> ---
>
> Key: ARROW-2860
> URL: https://issues.apache.org/jira/browse/ARROW-2860
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Sam Oluwalana
> Priority: Major
> Labels: parquet
> Fix For: 0.12.0
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> from datetime import datetime, timedelta
>
> def generate_data(event_type, event_id, offset=0):
>     """Generate data."""
>     now = datetime.utcnow() + timedelta(seconds=offset)
>     obj = {
>         'event_type': event_type,
>         'event_id': event_id,
>         'event_date': now.date(),
>         'foo': None,
>         'bar': u'hello',
>     }
>     if event_type == 2:
>         obj['foo'] = 1
>         obj['bar'] = u'world'
>     if event_type == 3:
>         obj['different'] = u'data'
>         obj['bar'] = u'event type 3'
>     else:
>         obj['different'] = None
>     return obj
>
> data = [
>     generate_data(1, 1, 1),
>     generate_data(1, 1, 3600 * 72),
>     generate_data(2, 1, 1),
>     generate_data(2, 1, 3600 * 72),
>     generate_data(3, 1, 1),
>     generate_data(3, 1, 3600 * 72),
> ]
>
> df = pd.DataFrame.from_records(data, index='event_id')
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/tmp/events',
>                     partition_cols=['event_type', 'event_date'])
>
> dataset = pq.ParquetDataset('/tmp/events')
> table = dataset.read()
> print(table.num_rows)
> {code}
> Expected output:
> {code:python}
> 6
> {code}
> Actual:
> {code:python}
> python example_failure.py
> Traceback (most recent call last):
>   File "example_failure.py", line 43, in
>     dataset = pq.ParquetDataset('/tmp/events')
>   File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 745, in __init__
>     self.validate_schemas()
>   File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 775, in validate_schemas
>     dataset_schema))
> ValueError: Schema in partition[event_type=2, event_date=0] /tmp/events/event_type=3/event_date=2018-07-16 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
> bar: string
> different: string
> foo: double
> event_id: int64
> metadata
>
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> vs
> bar: string
> different: null
> foo: double
> event_id: int64
> metadata
>
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> {code}
> Apparently pyarrow infers the schema for each partition individually. The partitions under `event_type=3/event_date=*` both have values in the column `different`, whereas the other partitions do not; the discrepancy causes the all-`None` values of the other partitions to be labeled with `pandas_type` `empty` instead of `unicode`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2504) [Website] Add ApacheCon NA link
[ https://issues.apache.org/jira/browse/ARROW-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698493#comment-16698493 ] Tanya Schlusser commented on ARROW-2504: Newbie here – this looks like a good first issue for me, so I'm claiming it. Thank you!
> [Website] Add ApacheCon NA link
> ---
>
> Key: ARROW-2504
> URL: https://issues.apache.org/jira/browse/ARROW-2504
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Website
> Reporter: Wes McKinney
> Priority: Major
>
> See instructions in http://apache.org/events/README.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)