[jira] [Commented] (ARROW-4716) [Benchmarking] Make machine detection script cross-platform

2019-04-13 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816962#comment-16816962
 ] 

Tanya Schlusser commented on ARROW-4716:


Linux first is a great choice.

Python is my favorite language and I am happy to do this. The existing shell 
script was only a stopgap because it was easy and quick.
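
A minimal sketch of what a Python version could look like, assuming {{psutil}} 
for the hardware fields and the XML query from the ticket for the GPU (the 
field names below are illustrative placeholders, not a settled schema):

{code:python}
import json
import platform
import subprocess
import xml.etree.ElementTree as ET

import psutil  # third-party: pip install psutil


def machine_info():
    """Collect machine details; fields stay None when undetectable."""
    freq = psutil.cpu_freq()  # may be unavailable on some platforms
    info = {
        "os_name": platform.system(),            # 'Linux', 'Darwin', 'Windows'
        "architecture_name": platform.machine(),
        "cpu_model_name": platform.processor() or None,
        "cpu_core_count": psutil.cpu_count(logical=False),
        "cpu_thread_count": psutil.cpu_count(logical=True),
        "frequency_max_Hz": int(freq.max * 1_000_000) if freq else None,
        "memory_bytes": psutil.virtual_memory().total,
        "gpu_product_name": None,
    }
    try:
        # The XML query suggested in the ticket: nvidia-smi -q -i 0 -x
        result = subprocess.run(["nvidia-smi", "-q", "-i", "0", "-x"],
                                capture_output=True, check=True)
        root = ET.fromstring(result.stdout)
        info["gpu_product_name"] = root.findtext("gpu/product_name")
    except (FileNotFoundError, subprocess.CalledProcessError):
        pass  # no NVIDIA GPU or driver present; leave the field as None
    return info


if __name__ == "__main__":
    print(json.dumps(machine_info(), indent=2))
{code}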

I am very, very sorry for dropping the ball the past couple of months. My mom 
passed away and I have been a total wreck (and moved back home from her house 
and got a job), but I still want to contribute and hope you will accept me back 
now. You all rock very much!

> [Benchmarking] Make machine detection script cross-platform
> ---
>
> Key: ARROW-4716
> URL: https://issues.apache.org/jira/browse/ARROW-4716
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Antoine Pitrou
>Priority: Major
>
> The machine detection script ({{make_machine_json.sh}}) currently looks like 
> it will only work properly on macOS. Ideally it should work more or less 
> correctly on all of macOS, Linux, and Windows (some values may remain 
> undetected on some platforms).
> This probably entails:
> - switching to Python rather than bash
> - using something like [psutil|https://psutil.readthedocs.io/en/latest/] to 
> grab useful machine information
> - calling {{nvidia-smi}} to query GPU characteristics (for example 
> "nvidia-smi -q -i 0 -x")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python

2019-03-06 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785957#comment-16785957
 ] 

Tanya Schlusser commented on ARROW-3543:


<3 Thank you Olaf!

> [R] Time zone adjustment issue when reading Feather file written by Python
> --
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Olaf
>Priority: Critical
> Fix For: 0.13.0
>
>
> Hello, dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some 
> timestamps and noticed some dangerous behavior. Maybe it is a bug.
> Consider this:
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> 
> df = pd.DataFrame(
>     {'string_time_utc': [pd.to_datetime('2018-02-01 14:00:00.531'),
>                          pd.to_datetime('2018-02-01 14:01:00.456'),
>                          pd.to_datetime('2018-03-05 14:01:02.200')]}
> )
> df['timestamp_est'] = (pd.to_datetime(df.string_time_utc)
>                        .dt.tz_localize('UTC')
>                        .dt.tz_convert('US/Eastern')
>                        .dt.tz_localize(None))
> df
> Out[17]:
>           string_time_utc           timestamp_est
> 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
> 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamps from my original `UTC` 
> timestamps.
> Now saving the dataframe to `csv` or to `feather` generates two completely 
> different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R 
> thinks my timezone is `UTC` by default, and wrongly attaches that timezone to 
> `timestamp_est`. No big deal: I can always use `with_tz`, or even better, 
> import as character and process as timestamp while in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
> Parsed with column specification:
> cols(
>   X1 = col_integer(),
>   string_time_utc = col_datetime(format = ""),
>   timestamp_est = col_datetime(format = "")
> )
> Warning message:
> Missing column names filled in: 'X1' [1]
> 
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 4
>      X1 string_time_utc         timestamp_est           mytimezone
> 1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 UTC
> 2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 UTC
> 3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 UTC
> {code}
> Now look at what happens with feather:
> {code:java}
> > dataframe <- read_feather('P://testing.feather')
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 3
>   string_time_utc         timestamp_est           mytimezone
> 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
> 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
> 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
> {code}
> My timestamps have been converted!!! Pure insanity.
> Am I missing something here?
> Thanks!!
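
A side note on avoiding the ambiguity, whichever component is at fault: the 
round trip is unambiguous if the Python side keeps the timestamps tz-aware in 
UTC instead of stripping the zone with {{tz_localize(None)}}, since the zone 
then travels with the column. A minimal sketch reusing the data from the 
report above ({{timestamp_utc}} is an illustrative column name):

{code:python}
import pandas as pd

df = pd.DataFrame(
    {'string_time_utc': [pd.to_datetime('2018-02-01 14:00:00.531'),
                         pd.to_datetime('2018-02-01 14:01:00.456'),
                         pd.to_datetime('2018-03-05 14:01:02.200')]}
)
# Stay tz-aware: the UTC zone is stored with the column, so a reader
# (R or Python) does not have to guess what the wall-clock values mean.
df['timestamp_utc'] = df.string_time_utc.dt.tz_localize('UTC')
df.to_feather('testing.feather')
# Conversion for display then happens at read time, e.g. in R with
# lubridate::with_tz(dataframe$timestamp_utc, "US/Eastern").
{code}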



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python

2019-03-06 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785529#comment-16785529
 ] 

Tanya Schlusser commented on ARROW-3543:


Hi [~Olafsson], I am still looking at this. My mom passed away last week and I 
have been listless and distracted since then. I am sorry for the inconvenience.

> [R] Time zone adjustment issue when reading Feather file written by Python
> --
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Olaf
>Priority: Critical
> Fix For: 0.13.0
>
>
> Hello, dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some 
> timestamps and noticed some dangerous behavior. Maybe it is a bug.
> Consider this:
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> 
> df = pd.DataFrame(
>     {'string_time_utc': [pd.to_datetime('2018-02-01 14:00:00.531'),
>                          pd.to_datetime('2018-02-01 14:01:00.456'),
>                          pd.to_datetime('2018-03-05 14:01:02.200')]}
> )
> df['timestamp_est'] = (pd.to_datetime(df.string_time_utc)
>                        .dt.tz_localize('UTC')
>                        .dt.tz_convert('US/Eastern')
>                        .dt.tz_localize(None))
> df
> Out[17]:
>           string_time_utc           timestamp_est
> 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
> 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamps from my original `UTC` 
> timestamps.
> Now saving the dataframe to `csv` or to `feather` generates two completely 
> different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R 
> thinks my timezone is `UTC` by default, and wrongly attaches that timezone to 
> `timestamp_est`. No big deal: I can always use `with_tz`, or even better, 
> import as character and process as timestamp while in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
> Parsed with column specification:
> cols(
>   X1 = col_integer(),
>   string_time_utc = col_datetime(format = ""),
>   timestamp_est = col_datetime(format = "")
> )
> Warning message:
> Missing column names filled in: 'X1' [1]
> 
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 4
>      X1 string_time_utc         timestamp_est           mytimezone
> 1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 UTC
> 2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 UTC
> 3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 UTC
> {code}
> Now look at what happens with feather:
> {code:java}
> > dataframe <- read_feather('P://testing.feather')
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 3
>   string_time_utc         timestamp_est           mytimezone
> 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
> 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
> 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
> {code}
> My timestamps have been converted!!! Pure insanity.
> Am I missing something here?
> Thanks!!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4716) [Benchmarking] Make machine detection script cross-platform

2019-02-28 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780524#comment-16780524
 ] 

Tanya Schlusser commented on ARROW-4716:


(y)

> [Benchmarking] Make machine detection script cross-platform
> ---
>
> Key: ARROW-4716
> URL: https://issues.apache.org/jira/browse/ARROW-4716
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Antoine Pitrou
>Priority: Major
>
> The machine detection script ({{make_machine_json.sh}}) currently looks like 
> it will only work properly on macOS. Ideally it should work more or less 
> correctly on all of macOS, Linux, and Windows (some values may remain 
> undetected on some platforms).
> This probably entails:
> - switching to Python rather than bash
> - using something like [psutil|https://psutil.readthedocs.io/en/latest/] to 
> grab useful machine information
> - calling {{nvidia-smi}} to query GPU characteristics (for example 
> "nvidia-smi -q -i 0 -x")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-02-07 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: benchmark-data-model.png

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, 
> benchmark-data-model.erdplus, benchmark-data-model.png, 
> benchmark-data-model.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-02-07 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: (was: benchmark-data-model.png)

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-02-07 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763145#comment-16763145
 ] 

Tanya Schlusser commented on ARROW-4313:


Thank you Antoine! I missed this last comment. "actual frequency" is a good 
name, and I used it.
 * I did not understand the conversations about little- and big-endian byte 
order, so I did not add fields for that to the database.
 * I was surprised during testing by the behavior of nulls in the database, so 
some things don't yet work the way I'd like (the example script fails in one 
place).

Thank you everyone for so much feedback. I have uploaded new files for the 
current data model and am happy to change things according to feedback. If you 
don't like something, it can be fixed :) 

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-02-07 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: (was: benchmark-data-model.erdplus)

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-02-07 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: benchmark-data-model.erdplus

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, 
> benchmark-data-model.erdplus, benchmark-data-model.png, 
> benchmark-data-model.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README

2019-01-30 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4425:
---
Description: 
It would be nice to link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 -Confluence page-(*EDIT) in the Sphinx docs directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 

 

EDIT: Moving the "Contributing" wiki to a page in the actual [Arrow Sphinx docs 
(location in repo)|https://github.com/apache/arrow/tree/master/docs] would also 
make it easier to find and modify. An additional task, ARROW-4427, was added to 
do this.

  was:
It would be nice to link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 -Confluence page-(*EDIT) in the static docs directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 

 

EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site 
(location in repo)|https://github.com/apache/arrow/tree/master/site] would also 
make it easier to find and modify. An additional task, ARROW-4427  was added to 
do this.


> Add link to 'Contributing' page in the top-level Arrow README
> -
>
> Key: ARROW-4425
> URL: https://issues.apache.org/jira/browse/ARROW-4425
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It would be nice to link to the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  -Confluence page-(*EDIT) in the Sphinx docs directly from the main project 
> [README|https://github.com/apache/arrow/blob/master/README.md] (in the 
> already existing "Getting involved" section) because it's a bit hard to find 
> right now.
> "contributing" page: 
> [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
> main project README: [https://github.com/apache/arrow/blob/master/README.md] 
>  
> EDIT: Moving the "Contributing" wiki to a page in the actual [Arrow Sphinx 
> docs (location in repo)|https://github.com/apache/arrow/tree/master/docs] 
> would also make it easier to find and modify. An additional task, ARROW-4427, 
> was added to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4429) Add git rebase tips to the 'Contributing' page in the developer docs

2019-01-30 Thread Tanya Schlusser (JIRA)
Tanya Schlusser created ARROW-4429:
--

 Summary: Add git rebase tips to the 'Contributing' page in the 
developer docs
 Key: ARROW-4429
 URL: https://issues.apache.org/jira/browse/ARROW-4429
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Tanya Schlusser


A recent discussion on the mailing list (link below) asked how contributors 
should handle rebasing. It would be helpful if those tips made it into the 
developer documentation. I suggest the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 page, currently a wiki but hopefully eventually part of the Sphinx docs 
(ARROW-4427).

Here is the relevant thread:

[https://lists.apache.org/thread.html/c74d8027184550b8d9041e3f2414b517ffb76ccbc1d5aa4563d364b6@%3Cdev.arrow.apache.org%3E]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4427) Move Confluence Wiki pages to the Sphinx docs

2019-01-30 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4427:
---
Summary: Move Confluence Wiki pages to the Sphinx docs  (was: Move 
"Contributing to Apache Arrow" page to the static docs)

> Move Confluence Wiki pages to the Sphinx docs
> -
>
> Key: ARROW-4427
> URL: https://issues.apache.org/jira/browse/ARROW-4427
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It's hard to find and modify the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  and other developers' wiki pages in Confluence. If these were moved to 
> inside the project web page, that would make it easier.
> There are 5 steps to this:
>  # Create a new directory inside of `arrow/docs/source` to house the wiki 
> pages. (It will look like the 
> [cpp|https://github.com/apache/arrow/tree/master/docs/source/cpp] or 
> [python|https://github.com/apache/arrow/tree/master/docs/source/python] 
> directories.)
>  # Copy the wiki page contents to new `*.rst` pages inside this new directory.
>  # Add an `index.rst` that links to them all with enough description to help 
> navigation.
>  # Modify the Sphinx index page 
> [`arrow/docs/source/index.rst`|https://github.com/apache/arrow/blob/master/docs/source/index.rst]
>  to have an entry that points to the new index page made in step 3
>  # Modify the static site page 
> [`arrow/site/_includes/header.html`|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
>  to point to the newly created page instead of the wiki page.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4427) Move Confluence Wiki pages to the Sphinx docs

2019-01-30 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756190#comment-16756190
 ] 

Tanya Schlusser edited comment on ARROW-4427 at 1/30/19 3:06 PM:
-

Hoo boy. A big task! Modified the description + title per discussion above.


was (Author: tanya):
Hoo boy. And all of their child wiki pages.

> Move Confluence Wiki pages to the Sphinx docs
> -
>
> Key: ARROW-4427
> URL: https://issues.apache.org/jira/browse/ARROW-4427
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It's hard to find and modify the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  and other developers' wiki pages in Confluence. If these were moved to 
> inside the project web page, that would make it easier.
> There are 5 steps to this:
>  # Create a new directory inside of `arrow/docs/source` to house the wiki 
> pages. (It will look like the 
> [cpp|https://github.com/apache/arrow/tree/master/docs/source/cpp] or 
> [python|https://github.com/apache/arrow/tree/master/docs/source/python] 
> directories.)
>  # Copy the wiki page contents to new `*.rst` pages inside this new directory.
>  # Add an `index.rst` that links to them all with enough description to help 
> navigation.
>  # Modify the Sphinx index page 
> [`arrow/docs/source/index.rst`|https://github.com/apache/arrow/blob/master/docs/source/index.rst]
>  to have an entry that points to the new index page made in step 3
>  # Modify the static site page 
> [`arrow/site/_includes/header.html`|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
>  to point to the newly created page instead of the wiki page.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4427) Move "Contributing to Apache Arrow" page to the static docs

2019-01-30 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4427:
---
Description: 
It's hard to find and modify the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 and other developers' wiki pages in Confluence. If these were moved to inside 
the project web page, that would make it easier.

There are 5 steps to this:
 # Create a new directory inside of `arrow/docs/source` to house the wiki 
pages. (It will look like the 
[cpp|https://github.com/apache/arrow/tree/master/docs/source/cpp] or 
[python|https://github.com/apache/arrow/tree/master/docs/source/python] 
directories.)
 # Copy the wiki page contents to new `*.rst` pages inside this new directory.
 # Add an `index.rst` that links to them all with enough description to help 
navigation.
 # Modify the Sphinx index page 
[`arrow/docs/source/index.rst`|https://github.com/apache/arrow/blob/master/docs/source/index.rst]
 to have an entry that points to the new index page made in step 3
 # Modify the static site page 
[`arrow/site/_includes/header.html`|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
 to point to the newly created page instead of the wiki page.

 

  was:
It's hard to find and modify the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 wiki page in Confluence. If it were moved to inside the static web page, that 
would make it easier.

There are two steps to this:
 # Copy the wiki page contents to a new web page at the top "site" level (under 
arrow/site/ just like the [committers 
page|https://github.com/apache/arrow/blob/master/site/committers.html]) Maybe 
named "contributing.html" or something.
 # Modify the [navigation section in 
arrow/site/_includes/header.html|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
 to point to the newly created page instead of the wiki page.

The affected pages are all part of the Jekyll components, so there isn't a need 
to build the Sphinx part of the docs to check your work.
  


> Move "Contributing to Apache Arrow" page to the static docs
> ---
>
> Key: ARROW-4427
> URL: https://issues.apache.org/jira/browse/ARROW-4427
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It's hard to find and modify the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  and other developers' wiki pages in Confluence. If these were moved to 
> inside the project web page, that would make it easier.
> There are 5 steps to this:
>  # Create a new directory inside of `arrow/docs/source` to house the wiki 
> pages. (It will look like the 
> [cpp|https://github.com/apache/arrow/tree/master/docs/source/cpp] or 
> [python|https://github.com/apache/arrow/tree/master/docs/source/python] 
> directories.)
>  # Copy the wiki page contents to new `*.rst` pages inside this new directory.
>  # Add an `index.rst` that links to them all with enough description to help 
> navigation.
>  # Modify the Sphinx index page 
> [`arrow/docs/source/index.rst`|https://github.com/apache/arrow/blob/master/docs/source/index.rst]
>  to have an entry that points to the new index page made in step 3
>  # Modify the static site page 
> [`arrow/site/_includes/header.html`|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
>  to point to the newly created page instead of the wiki page.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4427) Move "Contributing to Apache Arrow" page to the static docs

2019-01-30 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756190#comment-16756190
 ] 

Tanya Schlusser commented on ARROW-4427:


Hoo boy. And all of their child wiki pages.

> Move "Contributing to Apache Arrow" page to the static docs
> ---
>
> Key: ARROW-4427
> URL: https://issues.apache.org/jira/browse/ARROW-4427
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It's hard to find and modify the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  wiki page in Confluence. If it were moved to inside the static web page, 
> that would make it easier.
> There are two steps to this:
>  # Copy the wiki page contents to a new web page at the top "site" level 
> (under arrow/site/ just like the [committers 
> page|https://github.com/apache/arrow/blob/master/site/committers.html]) Maybe 
> named "contributing.html" or something.
>  # Modify the [navigation section in 
> arrow/site/_includes/header.html|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
>  to point to the newly created page instead of the wiki page.
> The affected pages are all part of the Jekyll components, so there isn't a 
> need to build the Sphinx part of the docs to check your work.
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4427) Move "Contributing to Apache Arrow" page to the static docs

2019-01-30 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756188#comment-16756188
 ] 

Tanya Schlusser commented on ARROW-4427:


OK. Am I understanding correctly that a number of the wiki pages should be 
moved, that is, anything not directly related to Jira? So:
 * [Contributing to Apache 
Arrow|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow?src=contextnavpagetreemode]
 * [Guide for Committers and Project 
Maintainers|https://cwiki.apache.org/confluence/display/ARROW/Guide+for+Committers+and+Project+Maintainers]
 * [HDFS Filesystem 
Support|https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support]
 * [How to Verify Release 
Candidates|https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates]
 * [Product 
Requirements|https://cwiki.apache.org/confluence/display/ARROW/Product+requirements]
 (possibly not this one as it's empty)
 * [Release Management 
Guide|https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide]

What do you think of another directory in `arrow/docs/source` where all of the 
listed pages would reside, say `arrow/docs/source/dev` or something similar?

> Move "Contributing to Apache Arrow" page to the static docs
> ---
>
> Key: ARROW-4427
> URL: https://issues.apache.org/jira/browse/ARROW-4427
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It's hard to find and modify the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  wiki page in Confluence. If it were moved to inside the static web page, 
> that would make it easier.
> There are two steps to this:
>  # Copy the wiki page contents to a new web page at the top "site" level 
> (under arrow/site/ just like the [committers 
> page|https://github.com/apache/arrow/blob/master/site/committers.html]) Maybe 
> named "contributing.html" or something.
>  # Modify the [navigation section in 
> arrow/site/_includes/header.html|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
>  to point to the newly created page instead of the wiki page.
> The affected pages are all part of the Jekyll components, so there isn't a 
> need to build the Sphinx part of the docs to check your work.
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README

2019-01-30 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4425:
---
Description: 
It would be nice to link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 -Confluence- page in the Sphinx docs directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 

 

EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site 
(location in repo)|https://github.com/apache/arrow/tree/master/site] would also 
make it easier to find and modify. An additional task  was added to do this.

  was:
It would be nice to link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 -Confluence- page in the Sphinx docs directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 

 

EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site 
(location in repo)|https://github.com/apache/arrow/tree/master/site] would also 
make it easier to find and modify. A sub-task was added to do this.


> Add link to 'Contributing' page in the top-level Arrow README
> -
>
> Key: ARROW-4425
> URL: https://issues.apache.org/jira/browse/ARROW-4425
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It would be nice to link to the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  -Confluence- page in the Sphinx docs directly from the main project 
> [README|https://github.com/apache/arrow/blob/master/README.md] (in the 
> already existing "Getting involved" section) because it's a bit hard to find 
> right now.
> "contributing" page: 
> [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
> main project README: [https://github.com/apache/arrow/blob/master/README.md] 
>  
> EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow 
> site (location in repo)|https://github.com/apache/arrow/tree/master/site] 
> would also make it easier to find and modify. An additional task  was added 
> to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README

2019-01-30 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4425:
---
Description: 
It would be nice to link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 -Confluence page-(*EDIT) in the static docs directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 

 

EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site 
(location in repo)|https://github.com/apache/arrow/tree/master/site] would also 
make it easier to find and modify. An additional task, ARROW-4427, was added to 
do this.

  was:
It would be nice to link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 -Confluence- page in the Sphinx docs directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 

 

EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site 
(location in repo)|https://github.com/apache/arrow/tree/master/site] would also 
make it easier to find and modify. An additional task  was added to do this.


> Add link to 'Contributing' page in the top-level Arrow README
> -
>
> Key: ARROW-4425
> URL: https://issues.apache.org/jira/browse/ARROW-4425
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It would be nice to link to the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  -Confluence page-(*EDIT) in the static docs directly from the main project 
> [README|https://github.com/apache/arrow/blob/master/README.md] (in the 
> already existing "Getting involved" section) because it's a bit hard to find 
> right now.
> "contributing" page: 
> [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
> main project README: [https://github.com/apache/arrow/blob/master/README.md] 
>  
> EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow 
> site (location in repo)|https://github.com/apache/arrow/tree/master/site] 
> would also make it easier to find and modify. An additional task, ARROW-4427, 
> was added to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4427) Move "Contributing to Apache Arrow" page to the static docs

2019-01-30 Thread Tanya Schlusser (JIRA)
Tanya Schlusser created ARROW-4427:
--

 Summary: Move "Contributing to Apache Arrow" page to the static 
docs
 Key: ARROW-4427
 URL: https://issues.apache.org/jira/browse/ARROW-4427
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Tanya Schlusser


It's hard to find and modify the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 wiki page in Confluence. If it were moved to inside the static web page, that 
would make it easier.

There are two steps to this:
 # Copy the wiki page contents to a new web page at the top "site" level (under 
arrow/site/ just like the [committers 
page|https://github.com/apache/arrow/blob/master/site/committers.html]) Maybe 
named "contributing.html" or something.
 # Modify the [navigation section in 
arrow/site/_includes/header.html|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
 to point to the newly created page instead of the wiki page.

The affected pages are all part of the Jekyll components, so there isn't a need 
to build the Sphinx part of the docs to check your work.
  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README

2019-01-30 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4425:
---
Description: 
It would be nice to link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 -Confluence- page in the Sphinx docs directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 

 

EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow site 
(location in repo)|https://github.com/apache/arrow/tree/master/site] would also 
make it easier to find and modify. A sub-task was added to do this.

  was:
It would be nice to add a link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 Confluence page directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 


> Add link to 'Contributing' page in the top-level Arrow README
> -
>
> Key: ARROW-4425
> URL: https://issues.apache.org/jira/browse/ARROW-4425
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It would be nice to link to the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  -Confluence- page in the Sphinx docs directly from the main project 
> [README|https://github.com/apache/arrow/blob/master/README.md] (in the 
> already existing "Getting involved" section) because it's a bit hard to find 
> right now.
> "contributing" page: 
> [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
> main project README: [https://github.com/apache/arrow/blob/master/README.md] 
>  
> EDIT: Moving the "Contributing" wiki to a static page in the actual [Arrow 
> site (location in repo)|https://github.com/apache/arrow/tree/master/site] 
> would also make it easier to find and modify. A sub-task was added to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README

2019-01-30 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756158#comment-16756158
 ] 

Tanya Schlusser commented on ARROW-4425:


Fair statement. Confluence is really hard for me to navigate. Updating and 
adding a sub-task.

> Add link to 'Contributing' page in the top-level Arrow README
> -
>
> Key: ARROW-4425
> URL: https://issues.apache.org/jira/browse/ARROW-4425
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It would be nice to add a link to the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  Confluence page directly from the main project 
> [README|https://github.com/apache/arrow/blob/master/README.md] (in the 
> already existing "Getting involved" section) because it's a bit hard to find 
> right now.
> "contributing" page: 
> [https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
> main project README: [https://github.com/apache/arrow/blob/master/README.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README

2019-01-30 Thread Tanya Schlusser (JIRA)
Tanya Schlusser created ARROW-4425:
--

 Summary: Add link to 'Contributing' page in the top-level Arrow 
README
 Key: ARROW-4425
 URL: https://issues.apache.org/jira/browse/ARROW-4425
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Tanya Schlusser


It would be nice to add a link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 Confluence page directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755570#comment-16755570
 ] 

Tanya Schlusser commented on ARROW-4313:


I think part of this was to allow anybody to contribute benchmarks from their 
own machine. And while dedicated benchmarking machines like the ones you will 
set up will have all parameters set for optimal benchmarking, benchmarks run on 
other machines may give different results. Collecting details about the machine 
that might explain those differences (in case someone cares to explore the 
dataset) is part of the goal of the data model.

One concern, of course, is that people may get wildly different results than a 
published benchmark and conclude, "Oh boo–the representative person from the 
company made fake results that I can't replicate on my machine." With details 
about a system on record, such performance differences can be traced back to 
differences in setup.

Not all fields need to be filled out all the time. My priorities are:
 # Identifying which fields are flat-out wrong
 # Differentiating between necessary columns and extraneous ones that can be 
left null


To me, it is not a big deal to have an extra column dangling around that almost 
nobody uses. No harm. (Unless it's mislabeled or otherwise wrong; that's what 
I'm interested in getting out of the discussion here.)

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755509#comment-16755509
 ] 

Tanya Schlusser commented on ARROW-4313:


[~aregm] I do not know. I am depending on the other people commenting here to 
make sure the hardware tables make sense; honestly, I never pay attention to 
hardware because my use cases never stress my system. At one point Wes 
suggested it. I am glad there is a debate.

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: benchmark-data-model.erdplus

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: (was: benchmark-data-model.png)

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755504#comment-16755504
 ] 

Tanya Schlusser commented on ARROW-4313:


Thank you very much for everyone's detailed feedback. I absolutely need 
guidance with the Machine / CPU / GPU specs. I have updated the 
[^benchmark-data-model.png] and the [^benchmark-data-model.erdplus], and added 
all of the recommended columns.

 

*Summary of changes:*
 * All the dimension tables have been renamed to exclude the `_dim`. (It was to 
distinguish dimension vs. fact tables.)

 * `cpu`
 ** Added a `cpu_thread_count`. 
 ** Changed `cpu.speed_Hz` to two columns, `frequency_max_Hz` and 
`frequency_min_Hz`, and also added a column `machine.overclock_frequency_Hz` to 
the `machine` table to allow for overclocking, as Wes mentioned in the 
beginning.

 * `os`
 ** Added both `os.architecture_name` and `os.architecture_bits`, the latter 
forced to be in \{32, 64}, and pulled from the architecture name (maybe it will 
become just a computed column in the joined view...). I think it's a good idea.

 * `project`
 ** Added a `project.project_name` (oversight before)

 * `benchmark_language`
 ** Split `language` into `language_name` and `language_version`, because 
people may want to compare across versions (e.g. Python 2.7 vs. 3.5+)

 * `environment`
 ** Removed foreign key for `machine_id` — that should be in the benchmark 
report separately. Many machines will have the same environment.

 * `benchmark`
 ** Added foreign key for `benchmark_language_id`—a benchmark with the same 
name may exist for different languages.
 ** Added foreign key for `project_id`—moved it from table `benchmark_result`

 * `benchmark_result`
 ** Added foreign key for `machine_id` (was removed from `environment`)
 ** Deleted foreign key for `project_id`, placing it in `benchmark` (as stated 
above)

*Questions*
 * `cpu` and `gpu` dimension
 ** Is it a mistake to make `cpu.cpu_model_name` unique? That is, are the 
L1/L2/L3 cache sizes, core counts, or any other attribute ever different for 
the same CPU model string?
 ** The same question for GPU.
 ** I have commented the columns to say that `cpu_thread_count` corresponds to 
`sysctl -n hw.logicalcpu` and `cpu_core_count` corresponds to `sysctl -n 
hw.physicalcpu` (see the sketch after this list); corrections gratefully 
accepted.
 ** Would it be less confusing to make the column names the exact same strings 
as their `sysctl` keys, e.g. change `cpu.cpu_model_name` to 
`cpu.cpu_brand_string` to match the output of `sysctl -n 
machdep.cpu.brand_string`?
 ** On that note, is CPU RAM the same thing as `sysctl -n 
machdep.cpu.cache.size`?
 * `environment`
 ** I'm worried I'm doing something inelegant with the dependency list. It will 
hold everything: Conda / virtualenv, versions of NumPy, and all permutations of 
the various dependencies that make up what ASV calls the dependency matrix.
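
Below is a minimal sketch (macOS only, using {{sysctl}} keys I know exist) of 
how a `cpu` row could be filled in. The cache-size column names marked "(?)" 
are my guesses at the draft schema, not settled names.

{code:python}
import subprocess

def sysctl(key):
    """Return one sysctl value as a stripped string (macOS only)."""
    return subprocess.check_output(["sysctl", "-n", key]).decode().strip()

cpu_row = {
    "cpu_model_name": sysctl("machdep.cpu.brand_string"),
    "cpu_core_count": int(sysctl("hw.physicalcpu")),
    "cpu_thread_count": int(sysctl("hw.logicalcpu")),
    "frequency_max_Hz": int(sysctl("hw.cpufrequency_max")),
    "l1_cache_size": int(sysctl("hw.l1dcachesize")),  # (?)
    "l2_cache_size": int(sysctl("hw.l2cachesize")),   # (?)
    "l3_cache_size": int(sysctl("hw.l3cachesize")),   # (?)
}
print(cpu_row)
{code}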

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: (was: benchmark-data-model.erdplus)

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: benchmark-data-model.png

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4354) Explore Codespeed feasibility and ease of customization

2019-01-27 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753511#comment-16753511
 ] 

Tanya Schlusser commented on ARROW-4354:


I have attached a drawing of the codespeed data model, 
[^codespeed-data-model.png].
 * Codespeed provides a data model and a web UI.
 * ASV provides a benchmark framework, a file-directory-based data model, and a 
static frontend.

Both have a lot in common, and both would require revision to work with the 
additional machine specifications (GPU; CPU cache sizes) and multiple benchmark 
languages we are interested in.

From a web-service perspective there is no benefit to using ASV, because with 
a database we will need an API, a web interface, or both, and either would 
require some front-end work; I'm ambivalent. Anyway, we can think about that 
later.

Once we have decided on a data model and spun up the database, it may be nice 
to enable interaction with the database via HTTP. I am interested in exploring 
[postgraphile|https://www.graphile.org/postgraphile/] for an API; it literally 
parses the Postgres public schema and presents it as a GraphQL interface with 
no additional work, and has an existing build ready for [AWS 
lambda|https://github.com/graphile/postgraphile-lambda-example]. Then, all the 
data manipulation could happen directly in the database. There is a {{--cors}} 
command-line option to enable use from a static site if we go that route. 
Codespeed also provides (I think) a JSON REST API, so, again, the only 
question is whether we care to have a separate repo for a dynamic webpage.
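
To make that concrete, here is a hedged sketch of what querying such an API 
could look like. The endpoint and port are PostGraphile's defaults; the 
connection name `allBenchmarkResults` and its fields are hypothetical and 
would really be derived by PostGraphile from the final table definitions.

{code:python}
import json
import urllib.request

# Hypothetical query; PostGraphile generates the real field names
# from the Postgres schema.
query = """
{
  allBenchmarkResults(first: 5) {
    nodes { benchmarkId machineId benchmarkValue }
  }
}
"""

req = urllib.request.Request(
    "http://localhost:5000/graphql",  # PostGraphile's default endpoint
    data=json.dumps({"query": query}).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.load(urllib.request.urlopen(req)))
{code}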

> Explore Codespeed feasibility and ease of customization
> ---
>
> Key: ARROW-4354
> URL: https://issues.apache.org/jira/browse/ARROW-4354
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Areg Melik-Adamyan
>Assignee: Tanya Schlusser
>Priority: Major
>  Labels: performance
> Attachments: codespeed-data-model.png
>
>
> @Tanya Schlusser can you please explore this option and report out?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4354) Explore Codespeed feasibility and ease of customization

2019-01-27 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4354:
---
Attachment: codespeed-data-model.png

> Explore Codespeed feasibility and ease of customization
> ---
>
> Key: ARROW-4354
> URL: https://issues.apache.org/jira/browse/ARROW-4354
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Areg Melik-Adamyan
>Assignee: Tanya Schlusser
>Priority: Major
>  Labels: performance
> Attachments: codespeed-data-model.png
>
>
> @Tanya Schlusser can you please explore this option and report out?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-26 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753301#comment-16753301
 ] 

Tanya Schlusser commented on ARROW-4313:


I've attached a diagram 
[benchmark-data-model.png|https://issues.apache.org/jira/secure/attachment/12956481/benchmark-data-model.png]
 and a corresponding {{.erdplus}} file 
[benchmark-data-model.erdplus|https://issues.apache.org/jira/secure/attachment/12956482/benchmark-data-model.erdplus]
 (JSON--viewable and editable by getting a free account on 
[erdplus.com|https://erdplus.com/#/]) with a draft data model for everyone's 
consideration.  I tried to incorporate elements of both the codespeed and the 
ASV projects.

Happy to modify per feedback—or leave this to a more experienced person if I'm 
becoming the slow link.
Of course there will be a view with all of the relevant information joined.

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-26 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: benchmark-data-model.png

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-26 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: benchmark-data-model.erdplus

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4354) Explore Codespeed feasibility and ease of customization

2019-01-25 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752464#comment-16752464
 ] 

Tanya Schlusser edited comment on ARROW-4354 at 1/25/19 5:15 PM:
-

Thank you for the clarifications, Wes and Areg :)!


was (Author: tanya):
Thank you for the clarifications, Wes and Arek :)!

> Explore Codespeed feasibility and ease of customization
> ---
>
> Key: ARROW-4354
> URL: https://issues.apache.org/jira/browse/ARROW-4354
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Areg Melik-Adamyan
>Assignee: Tanya Schlusser
>Priority: Major
>  Labels: performance
>
> @Tanya Schlusser can you please explore this option and report out?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4354) Explore Codespeed feasibility and ease of customization

2019-01-25 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752464#comment-16752464
 ] 

Tanya Schlusser commented on ARROW-4354:


Thank you for the clarifications, Wes and Arek :)!

> Explore Codespeed feasibility and ease of customization
> ---
>
> Key: ARROW-4354
> URL: https://issues.apache.org/jira/browse/ARROW-4354
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Areg Melik-Adamyan
>Assignee: Tanya Schlusser
>Priority: Major
>  Labels: performance
>
> @Tanya Schlusser can you please explore this option and report out?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4354) Explore Codespeed feasibility and ease of customization

2019-01-24 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751203#comment-16751203
 ] 

Tanya Schlusser commented on ARROW-4354:


Codespeed is nice too. [Link to codespeed 
repo|https://github.com/tobami/codespeed].

However, it looks like Codespeed is licensed under LGPL v2.1, and I believe 
LGPL v3 is the first version that is compatible with the Apache license; that 
only means their codebase can't live inside the Arrow codebase...maybe not a 
big deal. I agree the backend is nice, and simple is a good thing.

> Explore Codespeed feasibility and ease of customization
> ---
>
> Key: ARROW-4354
> URL: https://issues.apache.org/jira/browse/ARROW-4354
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Areg Melik-Adamyan
>Priority: Major
>  Labels: performance
>
> @Tanya Schlusser can you please explore this option and report out?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4354) Explore Codespeed feasibility and ease of customization

2019-01-24 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751189#comment-16751189
 ] 

Tanya Schlusser commented on ARROW-4354:


Nice! One of the contributors on this project, Antoine Pitrou (didn't `at` him 
but he has commented on ARROW-4313), has contributed to [Airspeed Velocity 
(ASV)|https://github.com/airspeed-velocity/asv], which I have been looking at 
too, and which informed his initial comments on [the mailing 
list|https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E].
 Here are benchmarks for a [bunch of pandas-related 
projects|https://pandas.pydata.org/speed/] using Airspeed Velocity.

Maybe we can take the best of both worlds, and use the database schema from 
Codespeed and the mostly static components of ASV. I am very impressed with the 
functionality of Airspeed Velocity.

> Explore Codespeed feasibility and ease of customization
> ---
>
> Key: ARROW-4354
> URL: https://issues.apache.org/jira/browse/ARROW-4354
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Areg Melik-Adamyan
>Priority: Major
>  Labels: performance
>
> @Tanya Schlusser can you please explore this option and report out?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-21 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748260#comment-16748260
 ] 

Tanya Schlusser commented on ARROW-4313:


Pinging [~aregm], who started the email discussion, and volunteering to help in 
whatever ways I can 👋. I said I'd mock up a backend and will edit this comment 
with a hyperlink when the mock is up.

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3324) [Python] Users reporting memory leaks using pa.pq.ParquetDataset

2018-12-26 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729131#comment-16729131
 ] 

Tanya Schlusser commented on ARROW-3324:


The file 
[arrow_3324_leak_on_write.py|https://issues.apache.org/jira/secure/attachment/12953078/arrow_3324_leak_on_write.py]
 contains a modified version of the stackoverflow code from Wes's comment 
above, with {{memory_profiler}} added to show memory use. Memory use does 
increase as the code cycles through multiple calls to {{write_table}}.
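
For reference, a hedged sketch of the measurement approach (not the attached 
file itself; the path and sizes here are made up for illustration):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile

@profile  # prints a per-line memory report when the function returns
def write_many(path="/tmp/arrow_3324_demo.parquet", n_cycles=20):
    table = pa.Table.from_pandas(pd.DataFrame({"x": range(100000)}))
    writer = pq.ParquetWriter(path, table.schema)
    for _ in range(n_cycles):
        writer.write_table(table)  # memory grows across these calls
    writer.close()

if __name__ == "__main__":
    write_many()
{code}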

> [Python] Users reporting memory leaks using pa.pq.ParquetDataset
> 
>
> Key: ARROW-3324
> URL: https://issues.apache.org/jira/browse/ARROW-3324
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
> Attachments: arrow_3324_leak_on_write.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See:
> * https://github.com/apache/arrow/issues/2614
> * https://github.com/apache/arrow/issues/2624



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3324) [Python] Users reporting memory leaks using pa.pq.ParquetDataset

2018-12-26 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-3324:
---
Attachment: arrow_3324_leak_on_write.py

> [Python] Users reporting memory leaks using pa.pq.ParquetDataset
> 
>
> Key: ARROW-3324
> URL: https://issues.apache.org/jira/browse/ARROW-3324
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
> Attachments: arrow_3324_leak_on_write.py
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> See:
> * https://github.com/apache/arrow/issues/2614
> * https://github.com/apache/arrow/issues/2624



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3324) [Python] Users reporting memory leaks using pa.pq.ParquetDataset

2018-12-25 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728819#comment-16728819
 ] 

Tanya Schlusser commented on ARROW-3324:


I could not reproduce either of the two GitHub issues above, but could identify 
a leak using {{memory_profiler}} on the stackoverflow code (copied from 
[this|https://github.com/apache/arrow/blob/master/python/scripts/test_leak.py]).

I observed that {{FileSerializer.properties_.use_count()}} increments more than 
expected whenever {{FileSerializer.AppendRowGroup}} is called. The offending 
line is {{FileSerializer.metadata_->AppendRowGroup()}}. I believe that the 
count should only go up once per new row group, instead of once per column plus 
once per row group.

I think the root cause is that in 
{{RowGroupMetaDataBuilder::RowGroupMetaDataBuilderImpl.Finish}}, the vector of 
{{column_builders_}} ought to be reset and cleared each time before it is 
repopulated. I hope to submit a pull request for this even though it may not 
address all of the issues stated here. Since the GitHub issues were about 
memory leaks on "read", and the fix is related only to "write", this 
observation certainly doesn't address everything in this JIRA issue.

Even after the fix I'll post, my memory_profiler code still shows an increase 
in memory use upon additional calls to {{pq.ParquetWriter.write_table}}, which 
I think is OK because the row-group count also increments with each write. So I 
may be wrong or have still missed something. Regardless, I hope these notes are 
useful to someone.
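
As a hedged illustration of that last point (illustrative path, not the 
profiling script itself): the file metadata keeps one entry per row group, so 
some growth per write is expected even without a leak.

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays([pa.array(range(1000))], ["x"])
writer = pq.ParquetWriter("/tmp/rowgroup_growth.parquet", table.schema)
for _ in range(5):
    writer.write_table(table)  # each call appends at least one row group
writer.close()

print(pq.ParquetFile("/tmp/rowgroup_growth.parquet").metadata.num_row_groups)
{code}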

> [Python] Users reporting memory leaks using pa.pq.ParquetDataset
> 
>
> Key: ARROW-3324
> URL: https://issues.apache.org/jira/browse/ARROW-3324
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> See:
> * https://github.com/apache/arrow/issues/2614
> * https://github.com/apache/arrow/issues/2624



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4050) core dump on reading parquet file

2018-12-22 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16727509#comment-16727509
 ] 

Tanya Schlusser commented on ARROW-4050:


Hi [~cav71], maybe I can be useful.

You're right that arrow (cpp) builds a library called {{libarrow_python}} which 
exposes the parts of arrow that the Python library will use. That is the first 
step, run with {{cmake}} inside the directory {{arrow/cpp/build}}.

But to make the Python library there must also be a second step, run inside 
{{arrow/python}}: the pyarrow library uses Cython (I am learning Cython -- this 
[rectangle 
example|https://cython.readthedocs.io/en/latest/src/userguide/wrapping_CPlusPlus.html]
 was helpful) to wrap all of these exposed objects in Python for the end user.

h6. details / example:
The 
[pyarrow.__init__.py|https://github.com/apache/arrow/blob/master/python/pyarrow/__init__.py]
 imports a ton of stuff from {{pyarrow.lib}}. But there is no 
{{pyarrow/lib.py}} file in the source code. Instead, there are
* {{pyarrow/lib.pxd}} (corresponds to a C++ header file)
* {{pyarrow/lib.pyx}} (corresponds to a C++ source file)

which must be compiled using Cython. Running {{setup.py build_ext --inplace}} 
uses Cython to
# auto-generate C++ code ({{pyarrow/lib.cpp}}, {{pyarrow/lib_api.h}})
# compile it to a shared object (on my laptop, 
{{pyarrow/lib.cpython-36m-darwin.so}})

That shared object is the {{pyarrow.lib}} imported in {{pyarrow/__init__.py}}.
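
A quick hedged way to see this for yourself (the exact suffix of the shared 
object will differ by platform and Python version):

{code:python}
import pyarrow.lib

# Not a .py module: the path ends in a compiled-extension suffix,
# e.g. pyarrow/lib.cpython-36m-darwin.so on my laptop.
print(pyarrow.lib.__file__)
{code}
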
I hope it is useful!

P.S. The [script linked 
above|https://issues.apache.org/jira/secure/attachment/12952061/working_python37_build_on_osx.sh]
 successfully built the code on my laptop

> core dump on reading parquet file
> -
>
> Key: ARROW-4050
> URL: https://issues.apache.org/jira/browse/ARROW-4050
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antonio Cavallo
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: bug.parquet, working_python37_build_on_osx.sh
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hi,
> I've a crash when doing this:
> {{import pyarrow.parquet as pq}}
> {{pq.read_table('bug.parquet')}}
> [^bug.parquet]
> (this is the same generated by 
> arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3020) [Python] Addition of option to allow empty Parquet row groups

2018-12-20 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726349#comment-16726349
 ] 

Tanya Schlusser commented on ARROW-3020:


I looked into this and do not believe the Parquet code permits this at the 
moment despite the comment in the OP's hyperlink saying they thought it did. 
pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and 
here's the current code (also [linked 
here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]).
 

{code:title=from "parquet/arrow/writer.h"}
Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
chunk_size = impl_->properties().max_row_group_length();
  }


  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
int64_t offset = chunk * chunk_size;
int64_t size = std::min(chunk_size, table.num_rows() - offset);


RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
for (int i = 0; i < table.num_columns(); i++) {
  auto chunked_data = table.column(i)->data();
  RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
 PARQUET_IGNORE_NOT_OK(Close()));
}
  }
  return Status::OK();
}
{code}
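
A hedged sketch of the consequence, with an illustrative file path (the 
behavior follows from the loop condition above: a zero-row table never enters 
the loop, so no row group is written):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("x", pa.int64())])
empty = pa.Table.from_arrays([pa.array([], type=pa.int64())], ["x"])

writer = pq.ParquetWriter("/tmp/empty_groups.parquet", schema)
writer.write_table(empty)  # num_rows == 0, so the chunk loop never runs
writer.close()

meta = pq.ParquetFile("/tmp/empty_groups.parquet").metadata
print(meta.num_row_groups)  # 0 with the code above -- the group is dropped
{code}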

> [Python] Addition of option to allow empty Parquet row groups
> -
>
> Key: ARROW-3020
> URL: https://issues.apache.org/jira/browse/ARROW-3020
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Alex Mendelson
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> While our use case is not common, I was able to find one related request from 
> roughly a year ago. Could this be added as a feature?
> https://issues.apache.org/jira/browse/PARQUET-1047
> *Motivation*
> We have an application where each row is associated with one of N contexts, 
> though a minority of contexts may have no associated rows. When encountering 
> the Nth context, we will wish to retrieve all the associated rows. Row groups 
> would provide a natural way to index the data, as the nth context could 
> naturally relate to the nth row group.
> Unfortunately, this is not possible at the present time, as pyarrow does not 
> support writing empty row groups. If one writes a pyarrow.Table containing 
> zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final 
> file, and this distorts the indexing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3020) [Python] Addition of option to allow empty Parquet row groups

2018-12-20 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726349#comment-16726349
 ] 

Tanya Schlusser edited comment on ARROW-3020 at 12/21/18 12:34 AM:
---

I looked into this and do not believe the Parquet code permits this at the 
moment despite the comment in the OP's hyperlink saying they thought it did. 
pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and 
here's the current code (also [linked 
here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]).
 If {{table.num_rows()}} is zero, nothing will ever happen.

{code:title=from "parquet/arrow/writer.cc"}
Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
chunk_size = impl_->properties().max_row_group_length();
  }


  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
int64_t offset = chunk * chunk_size;
int64_t size = std::min(chunk_size, table.num_rows() - offset);


RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
for (int i = 0; i < table.num_columns(); i++) {
  auto chunked_data = table.column(i)->data();
  RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
 PARQUET_IGNORE_NOT_OK(Close()));
}
  }
  return Status::OK();
}
{code}


was (Author: tanya):
I looked into this and do not believe the Parquet code permits this at the 
moment despite the comment in the OP's hyperlink saying they thought it did. 
pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and 
here's the current code (also [linked 
here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]).
 If {{table.num_rows()}} is zero nothing will ever happen.

{code:title=from "parquet/arrow/writer.h"}
Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
chunk_size = impl_->properties().max_row_group_length();
  }


  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
int64_t offset = chunk * chunk_size;
int64_t size = std::min(chunk_size, table.num_rows() - offset);


RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
for (int i = 0; i < table.num_columns(); i++) {
  auto chunked_data = table.column(i)->data();
  RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
 PARQUET_IGNORE_NOT_OK(Close()));
}
  }
  return Status::OK();
}
{code}

> [Python] Addition of option to allow empty Parquet row groups
> -
>
> Key: ARROW-3020
> URL: https://issues.apache.org/jira/browse/ARROW-3020
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Alex Mendelson
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> While our use case is not common, I was able to find one related request from 
> roughly a year ago. Could this be added as a feature?
> https://issues.apache.org/jira/browse/PARQUET-1047
> *Motivation*
> We have an application where each row is associated with one of N contexts, 
> though a minority of contexts may have no associated rows. When encountering 
> the Nth context, we will wish to retrieve all the associated rows. Row groups 
> would provide a natural way to index the data, as the nth context could 
> naturally relate to the nth row group.
> Unfortunately, this is not possible at the present time, as pyarrow does not 
> support writing empty row groups. If one writes a pyarrow.Table containing 
> zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final 
> file, and this distorts the indexing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3020) [Python] Addition of option to allow empty Parquet row groups

2018-12-20 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726349#comment-16726349
 ] 

Tanya Schlusser edited comment on ARROW-3020 at 12/21/18 12:32 AM:
---

I looked into this and do not believe the Parquet code permits this at the 
moment despite the comment in the OP's hyperlink saying they thought it did. 
pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and 
here's the current code (also [linked 
here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]).
 If {{table.num_rows()}} is zero nothing will ever happen.

{code:title=from "parquet/arrow/writer.h"}
Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
chunk_size = impl_->properties().max_row_group_length();
  }


  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
int64_t offset = chunk * chunk_size;
int64_t size = std::min(chunk_size, table.num_rows() - offset);


RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
for (int i = 0; i < table.num_columns(); i++) {
  auto chunked_data = table.column(i)->data();
  RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
 PARQUET_IGNORE_NOT_OK(Close()));
}
  }
  return Status::OK();
}
{code}


was (Author: tanya):
I looked into this and do not believe the Parquet code permits this at the 
moment despite the comment in the OP's hyperlink saying they thought it did. 
pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and 
here's the current code (also [linked 
here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]).
 

{code:title=from "parquet/arrow/writer.h"}
Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
chunk_size = impl_->properties().max_row_group_length();
  }


  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
int64_t offset = chunk * chunk_size;
int64_t size = std::min(chunk_size, table.num_rows() - offset);


RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
for (int i = 0; i < table.num_columns(); i++) {
  auto chunked_data = table.column(i)->data();
  RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
 PARQUET_IGNORE_NOT_OK(Close()));
}
  }
  return Status::OK();
}
{code}

> [Python] Addition of option to allow empty Parquet row groups
> -
>
> Key: ARROW-3020
> URL: https://issues.apache.org/jira/browse/ARROW-3020
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Alex Mendelson
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> While our use case is not common, I was able to find one related request from 
> roughly a year ago. Could this be added as a feature?
> https://issues.apache.org/jira/browse/PARQUET-1047
> *Motivation*
> We have an application where each row is associated with one of N contexts, 
> though a minority of contexts may have no associated rows. When encountering 
> the Nth context, we will wish to retrieve all the associated rows. Row groups 
> would provide a natural way to index the data, as the nth context could 
> naturally relate to the nth row group.
> Unfortunately, this is not possible at the present time, as pyarrow does not 
> support writing empty row groups. If one writes a pyarrow.Table containing 
> zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final 
> file, and this distorts the indexing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4056) [C++] boost-cpp toolchain packages causing crashes on Xcode > 6.4

2018-12-18 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724160#comment-16724160
 ] 

Tanya Schlusser commented on ARROW-4056:


Not sure if it's useful, but adding a link to the Anaconda docs about their 
toolchain

https://conda.io/docs/user-guide/tasks/build-packages/compiler-tools.html#using-the-compiler-packages

> [C++] boost-cpp toolchain packages causing crashes on Xcode > 6.4
> -
>
> Key: ARROW-4056
> URL: https://issues.apache.org/jira/browse/ARROW-4056
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> EDIT: the issue has been present for a large portion of 2018. I found this 
> when merging the macOS C++ builds and changed the build type to Xcode 8.3:
> https://travis-ci.org/wesm/arrow/jobs/469297420#L2856
> I reported the issue into conda-forge at 
> https://github.com/conda-forge/boost-cpp-feedstock/issues/40
> It seems that the Ray project worked around this earlier this year: 
> https://github.com/ray-project/ray/pull/1688



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4050) core dump on reading parquet file

2018-12-17 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723180#comment-16723180
 ] 

Tanya Schlusser edited comment on ARROW-4050 at 12/17/18 5:26 PM:
--

I can confirm building the environment with {{boost-cpp=1.68.0}} works. I also 
had to add {{make}} to successfully build {{jemalloc_ep}}, although without it 
the build and test still worked (the 'install' step right after it worked 
anyway; it was just disconcerting to have a failed step). I uploaded the entire 
sequence of commands that produces a successful python test in 
[{{working_python37_build_on_osx.sh}}|https://issues.apache.org/jira/secure/attachment/12952061/working_python37_build_on_osx.sh].

There are [numpy empty truth test deprecation 
warnings|https://github.com/numpy/numpy/issues/9583] but that's it. Hope it 
helps.


was (Author: tanya):
I can confirm building the environment with {{boost-cpp=1.68.0}} works. I also 
had to add {{make}} to successfully build {{jemalloc_ep}}, although without it 
the build and test still worked (the 'install' step right after it worked 
anyway; it was just disconcerting to have a failed step). I uploaded the entire 
sequence of commands that produces a successful python test in 
{{working_python37_build_on_osx.sh}}.

There are [numpy empty truth test deprecation 
warnings|https://github.com/numpy/numpy/issues/9583] but that's it. Hope it 
helps.

> core dump on reading parquet file
> -
>
> Key: ARROW-4050
> URL: https://issues.apache.org/jira/browse/ARROW-4050
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antonio Cavallo
>Priority: Blocker
> Attachments: bug.parquet, working_python37_build_on_osx.sh
>
>
> Hi,
> I've a crash when doing this:
> {{import pyarrow.parquet as pq}}
> {{pq.read_table('bug.parquet')}}
> [^bug.parquet]
> (this is the same generated by 
> arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4050) core dump on reading parquet file

2018-12-17 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723180#comment-16723180
 ] 

Tanya Schlusser commented on ARROW-4050:


I can confirm building the environment with {{boost-cpp=1.68.0}} works. I also 
had to add {{make}} to successfully build {{jemalloc_ep}}, although without it 
the build and test still worked (the 'install' step right after it worked 
anyway; it was just disconcerting to have a failed step). I uploaded the entire 
sequence of commands that produces a successful python test in 
{{working_python37_build_on_osx.sh}}.

There are [numpy empty truth test deprecation 
warnings|https://github.com/numpy/numpy/issues/9583] but that's it. Hope it 
helps.

> core dump on reading parquet file
> -
>
> Key: ARROW-4050
> URL: https://issues.apache.org/jira/browse/ARROW-4050
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antonio Cavallo
>Priority: Blocker
> Attachments: bug.parquet, working_python37_build_on_osx.sh
>
>
> Hi,
> I've a crash when doing this:
> {{import pyarrow.parquet as pq}}
> {{pq.read_table('bug.parquet')}}
> [^bug.parquet]
> (this is the same generated by 
> arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4050) core dump on reading parquet file

2018-12-17 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4050:
---
Attachment: working_python37_build_on_osx.sh

> core dump on reading parquet file
> -
>
> Key: ARROW-4050
> URL: https://issues.apache.org/jira/browse/ARROW-4050
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antonio Cavallo
>Priority: Blocker
> Attachments: bug.parquet, working_python37_build_on_osx.sh
>
>
> Hi,
> I've a crash when doing this:
> {{import pyarrow.parquet as pq}}
> {{pq.read_table('bug.parquet')}}
> [^bug.parquet]
> (this is the same generated by 
> arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4050) core dump on reading parquet file

2018-12-17 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722929#comment-16722929
 ] 

Tanya Schlusser commented on ARROW-4050:


Hello [~cav71], I may be able to help – I'm new enough that I just went through 
the pain of setting up my environment too, and better yet, my system sounds 
like yours: I have a Mac with Xcode 10.1.

I did the things you said: followed the documentation in a new Conda 
environment, and indeed got a segfault in the python parquet tests.

However, I can switch to a separate Conda environment and build and test just 
fine. I am currently going through my env to see what is different between the 
two and will report back when I figure out the relevant difference. I had 
trouble setting up too, but was too shy to speak up about it, and clearly the 
documentation could be improved – at least for us Mac users!

If you want to try and figure it out too, the thing I did to get stuff working 
was read through the Python dockerfile and the scripts in arrow/dev and 
arrow/ci. The problem is I tried 1000 things and didn't pay attention to which 
one worked, or I'd answer this with more useful information.

> core dump on reading parquet file
> -
>
> Key: ARROW-4050
> URL: https://issues.apache.org/jira/browse/ARROW-4050
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antonio Cavallo
>Priority: Blocker
> Attachments: bug.parquet
>
>
> Hi,
> I've a crash when doing this:
> {{import pyarrow.parquet as pq}}
> {{pq.read_table('bug.parquet')}}
> [^bug.parquet]
> (this is the same generated by 
> arrow/python/pyarrow/tests/test_parquet.py(112)test_single_pylist_column_roundtrip())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4039) Update link to 'development.rst' page from Python README.md

2018-12-15 Thread Tanya Schlusser (JIRA)
Tanya Schlusser created ARROW-4039:
--

 Summary: Update link to 'development.rst' page from Python 
README.md
 Key: ARROW-4039
 URL: https://issues.apache.org/jira/browse/ARROW-4039
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation, Python
Reporter: Tanya Schlusser


When the Sphinx docs were restructured, the link in the 
[README|https://github.com/apache/arrow/blob/master/python/README.md]  changed 
from

[https://github.com/apache/arrow/blob/master/python/doc/source/development.rst]

to

[https://github.com/apache/arrow/blob/master/docs/source/python/development.rst]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3230) [Python] Missing comparisons on ChunkedArray, Table

2018-12-15 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722174#comment-16722174
 ] 

Tanya Schlusser commented on ARROW-3230:


Woo, this looks like my level! Thank you [~kszucs], I will try it.

> [Python] Missing comparisons on ChunkedArray, Table
> ---
>
> Key: ARROW-3230
> URL: https://issues.apache.org/jira/browse/ARROW-3230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.13.0
>
>
> Table and ChunkedArray equality are not implemented, meaning they fall back 
> on identity. Instead they should invoke equals(), as on Column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3230) [Python] Missing comparisons on ChunkedArray, Table

2018-12-15 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser reassigned ARROW-3230:
--

Assignee: Tanya Schlusser

> [Python] Missing comparisons on ChunkedArray, Table
> ---
>
> Key: ARROW-3230
> URL: https://issues.apache.org/jira/browse/ARROW-3230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Antoine Pitrou
>Assignee: Tanya Schlusser
>Priority: Major
> Fix For: 0.13.0
>
>
> Table and ChunkedArray equality are not implemented, meaning they fall back 
> on identity. Instead they should invoke equals(), as on Column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python

2018-12-12 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719488#comment-16719488
 ] 

Tanya Schlusser commented on ARROW-3543:


I can confirm this bug is still present. The most recent commit in my pull is

c0ac97f126c98fb29e81d6544adfea9d4ab74aff

For others, the R libraries needed to re-run Olaf's code (in addition to arrow) 
are readr, dplyr, and lubridate.

I will mess around, but won't be hurt if a stronger R coder takes this before I 
finish.

> [R] Time zone adjustment issue when reading Feather file written by Python
> --
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Olaf
>Priority: Critical
> Fix For: 0.12.0
>
>
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some 
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
> {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
> pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
> 14:01:02.200')]}
> )
> df['timestamp_est'] = 
> pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
>  Out[17]: 
>  string_time_utc timestamp_est
>  0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
>  1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamp of my original timestamps (in 
> `UTC` time).
> Now saving the dataframe to `csv` or to `feather` will generate two 
> completely different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R 
> thinks my timezone is `UTC` by default, and wrongly attached this timezone to 
> `timestamp_est`. No big deal, I can always use `with_tz` or even better: 
> import as character and process as timestamp while in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
>  Parsed with column specification:
>  cols(
>  X1 = col_integer(),
>  string_time_utc = col_datetime(format = ""),
>  timestamp_est = col_datetime(format = "")
>  )
>  Warning message:
>  Missing column names filled in: 'X1' [1] 
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 4
>  X1 string_time_utc timestamp_est 
> 
>  1 0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530
>  2 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  3 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
>  mytimezone
>   
>  1 UTC 
>  2 UTC 
>  3 UTC  {code}
> {code:java}
> #Now look at what happens with feather:
>  
>  > dataframe <- read_feather('P://testing.feather')
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 3
>  string_time_utc timestamp_est mytimezone
> 
>  1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 "" 
>  2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 "" 
>  3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 "" {code}
> My timestamps have been converted!!! pure insanity. 
>  Am I missing something here?
> Thanks!!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3866) [Python] Column metadata is not transferred to tables in pyarrow

2018-12-11 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718423#comment-16718423
 ] 

Tanya Schlusser commented on ARROW-3866:


Hello [~frutti93], would you mind if I give it a try? Since it has a "newbie" 
label it looks like my kind of thing :).

> [Python] Column metadata is not transferred to tables in pyarrow
> 
>
> Key: ARROW-3866
> URL: https://issues.apache.org/jira/browse/ARROW-3866
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Seb Fru
>Priority: Major
>  Labels: features, newbie
> Fix For: 0.12.0
>
>
> Hello everyone,
> transferring this from Github for Pyarrow. While working with pyarrow I 
> noticed that field metadata does not get carried foreward when creating a 
> table out of several columns. Is this intended behaviour or is there a way to 
> add column metadata later on? The last command in my example does not return 
> anything.
>  I also could not verify whether this data would be written to parquet later 
> on, because I could not find a way to add field metadata directly to a table.
>  
> {code:java}
> >>>import pyarrow as pa
> >>>import pyarrow.parquet as pq 
> >>>arr1 = pa.array([1,2]) 
> >>>arr2 = pa.array([3,4]) 
> >>>field1 = pa.field('field1', pa.int64()) 
> >>>field2 = pa.field('field2', pa.int64()) 
> >>>field1 = field1.add_metadata({'foo1': 'bar1'}) 
> >>>field2 = field2.add_metadata({'foo2': 'bar2'}) 
> >>>field1.metadata {b'foo1': b'bar1'} 
> >>>field2.metadata {b'foo2': b'bar2'} 
> >>>col1 = pa.column(field1, arr1) 
> >>>col2 = pa.column(field2, arr2) 
> >>>col1.field.metadata {b'foo1': b'bar1'} 
> >>>tab = pa.Table.from_arrays([col1, col2]) 
> >>>tab pyarrow.Table field1: int64 field2: int64 
> >>>tab.column(0).field.metadata
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3792) [Python] Segmentation fault when writing empty RecordBatches to Parquet

2018-12-05 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710926#comment-16710926
 ] 

Tanya Schlusser commented on ARROW-3792:


Sweet! I'll stop on this then :)

> [Python] Segmentation fault when writing empty RecordBatches to Parquet
> ---
>
> Key: ARROW-3792
> URL: https://issues.apache.org/jira/browse/ARROW-3792
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.11.1
> Environment: Fedora 28, pyarrow installed with pip
> Fedora 29, pyarrow installed from conda-forge
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
> Attachments: minimal_bug_arrow3792.py, pq-bug.py
>
>
> h2. Background
> I am trying to convert a very sparse dataset to parquet (~3% rows in a range 
> are populated). The file I am working with spans upto ~63M rows. I decided to 
> iterate in batches of 500k rows, 127 batches in total. Each row batch is a 
> {{RecordBatch}}. I create 4 batches at a time, and write to a parquet file 
> incrementally. Something like this:
> {code:python}
> batches = [..]  # 4 batches
> tbl = pa.Table.from_batches(batches)
> pqwriter.write_table(tbl, row_group_size=15000)
> # same issue with pq.write_table(..)
> {code}
> I was getting a segmentation fault at the final step, I narrowed it down to a 
> specific iteration. I noticed that iteration had empty batches; specifically, 
> [0, 0, 2876, 14423]. The number of rows for each {{RecordBatch}} for the 
> whole dataset is below:
> {code:python}
> [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
> 15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
> 14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
> 14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
> 13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
> 15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
> 15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
> 15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
> 15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
> 16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
> 15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
> 15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
> 18807, 18789, 14258, 0, 0]
> {code}
> On excluding the empty {{RecordBatch}}-es, the segfault goes away, but 
> unfortunately I couldn't create a proper minimal example with synthetic data.
> h2. Not quite minimal example
> The data I am using is from the 1000 Genomes project, which has been public 
> for many years, so we can be reasonably sure the data is good. The following 
> steps should help you replicate the issue.
> # Download the data file (and index), about 330MB:
> {code:bash}
> $ wget 
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
> {code}
> # Install the Cython library {{pysam}}, a thin wrapper around the reference 
> implementation of the VCF file spec. You will need {{zlib}} headers, but 
> that's probably not a problem :)
> {code:bash}
> $ pip3 install --user pysam
> {code}
> # Now you can use the attached script to replicate the crash.
> h2. Extra information
> I have tried attaching gdb; the backtrace when the segfault occurs is shown 
> below (maybe it helps; this is how I realised empty batches could be the 
> reason).
> {code}
> (gdb) bt
> #0  0x7f3e7676d670 in 
> parquet::TypedColumnWriter 
> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray 
> const*) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #1  0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::TypedWriteBatch,
>  arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #2  0x7f3e7673a3d4 in parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #3  0x7f3e7673df09 in 
> parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr
>  const&, long, long) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #4  0x7f3e7673c74d in 
> parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr
>  const&, long, long) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-pa

[jira] [Commented] (ARROW-3792) [Python] Segmentation fault when writing empty RecordBatches to Parquet

2018-12-05 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710914#comment-16710914
 ] 

Tanya Schlusser commented on ARROW-3792:


I can now reproduce the bug with more minimal code (see the attached file 
[minimal_bug_arrow3792.py|https://issues.apache.org/jira/secure/attachment/12950784/minimal_bug_arrow3792.py]).
It is a problem with the column that contains a list: I think the segfault 
occurs when dealing with an empty batch whose schema includes a list column. 
I'm still going to look at it more, but in case I'm slow and someone else 
wants to do it faster, you no longer need to download the genome dataset or 
{{pysam}}.
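
For anyone who wants the shape of it without opening the attachment, a rough 
sketch of such a repro (hypothetical; the attached script is the authoritative 
version):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# A zero-row batch whose schema includes a list column.
list_type = pa.list_(pa.int64())
schema = pa.schema([pa.field('values', list_type)])
empty_batch = pa.RecordBatch.from_arrays([pa.array([], type=list_type)],
                                         ['values'])
tbl = pa.Table.from_batches([empty_batch], schema=schema)

writer = pq.ParquetWriter('/tmp/empty_list.parquet', schema)
writer.write_table(tbl)  # the reported segfault happens on a write like this
writer.close()
{code}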

> [Python] Segmentation fault when writing empty RecordBatches to Parquet
> ---
>
> Key: ARROW-3792
> URL: https://issues.apache.org/jira/browse/ARROW-3792
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.11.1
> Environment: Fedora 28, pyarrow installed with pip
> Fedora 29, pyarrow installed from conda-forge
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
> Attachments: minimal_bug_arrow3792.py, pq-bug.py
>
>
> h2. Background
> I am trying to convert a very sparse dataset to parquet (~3% rows in a range 
> are populated). The file I am working with spans up to ~63M rows. I decided to 
> iterate in batches of 500k rows, 127 batches in total. Each row batch is a 
> {{RecordBatch}}. I create 4 batches at a time, and write to a parquet file 
> incrementally. Something like this:
> {code:python}
> batches = [..]  # 4 batches
> tbl = pa.Table.from_batches(batches)
> pqwriter.write_table(tbl, row_group_size=15000)
> # same issue with pq.write_table(..)
> {code}
> I was getting a segmentation fault at the final step and narrowed it down to a 
> specific iteration. I noticed that iteration had empty batches; specifically, 
> [0, 0, 2876, 14423]. The number of rows for each {{RecordBatch}} for the 
> whole dataset is below:
> {code:python}
> [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
> 15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
> 14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
> 14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
> 13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
> 15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
> 15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
> 15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
> 15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
> 16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
> 15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
> 15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
> 18807, 18789, 14258, 0, 0]
> {code}
> On excluding the empty {{RecordBatch}}-es, the segfault goes away, but 
> unfortunately I couldn't create a proper minimal example with synthetic data.
> h2. Not quite minimal example
> The data I am using is from the 1000 Genomes project, which has been public 
> for many years, so we can be reasonably sure the data is good. The following 
> steps should help you replicate the issue.
> # Download the data file (and index), about 330MB:
> {code:bash}
> $ wget 
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
> {code}
> # Install the Cython library {{pysam}}, a thin wrapper around the reference 
> implementation of the VCF file spec. You will need {{zlib}} headers, but 
> that's probably not a problem :)
> {code:bash}
> $ pip3 install --user pysam
> {code}
> # Now you can use the attached script to replicate the crash.
> h2. Extra information
> I have tried attaching gdb; the backtrace when the segfault occurs is shown 
> below (maybe it helps; this is how I realised empty batches could be the 
> reason).
> {code}
> (gdb) bt
> #0  0x7f3e7676d670 in 
> parquet::TypedColumnWriter 
> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray 
> const*) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #1  0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::TypedWriteBatch,
>  arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #2  0x7f3e7673a3d4 in parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::Write(arrow::Array const

[jira] [Updated] (ARROW-3792) [Python] Segmentation fault when writing empty RecordBatches to Parquet

2018-12-05 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-3792:
---
Attachment: minimal_bug_arrow3792.py

> [Python] Segmentation fault when writing empty RecordBatches to Parquet
> ---
>
> Key: ARROW-3792
> URL: https://issues.apache.org/jira/browse/ARROW-3792
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.11.1
> Environment: Fedora 28, pyarrow installed with pip
> Fedora 29, pyarrow installed from conda-forge
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
> Attachments: minimal_bug_arrow3792.py, pq-bug.py
>
>
> h2. Background
> I am trying to convert a very sparse dataset to parquet (~3% rows in a range 
> are populated). The file I am working with spans up to ~63M rows. I decided to 
> iterate in batches of 500k rows, 127 batches in total. Each row batch is a 
> {{RecordBatch}}. I create 4 batches at a time, and write to a parquet file 
> incrementally. Something like this:
> {code:python}
> batches = [..]  # 4 batches
> tbl = pa.Table.from_batches(batches)
> pqwriter.write_table(tbl, row_group_size=15000)
> # same issue with pq.write_table(..)
> {code}
> I was getting a segmentation fault at the final step and narrowed it down to a 
> specific iteration. I noticed that iteration had empty batches; specifically, 
> [0, 0, 2876, 14423]. The number of rows for each {{RecordBatch}} for the 
> whole dataset is below:
> {code:python}
> [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
> 15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
> 14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
> 14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
> 13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
> 15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
> 15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
> 15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
> 15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
> 16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
> 15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
> 15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
> 18807, 18789, 14258, 0, 0]
> {code}
> On excluding the empty {{RecordBatch}}-es, the segfault goes away, but 
> unfortunately I couldn't create a proper minimal example with synthetic data.
> h2. Not quite minimal example
> The data I am using is from the 1000 Genomes project, which has been public 
> for many years, so we can be reasonably sure the data is good. The following 
> steps should help you replicate the issue.
> # Download the data file (and index), about 330MB:
> {code:bash}
> $ wget 
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
> {code}
> # Install the Cython library {{pysam}}, a thin wrapper around the reference 
> implementation of the VCF file spec. You will need {{zlib}} headers, but 
> that's probably not a problem :)
> {code:bash}
> $ pip3 install --user pysam
> {code}
> # Now you can use the attached script to replicate the crash.
> h2. Extra information
> I have tried attaching gdb; the backtrace when the segfault occurs is shown 
> below (maybe it helps; this is how I realised empty batches could be the 
> reason).
> {code}
> (gdb) bt
> #0  0x7f3e7676d670 in 
> parquet::TypedColumnWriter 
> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray 
> const*) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #1  0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::TypedWriteBatch,
>  arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #2  0x7f3e7673a3d4 in parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #3  0x7f3e7673df09 in 
> parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr
>  const&, long, long) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #4  0x7f3e7673c74d in 
> parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr
>  const&, long, long) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #5  0x0

[jira] [Commented] (ARROW-3792) [Python] Segmentation fault when writing empty RecordBatches to Parquet

2018-12-05 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710248#comment-16710248
 ] 

Tanya Schlusser commented on ARROW-3792:


I have followed Suvayu's instructions and can successfully reproduce the 
segfault. I am going to try working on this, thanks!
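
For anyone following along, a minimal sketch of the incremental-write pattern 
from the report with the workaround applied, i.e. skipping zero-row batches 
({{make_batches}} is a hypothetical stand-in for the VCF-reading code, not 
part of the attached script):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for batches in make_batches():  # hypothetical: yields lists of RecordBatch
    non_empty = [b for b in batches if b.num_rows > 0]
    if not non_empty:
        continue  # excluding empty batches avoids the reported segfault
    tbl = pa.Table.from_batches(non_empty)
    if writer is None:
        writer = pq.ParquetWriter('/tmp/out.parquet', tbl.schema)
    writer.write_table(tbl, row_group_size=15000)
if writer is not None:
    writer.close()
{code}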

> [Python] Segmentation fault when writing empty RecordBatches to Parquet
> ---
>
> Key: ARROW-3792
> URL: https://issues.apache.org/jira/browse/ARROW-3792
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.11.1
> Environment: Fedora 28, pyarrow installed with pip
> Fedora 29, pyarrow installed from conda-forge
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
> Attachments: pq-bug.py
>
>
> h2. Background
> I am trying to convert a very sparse dataset to parquet (~3% rows in a range 
> are populated). The file I am working with spans up to ~63M rows. I decided to 
> iterate in batches of 500k rows, 127 batches in total. Each row batch is a 
> {{RecordBatch}}. I create 4 batches at a time, and write to a parquet file 
> incrementally. Something like this:
> {code:python}
> batches = [..]  # 4 batches
> tbl = pa.Table.from_batches(batches)
> pqwriter.write_table(tbl, row_group_size=15000)
> # same issue with pq.write_table(..)
> {code}
> I was getting a segmentation fault at the final step and narrowed it down to a 
> specific iteration. I noticed that iteration had empty batches; specifically, 
> [0, 0, 2876, 14423]. The number of rows for each {{RecordBatch}} for the 
> whole dataset is below:
> {code:python}
> [14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
> 15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
> 14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
> 14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
> 13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
> 15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
> 15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
> 15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
> 15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
> 16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
> 15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
> 15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
> 18807, 18789, 14258, 0, 0]
> {code}
> On excluding the empty {{RecordBatch}}-es, the segfault goes away, but 
> unfortunately I couldn't create a proper minimal example with synthetic data.
> h2. Not quite minimal example
> The data I am using is from the 1000 Genomes project, which has been public 
> for many years, so we can be reasonably sure the data is good. The following 
> steps should help you replicate the issue.
> # Download the data file (and index), about 330MB:
> {code:bash}
> $ wget 
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
> {code}
> # Install the Cython library {{pysam}}, a thin wrapper around the reference 
> implementation of the VCF file spec. You will need {{zlib}} headers, but 
> that's probably not a problem :)
> {code:bash}
> $ pip3 install --user pysam
> {code}
> # Now you can use the attached script to replicate the crash.
> h2. Extra information
> I have tried attaching gdb; the backtrace when the segfault occurs is shown 
> below (maybe it helps; this is how I realised empty batches could be the 
> reason).
> {code}
> (gdb) bt
> #0  0x7f3e7676d670 in 
> parquet::TypedColumnWriter 
> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray 
> const*) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #1  0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::TypedWriteBatch,
>  arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #2  0x7f3e7673a3d4 in parquet::arrow::(anonymous 
> namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #3  0x7f3e7673df09 in 
> parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr
>  const&, long, long) ()
>from 
> /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
> #4  0x7f3e7673c74d in 
> parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr
>  const&, lon

[jira] [Commented] (ARROW-3629) [Python] Add write_to_dataset to Python Sphinx API listing

2018-12-04 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709088#comment-16709088
 ] 

Tanya Schlusser commented on ARROW-3629:


Pull request [#3089|https://github.com/apache/arrow/pull/3089] is up, provided 
I understood this correctly and it only entails adding a single line to 
{color:#654982}{{python/doc/source/api.rst}}{color}.
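
If it helps review: the change is presumably a single entry in an existing 
autosummary block of {color:#654982}{{api.rst}}{color}, along these lines 
(illustrative, not quoted from the pull request):

{code}
.. autosummary::
   :toctree: generated/

   write_to_dataset
{code}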



Comment:

The doc build was difficult, but possibly because I'm a noob. I'm commenting 
rather than opening a JIRA issue because I have no idea whether these are 
actual issues or just a newbie's lack of knowledge. Running 
{color:#654982}{{dev/gen_apidocs.sh}}{color} on a clean pull, with only my 
single-line change to {color:#654982}{{api.rst}}{color}, failed:

The {color:#654982}{{iwyu}}{color} image in 
{color:#654982}{{dev/docker-compose.yml}}{color} failed with this path issue:
 - {color:#654982}{{ERROR: build path /arrow/dev/iwyu either does 
not exist, is not accessible, or is not a valid URL.}}{color}
 - I commented it out and then could continue.


The Java docs wouldn't compile at first either:
- I think it's because there's a {color:#654982}{{conda install}}{color} of a 
second version of {color:#654982}{{maven}}{color} below the 
{color:#654982}{{apt-get install maven}}{color} in the 
[Dockerfile|https://github.com/apache/arrow/blob/master/dev/gen_apidocs/Dockerfile],
 which puts Java 11 at the front of the {color:#654982}{{PATH}}{color}. That 
breaks the lookup for the class 
{color:#654982}{{javax.annotation.Generated}}{color}, which moved packages 
between [Java 
8|https://docs.oracle.com/javase/8/docs/api/javax/annotation/Generated.html] and 
[Java 
9|https://docs.oracle.com/javase/9/docs/api/javax/annotation/processing/Generated.html]
 (and here is where it landed in [Java 
11|https://docs.oracle.com/en/java/javase/11/docs/api/java.compiler/javax/annotation/processing/Generated.html]).
- When I deleted that line in the Dockerfile, the Java code compiled but 
didn't pass a test because of a different missing dependency (which I didn't 
note; happy to figure it out if it's actually meaningful).
- So I commented out the Java build section in 
{color:#654982}{{dev/gen_apidocs/create_documents.sh}}{color}.


The Javascript docs also failed on a dependency I didn't note (happy to; I 
just didn't want to waste time if it's my noob problem):
 - so I commented it out too; the remaining doc generation then worked.

Please disregard if it's my lack of understanding. Otherwise I am happy to 
investigate further or file issues :).

> [Python] Add write_to_dataset to Python Sphinx API listing
> --
>
> Key: ARROW-3629
> URL: https://issues.apache.org/jira/browse/ARROW-3629
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2860) [Python] Null values in a single partition of Parquet dataset, results in invalid schema on read

2018-12-03 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708137#comment-16708137
 ] 

Tanya Schlusser commented on ARROW-2860:


I think this was resolved by ARROW-2891 
(https://issues.apache.org/jira/browse/ARROW-2891), via pull request 
[#2302|https://github.com/apache/arrow/pull/2302].

When I run {{example_failure.py}}, it does not fail and returns the expected 
result.

> [Python] Null values in a single partition of Parquet dataset, results in 
> invalid schema on read
> 
>
> Key: ARROW-2860
> URL: https://issues.apache.org/jira/browse/ARROW-2860
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Sam Oluwalana
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> from datetime import datetime, timedelta
> def generate_data(event_type, event_id, offset=0):
>     """Generate data."""
>     now = datetime.utcnow() + timedelta(seconds=offset)
>     obj = {
>         'event_type': event_type,
>         'event_id': event_id,
>         'event_date': now.date(),
>         'foo': None,
>         'bar': u'hello',
>     }
>     if event_type == 2:
>         obj['foo'] = 1
>         obj['bar'] = u'world'
>     if event_type == 3:
>         obj['different'] = u'data'
>         obj['bar'] = u'event type 3'
>     else:
>         obj['different'] = None
>     return obj
> data = [
>     generate_data(1, 1, 1),
>     generate_data(1, 1, 3600 * 72),
>     generate_data(2, 1, 1),
>     generate_data(2, 1, 3600 * 72),
>     generate_data(3, 1, 1),
>     generate_data(3, 1, 3600 * 72),
> ]
> df = pd.DataFrame.from_records(data, index='event_id')
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/tmp/events',
>     partition_cols=['event_type', 'event_date'])
> dataset = pq.ParquetDataset('/tmp/events')
> table = dataset.read()
> print(table.num_rows)
> {code}
> Expected output:
> {code:python}
> 6
> {code}
> Actual:
> {code:python}
> python example_failure.py
> Traceback (most recent call last):
>   File "example_failure.py", line 43, in 
> dataset = pq.ParquetDataset('/tmp/events')
>   File 
> "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
>  line 745, in __init__
> self.validate_schemas()
>   File 
> "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
>  line 775, in validate_schemas
> dataset_schema))
> ValueError: Schema in partition[event_type=2, event_date=0] 
> /tmp/events/event_type=3/event_date=2018-07-16 
> 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
> bar: string
> different: string
> foo: double
> event_id: int64
> metadata
> 
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
> "columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
> "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
> "field_name": "different", "name": "different", "numpy_type": "object", 
> "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": 
> "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
> "field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
> "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
> null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> vs
> bar: string
> different: null
> foo: double
> event_id: int64
> metadata
> 
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
> "columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
> "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
> "field_name": "different", "name": "different", "numpy_type": "object", 
> "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": 
> "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
> "field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
> "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
> null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> {code}
> Apparently what is happening is that pyarrow interprets the schema of each 
> partition individually, and the partitions for `event_type=3 / event_date=*` 
> both have values for the column `different`, whereas the other partitions do 
> not. The discrepancy causes the `None` values of the other partitions to be 
> labeled with `pandas_type` `empty` instead of `unicode`.
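
For datasets already written this way, one mitigation sketch is to skip the 
per-partition schema comparison entirely; the {{validate_schema}} flag on 
{{ParquetDataset}} is assumed from this era's API, so treat the snippet as 
illustrative:

{code:python}
import pyarrow.parquet as pq

# Skip the schema validation that raises the ValueError above.
dataset = pq.ParquetDataset('/tmp/events', validate_schema=False)
table = dataset.read()
print(table.num_rows)  # expected: 6
{code}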



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2504) [Website] Add ApacheCon NA link

2018-11-25 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698493#comment-16698493
 ] 

Tanya Schlusser commented on ARROW-2504:


Newbie here; this looks like a good first issue for me, so I'm claiming it. Thank you!

> [Website] Add ApacheCon NA link
> ---
>
> Key: ARROW-2504
> URL: https://issues.apache.org/jira/browse/ARROW-2504
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
>
> See instructions in http://apache.org/events/README.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)