Re: Benchmarking dashboard proposal

2019-02-20 Thread Tanya Schlusser
>
> Side question: is it expected to be able to connect to the DB directly
> from the outside?  I don't have any clue about the possible security
> implications.


This is doable by creating separate database accounts. Also, Wes's
solution was to back up the database periodically (daily?) to protect
against accidents. The current setup has a root user (full permissions),
an `arrow_anonymous` user (select + insert only), and an `arrow_admin` user
(select, insert, update, delete).
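
As a rough sketch of what those grants could look like (assuming PostgreSQL
with psycopg2 and a database named `benchmark`; the passwords, connection
parameters, and the `public` schema are placeholders, not the actual
deployment):

    # Sketch only: create the two non-root accounts described above and
    # grant the corresponding privileges. Connection parameters, passwords,
    # and the "public" schema are assumptions, not the real setup.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="benchmark",
                            user="postgres", password="changeme")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("CREATE ROLE arrow_anonymous LOGIN PASSWORD 'changeme'")
    cur.execute("CREATE ROLE arrow_admin LOGIN PASSWORD 'changeme'")

    # arrow_anonymous: select + insert only.
    cur.execute("GRANT SELECT, INSERT ON ALL TABLES IN SCHEMA public "
                "TO arrow_anonymous")
    # arrow_admin: select, insert, update, delete (still no DDL).
    cur.execute("GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES "
                "IN SCHEMA public TO arrow_admin")

    cur.close()
    conn.close()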

On Wed, Feb 20, 2019 at 12:19 PM Antoine Pitrou  wrote:

>
> Side question: is it expected to be able to connect to the DB directly
> from the outside?  I don't have any clue about the possible security
> implications.
>
> Regards
>
> Antoine.
>
>
>
> On 20/02/2019 at 18:55, Melik-Adamyan, Areg wrote:
> > There is a lot of discussion going on in the PR for ARROW-4313 itself;
> > I would like to bring some of the high-level questions here for
> > discussion. First of all, many thanks to Tanya for the work you are
> > doing.
> > Regarding the dashboard itself, I would like to set a scope and stick
> > to it, so that we do not waste any effort and get the maximum value
> > from the work we are doing on the dashboard.
> > One thing that IMHO we are missing is which requirements the work (the
> > DDL) is being done against, and within what scope. For me there are
> > several things:
> > 1. We want continuous *validated* performance tracking against
> > check-ins, to catch performance regressions and progressions.
> > Validated means that the running environment is isolated enough that
> > the standard deviation (assuming a normal distribution) is as close to
> > 0 as possible. That means both hardware and software should be fixed
> > and unchanging, so there is only one variable to measure.
> > 2. The benchmarking framework (google/benchmark) can report the needed
> > data for each benchmark in textual format, with a preamble containing
> > information about the machine on which the benchmarks are run.
> > 3. So with environments set and regular runs you have all the
> > artifacts, though not in a very digestible format. The reason to set
> > up a dashboard is to make the data easy to consume and to track the
> > performance of the various parts historically, and much more nicely,
> > with visualizations.
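
To illustrate point 2 above, a minimal sketch of reading that preamble
(the binary path is hypothetical; google/benchmark's
--benchmark_format=json option places the machine information under a
top-level "context" key):

    # Sketch only: run one benchmark binary and read the machine preamble.
    # The binary path is hypothetical; the JSON layout is google/benchmark's
    # standard --benchmark_format=json output.
    import json
    import subprocess

    BENCHMARK_BIN = "./build/release/arrow-builder-benchmark"  # hypothetical

    out = subprocess.run(
        [BENCHMARK_BIN, "--benchmark_format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    report = json.loads(out)

    # The "context" preamble describes the machine the benchmarks ran on.
    ctx = report["context"]
    print(ctx["date"], ctx["num_cpus"], ctx["mhz_per_cpu"],
          ctx["cpu_scaling_enabled"])
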
> > And here are the scope restrictions I have in mind:
> > - Disallow single, one-off benchmark runs from entering the central
> > repo, as they do not mean much for continuous and statistically
> > relevant measurements. What information do you get if someone reports
> > a single run? You do not know how cleanly it was done and, more
> > importantly, whether it can be reproduced elsewhere. That is why,
> > whether the result is better, worse, or the same, you cannot compare
> > it with the data already in the DB.
> > - Mandate that contributors have a dedicated environment for
> > measurements. Otherwise they can use TeamCity to run and parse the
> > data and publish it on their own site. Data that enters the Arrow
> > performance DB becomes Arrow community-owned data, and it becomes the
> > community's job to answer why certain things are better or worse.
> > - Because the number of CPU/GPU/accelerator makes and flavors is huge,
> > we cannot satisfy every need up front and create a DB that covers all
> > possible variants. I think we should have simple CPU and GPU configs
> > now, even if they are not perfect. By simple I mean the basic brand
> > string; that should be enough. Having all the detailed info in the DB
> > does not make sense; in my experience you never use it, you use the
> > CPUID/brand name to get the info you need.
> > - Scope and requirements will change over time, and going big now will
> > make things complicated later. So I think it will be beneficial to get
> > something up and running quickly, gain a better understanding of our
> > needs and gaps, and go from there.
> > The needed infrastructure is already up on AWS, so as soon as we
> > resolve the DNS and key-exchange issues we can launch.
> >
> > -Areg.
> >
> > -Original Message-
> > From: Tanya Schlusser [mailto:ta...@tickel.net]
> > Sent: Thursday, February 7, 2019 4:40 PM
> > To: dev@arrow.apache.org
> > Subject: Re: Benchmarking dashboard proposal
> >
> > Late, but there's a PR now with first-draft DDL (
> https://github.com/apache/arrow/pull/3586).
> > Happy to receive any feedback!
> >
> > I tried to think about how people would submit benchmarks, and added a
> Postgraphile container for http-via-GraphQL.
> > If others have strong opinions on the data modeling please speak up
> because I'm more a database user than a designer.
> >
> > I can also help with benchmarking work in R/Python given guidance/a
> roadmap/examples from someone else.

Re: Google Summer of Code 2019 for Apache Arrow

2019-02-18 Thread Tanya Schlusser
Would developing an open standard for in-memory records qualify as
'GSoC-worthy'?

In reference to this placeholder in the Confluence wiki:

https://cwiki.apache.org/confluence/display/ARROW/Apache+Arrow+Home#ApacheArrowHome-Developinganopenstandardforin-memoryrecords
which links to ARROW-1790
  https://issues.apache.org/jira/browse/ARROW-1790
and to this thread

https://lists.apache.org/thread.html/4818cb3d2ffb4677b24a4279c329fc518a1ac1c9d3017399a4269199@%3Cdev.arrow.apache.org%3E

Developing a standard, or even just starting a standards working group,
would be quite a contribution, and would give a grad student the opportunity
to contact multiple leaders in the field. (I am thinking of something along
the lines of the Data Mining Group http://dmg.org/, which I believe is run
by a local professor here in Chicago).

I don't know many people, but I can ping that professor and maybe some others
locally if people think this seems like a GSoC-worthy project.

Best,
Tanya

On Fri, Feb 1, 2019 at 8:16 AM Wes McKinney  wrote:

> hi folks,
>
> We are looking for project ideas and mentors for GSoC 2019. I created a
> JIRA
>
> https://issues.apache.org/jira/browse/COMDEV-309
>
> about a couple of project ideas for the C++ library.
>
> Since Arrow isn't the _easiest_ project to contribute to, we probably
> need to calibrate expectations for what a new contributor can
> accomplish in a 3 month GSoC project. A good chunk of time at the
> beginning will be spent ramping up. If anyone has project ideas (which
> need not be in C++) or wants to be a mentor, the deadline is fast
> approaching.
>
> Thanks,
> Wes
>


Re: Benchmarking dashboard proposal

2019-02-07 Thread Tanya Schlusser
Late, but there's a PR now with first-draft DDL (
https://github.com/apache/arrow/pull/3586).
Happy to receive any feedback!

I tried to think about how people would submit benchmarks, and added a
Postgraphile container for http-via-GraphQL.
If others have strong opinions on the data modeling please speak up because
I'm more a database user than a designer.
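
For anyone curious, a rough sketch of what a submission through that
Postgraphile endpoint might look like (the endpoint URL, the benchmark_run
table, and therefore the generated mutation and field names are assumptions;
the real names depend on the DDL in the PR):

    # Sketch only: submit one result via PostGraphile's auto-generated
    # GraphQL mutation. Endpoint, mutation, and field names are assumptions
    # derived from a hypothetical "benchmark_run" table.
    import requests

    GRAPHQL_URL = "http://localhost:5000/graphql"

    mutation = """
    mutation SubmitRun($input: CreateBenchmarkRunInput!) {
      createBenchmarkRun(input: $input) {
        benchmarkRun { id }
      }
    }
    """

    variables = {
        "input": {
            "benchmarkRun": {
                "benchmarkName": "BM_BuildDictionary",  # hypothetical
                "gitCommit": "abc1234",
                "realTimeNs": 1234.5,
            }
        }
    }

    resp = requests.post(GRAPHQL_URL,
                         json={"query": mutation, "variables": variables})
    resp.raise_for_status()
    print(resp.json())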

I can also help with benchmarking work in R/Python given guidance/a
roadmap/examples from someone else.

Best,
Tanya

On Mon, Feb 4, 2019 at 12:37 PM Tanya Schlusser  wrote:

> I hope to make a PR with the DDL by tomorrow or Wednesday night—DDL along
> with a README in a new directory `arrow/dev/benchmarking` unless directed
> otherwise.
>
> A "C++ Benchmark Collector" script would be super. I expect some
> back-and-forth on this to identify naïve assumptions in the data model.
>
> Attempting to submit actual benchmarks is how to get a handle on that. I
> recognize I'm blocking downstream work. Better to get an initial PR and
> some discussion going.
>
> Best,
> Tanya
>
> On Mon, Feb 4, 2019 at 10:10 AM Wes McKinney  wrote:
>
>> hi folks,
>>
>> I'm curious where we currently stand on this project. I see the
>> discussion in https://issues.apache.org/jira/browse/ARROW-4313 --
>> would the next step be to have a pull request with .sql files
>> containing the DDL required to create the schema in PostgreSQL?
>>
>> I could volunteer to write the "C++ Benchmark Collector" script that
>> will run all the benchmarks on Linux and collect their data to be
>> inserted into the database.
>>
>> Thanks
>> Wes
>>
>> On Sun, Jan 27, 2019 at 12:20 AM Tanya Schlusser 
>> wrote:
>> >
>> > I don't want to be the bottleneck and have posted an initial draft data
>> > model in the JIRA issue
>> https://issues.apache.org/jira/browse/ARROW-4313
>> >
>> > It should not be a problem to get content into a form that would be
>> > acceptable for either a static site like ASV (via CORS queries to a
>> > GraphQL/REST interface) or a codespeed-style site (via a separate schema
>> > organized for Django)
>> >
>> > I don't think I'm experienced enough to actually write any benchmarks
>> > though, so all I can contribute is backend work for this task.
>> >
>> > Best,
>> > Tanya
>> >
>> > On Sat, Jan 26, 2019 at 7:37 PM Wes McKinney 
>> wrote:
>> >
>> > > hi folks,
>> > >
>> > > I'd like to propose some kind of timeline for getting a first
>> > > iteration of a benchmark database developed and live, with scripts to
>> > > enable one or more initial agents to start adding new data on a daily
>> > > / per-commit basis. I have at least 3 physical machines where I could
>> > > immediately set up cron jobs to start adding new data, and I could
>> > > attempt to backfill data as far back as possible.
>> > >
>> > > Personally, I would like to see this done by the end of February if
>> > > not sooner -- if we don't have the volunteers to push the work to
>> > > completion by then please let me know as I will rearrange my
>> > > priorities to make sure that it happens. Does that sound reasonable?
>> > >
>> > > Please let me know if this plan sounds reasonable:
>> > >
>> > > * Set up a hosted PostgreSQL instance, configure backups
>> > > * Propose and adopt a database schema for storing benchmark results
>> > > * For C++, write script (or Dockerfile) to execute all
>> > > google-benchmarks, output results to JSON, then adapter script
>> > > (Python) to ingest into database
>> > > * For Python, similar script that invokes ASV, then inserts ASV
>> > > results into benchmark database
>> > >
>> > > This seems to be a pre-requisite for having a front-end to visualize
>> > > the results, but the dashboard/front end can hopefully be implemented
>> > > in such a way that the details of the benchmark database are not too
>> > > tightly coupled
>> > >
>> > > (Do we have any other benchmarks in the project that would need to be
>> > > inserted initially?)
>> > >
>> > > Related work to trigger benchmarks on agents when new commits land in
>> > > master can happen concurrently -- one task need not block the other
>> > >
>> > > Thanks
>> > > Wes
>> > >
>> > > On Mon, Jan 21, 2019 at 11:14 AM Wes McKinney 
>>

Re: Benchmarking dashboard proposal

2019-02-04 Thread Tanya Schlusser
I hope to make a PR with the DDL by tomorrow or Wednesday night—DDL along
with a README in a new directory `arrow/dev/benchmarking` unless directed
otherwise.

A "C++ Benchmark Collector" script would be super. I expect some
back-and-forth on this to identify naïve assumptions in the data model.

Attempting to submit actual benchmarks is how to get a handle on that. I
recognize I'm blocking downstream work. Better to get an initial PR and
some discussion going.

Best,
Tanya

On Mon, Feb 4, 2019 at 10:10 AM Wes McKinney  wrote:

> hi folks,
>
> I'm curious where we currently stand on this project. I see the
> discussion in https://issues.apache.org/jira/browse/ARROW-4313 --
> would the next step be to have a pull request with .sql files
> containing the DDL required to create the schema in PostgreSQL?
>
> I could volunteer to write the "C++ Benchmark Collector" script that
> will run all the benchmarks on Linux and collect their data to be
> inserted into the database.
>
> Thanks
> Wes
>
> On Sun, Jan 27, 2019 at 12:20 AM Tanya Schlusser  wrote:
> >
> > I don't want to be the bottleneck and have posted an initial draft data
> > model in the JIRA issue https://issues.apache.org/jira/browse/ARROW-4313
> >
> > It should not be a problem to get content into a form that would be
> > acceptable for either a static site like ASV (via CORS queries to a
> > GraphQL/REST interface) or a codespeed-style site (via a separate schema
> > organized for Django)
> >
> > I don't think I'm experienced enough to actually write any benchmarks
> > though, so all I can contribute is backend work for this task.
> >
> > Best,
> > Tanya
> >
> > On Sat, Jan 26, 2019 at 7:37 PM Wes McKinney 
> wrote:
> >
> > > hi folks,
> > >
> > > I'd like to propose some kind of timeline for getting a first
> > > iteration of a benchmark database developed and live, with scripts to
> > > enable one or more initial agents to start adding new data on a daily
> > > / per-commit basis. I have at least 3 physical machines where I could
> > > immediately set up cron jobs to start adding new data, and I could
> > > attempt to backfill data as far back as possible.
> > >
> > > Personally, I would like to see this done by the end of February if
> > > not sooner -- if we don't have the volunteers to push the work to
> > > completion by then please let me know as I will rearrange my
> > > priorities to make sure that it happens. Does that sound reasonable?
> > >
> > > Please let me know if this plan sounds reasonable:
> > >
> > > * Set up a hosted PostgreSQL instance, configure backups
> > > * Propose and adopt a database schema for storing benchmark results
> > > * For C++, write script (or Dockerfile) to execute all
> > > google-benchmarks, output results to JSON, then adapter script
> > > (Python) to ingest into database
> > > * For Python, similar script that invokes ASV, then inserts ASV
> > > results into benchmark database
> > >
> > > This seems to be a pre-requisite for having a front-end to visualize
> > > the results, but the dashboard/front end can hopefully be implemented
> > > in such a way that the details of the benchmark database are not too
> > > tightly coupled
> > >
> > > (Do we have any other benchmarks in the project that would need to be
> > > inserted initially?)
> > >
> > > Related work to trigger benchmarks on agents when new commits land in
> > > master can happen concurrently -- one task need not block the other
> > >
> > > Thanks
> > > Wes
> > >
> > > On Mon, Jan 21, 2019 at 11:14 AM Wes McKinney 
> wrote:
> > > >
> > > > Sorry, copy-paste failure:
> > > https://issues.apache.org/jira/browse/ARROW-4313
> > > >
> > > > On Mon, Jan 21, 2019 at 11:14 AM Wes McKinney 
> > > wrote:
> > > > >
> > > > > I don't think there is one but I just created
> > > > >
> > >
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
> > > > >
> > > > > On Mon, Jan 21, 2019 at 10:35 AM Tanya Schlusser  >
> > > wrote:
> > > > > >
> > > > > > Areg,
> > > > > >
> > > > > > If you'd like help, I volunteer! No experience benchmarking but
> tons
> > > > > > experience databasing—I can mock the backend (database + ht

[jira] [Created] (ARROW-4429) Add git rebase tips to the 'Contributing' page in the developer docs

2019-01-30 Thread Tanya Schlusser (JIRA)
Tanya Schlusser created ARROW-4429:
--

 Summary: Add git rebase tips to the 'Contributing' page in the 
developer docs
 Key: ARROW-4429
 URL: https://issues.apache.org/jira/browse/ARROW-4429
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Tanya Schlusser


A recent discussion on the mailing list (link below) asked how contributors 
should handle rebasing. It would be helpful if those tips made it into the 
developer documentation somehow. I suggest putting them on the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 page—currently a wiki, but hopefully eventually part of the Sphinx docs 
(ARROW-4427).

Here is the relevant thread:

[https://lists.apache.org/thread.html/c74d8027184550b8d9041e3f2414b517ffb76ccbc1d5aa4563d364b6@%3Cdev.arrow.apache.org%3E]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4427) Move "Contributing to Apache Arrow" page to the static docs

2019-01-30 Thread Tanya Schlusser (JIRA)
Tanya Schlusser created ARROW-4427:
--

 Summary: Move "Contributing to Apache Arrow" page to the static 
docs
 Key: ARROW-4427
 URL: https://issues.apache.org/jira/browse/ARROW-4427
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Tanya Schlusser


It's hard to find and modify the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 wiki page in Confluence. If it were moved into the static website, it would 
be easier to find and edit.

There are two steps to this:
 # Copy the wiki page contents to a new web page at the top "site" level (under 
arrow/site/, just like the [committers 
page|https://github.com/apache/arrow/blob/master/site/committers.html]), maybe 
named "contributing.html" or something similar.
 # Modify the [navigation section in 
arrow/site/_includes/header.html|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
 to point to the newly created page instead of the wiki page.

The affected pages are all part of the Jekyll components, so there isn't a need 
to build the Sphinx part of the docs to check your work.
  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4425) Add link to 'Contributing' page in the top-level Arrow README

2019-01-30 Thread Tanya Schlusser (JIRA)
Tanya Schlusser created ARROW-4425:
--

 Summary: Add link to 'Contributing' page in the top-level Arrow 
README
 Key: ARROW-4425
 URL: https://issues.apache.org/jira/browse/ARROW-4425
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Tanya Schlusser


It would be nice to add a link to the ["Contributing to Apache 
Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
 Confluence page directly from the main project 
[README|https://github.com/apache/arrow/blob/master/README.md] (in the already 
existing "Getting involved" section) because it's a bit hard to find right now.

"contributing" page: 
[https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]

main project README: [https://github.com/apache/arrow/blob/master/README.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Git workflow question

2019-01-30 Thread Tanya Schlusser
This information might be useful to put on the 'contributing' page:
https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow
I attempted to add it but don't have permission. It was one of my stumbling
points too and I'm thankful someone else asked about it.

On Wed, Jan 30, 2019 at 12:00 AM Ravindra Pindikura 
wrote:

>
>
>
> > On Jan 30, 2019, at 11:05 AM, Andy Grove  wrote:
> >
> > Got it. Thanks for the clarification.
> >
> > On Tue, Jan 29, 2019 at 10:30 PM Wes McKinney 
> wrote:
> >
> >> hi Andy,
> >>
> >> yes, in this project I recommend never using "git merge". Merge
> >> commits just make branches harder to maintain when master is not using
> >> "merge" for merging patches.
> >>
> >> It is semantically simpler in the case of conflicts with master to use
> >> "git rebase -i" to combine your changes into a single commit, then
> >> "git rebase master" and resolve the conflicts then.
>
> Here’s the workflow that I use :
>
> git fetch upstream
> git log -> count my local commits, and remember it as ‘X'
> git rebase -i HEAD~X
> git rebase upstream/master
> git push -f
>
>
> I’m not able to avoid the ‘-f’ in the last step, but Wes had recommended
> that we avoid the force option. Is there a better way to do this?
>
> Thanks & regards,
> Ravindra,
>
> >>
> >> A linear commit history, with all patches landing in master as single
> >> commits, makes life significantly easier for downstream users who may be
> >> cherry-picking fixes into maintenance branches. The alternative -- trying to
> >> sift the changes you want out of a tangled web of merge commits --
> >> would be utter madness.
> >>
> >> - Wes
> >>
> >> On Tue, Jan 29, 2019 at 11:20 PM Andy Grove 
> wrote:
> >>>
> >>> I've been struggling a bit with the workflow and I think I see what I'm
> >>> doing wrong now but wanted to confirm.
> >>>
> >>> I've been running the following to keep my fork up to date:
> >>>
> >>> git checkout master
> >>> git fetch upstream
> >>> git merge upstream/master
> >>> git push origin
> >>>
> >>> And then to update my branch I have been doing:
> >>>
> >>> git checkout ARROW-
> >>> git merge master
> >>> git push origin
> >>>
> >>> This generally has worked but sometimes I seem to pick up random
> commits
> >> on
> >>> my branch.
> >>>
> >>> Reading the github fork workflow docs again it looks like I should have
> >>> been running "git rebase master" instead of "git merge master" ?
> >>>
> >>> Is that the only mistake I'm making?
> >>>
> >>> Thanks,
> >>>
> >>> Andy.
> >>
>
>


Re: Benchmarking dashboard proposal

2019-01-26 Thread Tanya Schlusser
I don't want to be the bottleneck and have posted an initial draft data
model in the JIRA issue https://issues.apache.org/jira/browse/ARROW-4313

It should not be a problem to get content into a form that would be
acceptable for either a static site like ASV (via CORS queries to a
GraphQL/REST interface) or a codespeed-style site (via a separate schema
organized for Django)

I don't think I'm experienced enough to actually write any benchmarks
though, so all I can contribute is backend work for this task.

Best,
Tanya

On Sat, Jan 26, 2019 at 7:37 PM Wes McKinney  wrote:

> hi folks,
>
> I'd like to propose some kind of timeline for getting a first
> iteration of a benchmark database developed and live, with scripts to
> enable one or more initial agents to start adding new data on a daily
> / per-commit basis. I have at least 3 physical machines where I could
> immediately set up cron jobs to start adding new data, and I could
> attempt to backfill data as far back as possible.
>
> Personally, I would like to see this done by the end of February if
> not sooner -- if we don't have the volunteers to push the work to
> completion by then please let me know as I will rearrange my
> priorities to make sure that it happens. Does that sound reasonable?
>
> Please let me know if this plan sounds reasonable:
>
> * Set up a hosted PostgreSQL instance, configure backups
> * Propose and adopt a database schema for storing benchmark results
> * For C++, write script (or Dockerfile) to execute all
> google-benchmarks, output results to JSON, then adapter script
> (Python) to ingest into database
> * For Python, similar script that invokes ASV, then inserts ASV
> results into benchmark database
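
A minimal sketch of what that C++ adapter step could look like (assuming
JSON produced with google-benchmark's --benchmark_format=json and a
hypothetical benchmark_result table; the column names are placeholders until
the schema is settled):

    # Sketch only: load a google-benchmark JSON report and insert one row
    # per benchmark into PostgreSQL. Table and column names are placeholders.
    import json
    import sys

    import psycopg2
    from psycopg2.extras import execute_values

    with open(sys.argv[1]) as f:
        report = json.load(f)

    context = report["context"]  # machine preamble from google-benchmark
    rows = [
        (context["date"], bench["name"], bench["real_time"],
         bench["cpu_time"], bench["time_unit"], bench["iterations"])
        for bench in report["benchmarks"]
    ]

    conn = psycopg2.connect(dbname="benchmark", user="arrow_anonymous",
                            password="changeme")
    with conn, conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO benchmark_result "
            "(run_timestamp, benchmark_name, real_time, cpu_time, "
            " time_unit, iterations) VALUES %s",
            rows,
        )
    conn.close()
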
>
> This seems to be a pre-requisite for having a front-end to visualize
> the results, but the dashboard/front end can hopefully be implemented
> in such a way that the details of the benchmark database are not too
> tightly coupled
>
> (Do we have any other benchmarks in the project that would need to be
> inserted initially?)
>
> Related work to trigger benchmarks on agents when new commits land in
> master can happen concurrently -- one task need not block the other
>
> Thanks
> Wes
>
> On Mon, Jan 21, 2019 at 11:14 AM Wes McKinney  wrote:
> >
> > Sorry, copy-paste failure:
> https://issues.apache.org/jira/browse/ARROW-4313
> >
> > On Mon, Jan 21, 2019 at 11:14 AM Wes McKinney 
> wrote:
> > >
> > > I don't think there is one but I just created
> > >
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
> > >
> > > On Mon, Jan 21, 2019 at 10:35 AM Tanya Schlusser 
> wrote:
> > > >
> > > > Areg,
> > > >
> > > > If you'd like help, I volunteer! No experience benchmarking, but tons
> > > > of experience databasing—I can mock the backend (database + http) as a
> > > > starting point for discussion if this is the way people want to go.
> > > >
> > > > Is there a Jira ticket for this that I can jump into?
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Jan 20, 2019 at 3:24 PM Wes McKinney 
> wrote:
> > > >
> > > > > hi Areg,
> > > > >
> > > > > This sounds great -- we've discussed building a more full-featured
> > > > > benchmark automation system in the past but nothing has been
> developed
> > > > > yet.
> > > > >
> > > > > Your proposal about the details sounds OK; the single most
> important
> > > > > thing to me is that we build and maintain a very general purpose
> > > > > database schema for building the historical benchmark database
> > > > >
> > > > > The benchmark database should keep track of:
> > > > >
> > > > > * Timestamp of benchmark run
> > > > > * Git commit hash of codebase
> > > > > * Machine unique name (sort of the "user id")
> > > > > * CPU identification for machine, and clock frequency (in case of
> > > > > overclocking)
> > > > > * CPU cache sizes (L1/L2/L3)
> > > > > * Whether or not CPU throttling is enabled (if it can be easily
> determined)
> > > > > * RAM size
> > > > > * GPU identification (if any)
> > > > > * Benchmark unique name
> > > > > * Programming language(s) associated with benchmark (e.g. a
> benchmark
> > > > > may involve both C++ and Python)
> > > > > * Benchma

Re: Benchmarking dashboard proposal

2019-01-21 Thread Tanya Schlusser
Areg,

If you'd like help, I volunteer! No experience benchmarking, but tons of
experience databasing—I can mock the backend (database + http) as a
starting point for discussion if this is the way people want to go.

Is there a Jira ticket for this that I can jump into?




On Sun, Jan 20, 2019 at 3:24 PM Wes McKinney  wrote:

> hi Areg,
>
> This sounds great -- we've discussed building a more full-featured
> benchmark automation system in the past but nothing has been developed
> yet.
>
> Your proposal about the details sounds OK; the single most important
> thing to me is that we build and maintain a very general purpose
> database schema for building the historical benchmark database
>
> The benchmark database should keep track of:
>
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
>
> (maybe some other things)
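
As a rough sketch of a schema capturing the fields listed above (run from
Python to match the adapter-script approach; the table and column names and
types are placeholders, not the schema eventually adopted):

    # Sketch only: two illustrative tables covering the fields listed above.
    # Names and types are placeholders, not the adopted schema.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS machine (
        machine_id        SERIAL PRIMARY KEY,
        machine_name      TEXT UNIQUE NOT NULL,  -- the "user id"
        cpu_model         TEXT NOT NULL,
        cpu_frequency_hz  BIGINT,                -- in case of overclocking
        l1_cache_bytes    INTEGER,
        l2_cache_bytes    INTEGER,
        l3_cache_bytes    INTEGER,
        cpu_throttling    BOOLEAN,               -- NULL if undetermined
        ram_bytes         BIGINT,
        gpu_model         TEXT                   -- NULL if no GPU
    );

    CREATE TABLE IF NOT EXISTS benchmark_run (
        run_id          SERIAL PRIMARY KEY,
        machine_id      INTEGER NOT NULL REFERENCES machine (machine_id),
        run_timestamp   TIMESTAMP NOT NULL,
        git_commit      TEXT NOT NULL,
        benchmark_name  TEXT NOT NULL,
        languages       TEXT[] NOT NULL,         -- e.g. {C++, Python}
        elapsed_time    DOUBLE PRECISION NOT NULL,
        time_mean       DOUBLE PRECISION,        -- NULL if unavailable
        time_stddev     DOUBLE PRECISION         -- NULL if unavailable
    );
    """

    conn = psycopg2.connect(dbname="benchmark", user="postgres",
                            password="changeme")
    with conn, conn.cursor() as cur:
        cur.execute(DDL)
    conn.close()
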
>
> I would rather not be locked into the internal database schema of a
> particular benchmarking tool. So people in the community can just run
> SQL queries against the database and use the data however they like.
> We'll just have to be careful that people don't DROP TABLE or DELETE
> (but we should have daily backups so we can recover from such cases)
>
> So while we may make use of TeamCity to schedule the runs on the cloud
> and physical hardware, we should also provide a path for other people
> in the community to add data to the benchmark database on their
> hardware on an ad hoc basis. For example, I have several machines in
> my home on all operating systems (Windows / macOS / Linux, and soon
> also ARM64) and I'd like to set up scheduled tasks / cron jobs to
> report in to the database at least on a daily basis.
>
> Ideally the benchmark database would just be a PostgreSQL server with
> a schema we write down and keep backed up etc. Hosted PostgreSQL is
> inexpensive ($200+ per year depending on size of instance; this
> probably doesn't need to be a crazy big machine)
>
> I suspect there will be a manageable amount of development involved to
> glue each of the benchmarking frameworks together with the benchmark
> database. This can also handle querying the operating system for the
> system information listed above
>
> Thanks
> Wes
>
> On Fri, Jan 18, 2019 at 12:14 AM Melik-Adamyan, Areg
>  wrote:
> >
> > Hello,
> >
> > I want to restart/rejoin the discussion about creating an Arrow
> > benchmarking dashboard. I propose running performance benchmarks per
> > commit to track changes.
> > The proposal includes building infrastructure for per-commit tracking,
> > comprising the following parts:
> > - Hosted JetBrains for OSS https://teamcity.jetbrains.com/ as a build
> system
> > - Agents running in cloud both VM/container (DigitalOcean, or others)
> and bare-metal (Packet.net/AWS) and on-premise(Nvidia boxes?)
> > - JFrog artifactory storage and management for OSS projects
> https://jfrog.com/open-source/#artifactory2
> > - Codespeed as a frontend https://github.com/tobami/codespeed
> >
> > I am volunteering to build such a system (if needed, more Intel folks
> > will be involved) so we can start tracking performance on various
> > platforms and understand how changes affect it.
> >
> > Please, let me know your thoughts!
> >
> > Thanks,
> > -Areg.
> >
> >
> >
>


[jira] [Created] (ARROW-4039) Update link to 'development.rst' page from Python README.md

2018-12-15 Thread Tanya Schlusser (JIRA)
Tanya Schlusser created ARROW-4039:
--

 Summary: Update link to 'development.rst' page from Python 
README.md
 Key: ARROW-4039
 URL: https://issues.apache.org/jira/browse/ARROW-4039
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation, Python
Reporter: Tanya Schlusser


When the Sphinx docs were restructured, the link in the 
[README|https://github.com/apache/arrow/blob/master/python/README.md]  changed 
from

[https://github.com/apache/arrow/blob/master/python/doc/source/development.rst]

to

[https://github.com/apache/arrow/blob/master/docs/source/python/development.rst]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)