[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748409#comment-16748409
 ] 

Areg Melik-Adamyan commented on ARROW-4313:
-------------------------------------------

 I think it will be easier if we keep it fairly simple in the beginning, so that 
we do not have to redo a lot in the future.

So, my replies to the original points:
 * Timestamp of benchmark run - *This is helpful, but we should be careful: you 
cannot rely on this timestamp, as there is no guarantee that system clocks are 
synchronized. For purely informational purposes it is fine.* 
 * Git commit hash of codebase 
 * Machine unique name (sort of the "user id") - *Machine ID and machine 
information should go to a different database. Machines can change, come, and go, 
so you do not want to keep that info tied to benchmarks.*
 * CPU identification for machine, and clock frequency (in case of overclocking)
 * CPU cache sizes (L1/L2/L3)
 * Whether or not CPU throttling is enabled (if it can be easily determined) - 
*For benchmarking you should always set it to max. Not fixing the governor will 
add unpredictable flakiness to the benchmarks. You also need to lock the 
machine while the benchmarks are running to prevent noise.* 
 * RAM size
 * GPU identification (if any)
 * Benchmark unique name - *For a start I would say yes, but it can quickly 
get out of control: you have e.g. TestFeatureA, then it gets flavors, like 
input size, and you start naming it TestFeatureA5GB, then TestFeatureA5GB-CPU, 
TestFeatureA5GB-GPU-Nvidia, TestFeatureA5GB-GPU-Radeon, and so on. The best known 
method to control this is a hierarchical name, or a unique id with a benchmark 
table, which is kind of overkill for now.*
 * Programming language(s) associated with benchmark (e.g. a benchmark
may involve both C++ and Python) - *Why would you need this? Maybe put it into 
the hierarchical name?*
 * Benchmark time, plus mean and standard deviation if available, else NULL
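
To make the points above concrete, here is a rough sketch of how the split could look. It is only an illustration under my own assumptions: SQLite stands in for whatever store we pick, the machine info is a separate table rather than a separate database, and all table, column, and function names (machine, benchmark_run, record_run, runs_under) are hypothetical. The benchmark name is a slash-separated hierarchical string, the timestamp is informational only, and mean/stddev stay NULL when not available.

```python
import sqlite3

# Hypothetical schema: machine details live apart from benchmark results,
# so a machine can change or disappear without rewriting result rows.
SCHEMA = """
CREATE TABLE machine (
    machine_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    cpu_model  TEXT,
    cpu_mhz    REAL,
    l1_kb      INTEGER,
    l2_kb      INTEGER,
    l3_kb      INTEGER,
    ram_bytes  INTEGER,
    gpu_model  TEXT              -- NULL when there is no GPU
);
CREATE TABLE benchmark_run (
    run_id     INTEGER PRIMARY KEY,
    machine_id INTEGER NOT NULL REFERENCES machine(machine_id),
    git_commit TEXT NOT NULL,
    run_at     TEXT,             -- informational only, clocks may not be synced
    name       TEXT NOT NULL,    -- hierarchical, e.g. 'cpp/TestFeatureA/5GB/CPU'
    elapsed_s  REAL NOT NULL,
    mean_s     REAL,             -- NULL if not available
    stddev_s   REAL              -- NULL if not available
);
"""


def open_db(path=":memory:"):
    """Open a database and create the sketch schema in it."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def record_run(conn, machine_name, git_commit, name, elapsed_s,
               mean_s=None, stddev_s=None, run_at=None):
    """Insert one result row, creating the machine row on first use."""
    conn.execute("INSERT OR IGNORE INTO machine (name) VALUES (?)",
                 (machine_name,))
    (machine_id,) = conn.execute(
        "SELECT machine_id FROM machine WHERE name = ?",
        (machine_name,)).fetchone()
    cur = conn.execute(
        "INSERT INTO benchmark_run "
        "(machine_id, git_commit, run_at, name, elapsed_s, mean_s, stddev_s) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (machine_id, git_commit, run_at, name, elapsed_s, mean_s, stddev_s))
    conn.commit()
    return cur.lastrowid


def runs_under(conn, prefix):
    """Return (name, elapsed_s) for a benchmark and everything below it."""
    return conn.execute(
        "SELECT name, elapsed_s FROM benchmark_run "
        "WHERE name = ? OR substr(name, 1, ?) = ? ORDER BY name",
        (prefix, len(prefix) + 1, prefix + "/")).fetchall()
```

With names like cpp/TestFeatureA/5GB/CPU, a call such as runs_under(conn, "cpp/TestFeatureA") picks up all flavors of one benchmark without a separate benchmark table, and the language falls out of the first path component.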

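On the governor point, a check along these lines could run before a benchmark session. This is a sketch that assumes Linux with the cpufreq sysfs interface, and it assumes that "set it to max" means the "performance" governor; the function names are hypothetical. It only reads the current setting and does not change it or lock the machine.

```python
import glob
import os


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (cpu0, cpu1, ...) to its current governor."""
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    governors = {}
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # .../cpuN/cpufreq/scaling_governor
        with open(path) as f:
            governors[cpu] = f.read().strip()
    return governors


def governor_is_fixed(governors, expected="performance"):
    """True only if at least one CPU was found and all use the expected governor."""
    return bool(governors) and all(g == expected for g in governors.values())
```

If governor_is_fixed(read_governors()) is false, the run could be refused, or the result stored with a flag so that noisy numbers are easy to filter out later.
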
> Define general benchmark database schema
> ----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.13.0
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
