Thanks Lyndon. For the benchmarking framework itself, can it take in any
dataset in CSV format, or only the specific dataset you've generated?

Also, I think if we want to standardize on a set of sample data for users,
the best way may be to host the data on ASF infrastructure. It does look like
ASF has a server to host files: https://nightlies.apache.org/. Thoughts?

Cheers,

Yang

On 2024/11/19 17:46:18 Lyndon Bauto wrote:
> I was planning on contributing the entire framework, which is set up so
> that you can add additional benchmarks to it much like adding a test in
> JUnit, so if people want to make new benchmarks on different things, it's
> easy. Not sure of the exact place it'd live; gremlin-tools seems reasonable
> though.
> 
> The dataset we used was basically an identity graph dataset (CSV format)
> that is generated at 350 GB, 3.5 TB, and 15 TB scales. Aerospike will host
> this data in a public GCP bucket and others are welcome to use it.
> 
> We will likely generate some other datasets and add to this over time. This
> is not something that TinkerPop would be on the hook for hosting or
> providing in any way.
> 
> -Lyndon
> 
> On Tue, Nov 19, 2024 at 8:33 AM Yang Xia <xia...@apache.org> wrote:
> 
> > Hi Lyndon,
> >
> > I also think this is something that can benefit users. I just have some
> > quick questions.
> >
> > Could you clarify what you plan to contribute to TinkerPop? Is it the
> > benchmarking framework, the dataset, or both? For the benchmarking
> > framework, are you looking to PR something into the gremlin-tools module?
> > For the dataset, what type of data does it have, and how is it generated?
> > Since there are quite a few benchmarking datasets out there already, I
> > feel like we've usually kept datasets external.
> >
> > Cheers,
> >
> > Yang
> >
> > On 2024/11/15 20:25:03 Lyndon Bauto wrote:
> > > Reviving this thread.
> > >
> > > > I think I have exposed a bottleneck in the Java driver. Not sure what
> > > > it is, but if I scale the client machine up to 128 cores and a 2:1
> > > > thread:core ratio, I get no additional performance over, say, a
> > > > 16-core machine. However, if I create additional JVMs running the
> > > > benchmark, I get additional performance. It's still unclear whether
> > > > this is the driver or maybe the JVM running the driver.
> > >
> > > > Anyway, I would like to move forward with getting this into
> > > > TinkerPop. The plan I have is to make the datasets we have generated
> > > > for it so far public; then anyone can load the dataset into any graph
> > > > provider's graph and run the benchmark. Additionally, we will look
> > > > into other datasets separately.
> > > >
> > > > Let me know if anyone has any concerns.
> > > -Lyndon
> > >
> > > On Thu, Jul 11, 2024 at 8:54 AM Ken Hu <kenhu...@gmail.com> wrote:
> > >
> > > > > I think this would be very useful for the 3.x line that uses
> > > > > WebSockets. There's difficulty in recommending what the best
> > > > > connection settings are to increase performance for different
> > > > > workloads, and an automated tool to discover that would be helpful
> > > > > to users. On a side note, a goal during the transition to HTTP
> > > > > should be to make the connection settings simpler so that it is
> > > > > easier to figure out what the settings should be for a specific
> > > > > workload.
> > > > >
> > > > > I feel like there are definitely some users that will benefit from a
> > > > > benchmarking tool like this.
> > > >
> > > > > On Wed, Jul 10, 2024 at 12:42 PM Lyndon Bauto
> > > > > <lba...@aerospike.com.invalid> wrote:
> > > >
> > > > > Right now the dataset and benchmarking setup is really simple.
> > > > >
> > > > > > It does some mergeV's, edge insertion, get vertex by id
> > > > > > (g.V(<id>).next()), and then some
> > > > > > g.V(<id>).out().out().out().out().
> > > > >
> > > > > > The idea is to get results for queries that require a decent bit
> > > > > > of processing, as well as quick lookup-and-return queries that
> > > > > > will allow us to test the driver when it's under a high-throughput
> > > > > > load that is highly concurrent. We could also add a query that
> > > > > > returns a lot of data without a lot of processing, so we could
> > > > > > test the driver under a scenario where a lot of data is coming
> > > > > > back.
> > > > >
> > > > > > This would help users identify what would be most beneficial for
> > > > > > their use case. For example, maybe few connections and many in
> > > > > > process per connection makes better use of resources when the data
> > > > > > returned is minimal but the number of queries running is very
> > > > > > high, while more connections with fewer in process per connection
> > > > > > might achieve better results when queries are returning more data.
> > > > >
> > > > > > Down the road, adding things like identity graph use cases, fraud
> > > > > > detection use cases, and others, with datasets included and
> > > > > > queries to benchmark, would be a great way for providers to opt
> > > > > > into providing benchmarks that are relevant to their target
> > > > > > customers, but that is a later thing.
> > > > >
> > > > > - Lyndon
> > > > >
> > > > > On Tue, Jul 9, 2024 at 4:14 PM Ken Hu <kenhu...@gmail.com> wrote:
> > > > >
> > > > > > Hey Lyndon,
> > > > > >
> > > > > > > This is a very interesting idea. You mentioned throughput
> > > > > > > testing, but how does this compare to other graph testing that
> > > > > > > uses specific generated datasets and specific queries? Asked
> > > > > > > another way, what kind of queries are you using to test in this
> > > > > > > system?
> > > > > >
> > > > > > Regards,
> > > > > > Ken
> > > > > >
> > > > > > > On Tue, Jul 9, 2024 at 2:00 PM Lyndon Bauto
> > > > > > > <lba...@aerospike.com.invalid> wrote:
> > > > > >
> > > > > > > Hi devs,
> > > > > > >
> > > > > > > > I've been working on a benchmarking framework for TinkerPop,
> > > > > > > > specifically the Java driver.
> > > > > > >
> > > > > > > > The idea is to have a benchmarking framework that a TinkerPop
> > > > > > > > user can point at their instance of gremlin-server (can be any
> > > > > > > > provider). This will allow them to fix some of their driver
> > > > > > > > configs while leaving others as variables. The framework will
> > > > > > > > then run through a bunch of different settings, recording
> > > > > > > > latency and throughput.
> > > > > > >
> > > > > > > > The output of the benchmarking framework would be guidance for
> > > > > > > > the user of the Java driver on the optimal configuration for
> > > > > > > > both latency and throughput, which they can then use to
> > > > > > > > optimize their workload outside the framework.
> > > > > > >
> > > > > > > > A provider could also use this to manually adjust
> > > > > > > > gremlinPool/threadPoolWorkers/etc. and run the framework under
> > > > > > > > different settings to optimize throughput and latency there as
> > > > > > > > well.
> > > > > > >
> > > > > > > > The benchmark is built on JMH and is built into a Docker
> > > > > > > > container so it is very easy to use. The configs are passed at
> > > > > > > > runtime, so a user would just call a docker build then docker
> > > > > > > > run script, with the configs set up in the docker config.
> > > > > > >
> > > > > > > > We could also add other benchmarks at any scale to the
> > > > > > > > framework that allow benchmark publishing from providers who
> > > > > > > > wish to participate.
> > > > > > >
> > > > > > > Anyone have any thoughts on this?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Lyndon
> > > > > > > --
> > > > > > >
> > > > > > > *Lyndon Bauto*
> > > > > > > *Senior Software Engineer*
> > > > > > > *Aerospike, Inc.*
> > > > > > > www.aerospike.com
> > > > > > > lba...@aerospike.com
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Lyndon Bauto*
> > > > > *Senior Software Engineer*
> > > > > *Aerospike, Inc.*
> > > > > www.aerospike.com
> > > > > lba...@aerospike.com
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > *Lyndon Bauto*
> > > *Senior Software Engineer*
> > > *Aerospike, Inc.*
> > > www.aerospike.com
> > > lba...@aerospike.com
> > >
> >
> 
> 
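
For readers who want something concrete, the sweep described in the original
proposal (fix some driver settings, vary others, and record latency and
throughput for each combination) might look roughly like the sketch below.
This is only an illustration under stated assumptions, not the framework
discussed in the thread: it uses plain Java rather than JMH, the names
ConfigSweep, Workload, RunResult, Measurement, and fakeWorkload are all
hypothetical, and the fake workload returns made-up numbers instead of timing
real requests against a Gremlin Server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: every name in this file is hypothetical, not part of TinkerPop.
final class ConfigSweep {

    /** Raw output of one benchmark run. */
    static final class RunResult {
        final long wallNanos;         // wall-clock time for the whole run
        final long[] latenciesNanos;  // one entry per completed request

        RunResult(long wallNanos, long[] latenciesNanos) {
            this.wallNanos = wallNanos;
            this.latenciesNanos = latenciesNanos;
        }
    }

    /** Assumed hook: run the benchmark once with the two swept settings. */
    interface Workload {
        RunResult run(int connections, int inProcessPerConnection);
    }

    /** One grid point together with the metrics derived from its run. */
    static final class Measurement {
        final int connections;
        final int inProcessPerConnection;
        final double throughputPerSec;
        final double meanLatencyMs;

        Measurement(int connections, int inProcessPerConnection,
                    double throughputPerSec, double meanLatencyMs) {
            this.connections = connections;
            this.inProcessPerConnection = inProcessPerConnection;
            this.throughputPerSec = throughputPerSec;
            this.meanLatencyMs = meanLatencyMs;
        }

        @Override
        public String toString() {
            return String.format("connections=%d inProcess=%d throughput=%.1f/s meanLatency=%.2fms",
                    connections, inProcessPerConnection, throughputPerSec, meanLatencyMs);
        }
    }

    /** Runs the workload once for every combination of the swept settings. */
    static List<Measurement> sweep(int[] connectionOptions, int[] inProcessOptions, Workload workload) {
        List<Measurement> results = new ArrayList<>();
        for (int c : connectionOptions) {
            for (int p : inProcessOptions) {
                RunResult r = workload.run(c, p);
                double seconds = r.wallNanos / 1e9;
                double throughput = r.latenciesNanos.length / seconds;
                double meanLatencyMs = Arrays.stream(r.latenciesNanos).average().orElse(0.0) / 1e6;
                results.add(new Measurement(c, p, throughput, meanLatencyMs));
            }
        }
        return results;
    }

    /** Grid point with the highest throughput. */
    static Measurement bestThroughput(List<Measurement> ms) {
        return ms.stream()
                .max(Comparator.comparingDouble((Measurement m) -> m.throughputPerSec))
                .orElseThrow(IllegalStateException::new);
    }

    /** Grid point with the lowest mean latency. */
    static Measurement bestLatency(List<Measurement> ms) {
        return ms.stream()
                .min(Comparator.comparingDouble((Measurement m) -> m.meanLatencyMs))
                .orElseThrow(IllegalStateException::new);
    }

    /** Deterministic stand-in workload; the cost model is invented for the demo. */
    static RunResult fakeWorkload(int connections, int inProcessPerConnection) {
        int requests = 1000;
        // Made-up model: each extra in-flight request adds 0.1 ms to a 1 ms base latency.
        long latency = 1_000_000L + 100_000L * (inProcessPerConnection - 1);
        long[] latencies = new long[requests];
        Arrays.fill(latencies, latency);
        long concurrency = (long) connections * inProcessPerConnection;
        long wallNanos = requests * latency / concurrency;
        return new RunResult(wallNanos, latencies);
    }

    public static void main(String[] args) {
        List<Measurement> ms = sweep(new int[] {2, 4}, new int[] {1, 8}, ConfigSweep::fakeWorkload);
        ms.forEach(System.out::println);
        System.out.println("best throughput: " + bestThroughput(ms));
        System.out.println("best latency:    " + bestLatency(ms));
    }
}
```

In a real run, the Workload would build a driver client with the given
settings, issue the queries, and time them. The two swept settings here mirror
the "connections" versus "in process per connection" trade-off raised earlier
in the thread; how they map onto actual driver options is left as an
assumption.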
