In the enterprise, most folks use either Java Mission Control or the Java
VisualVM profiler.  Looking at sleeping threads is often a good place to start,
and taking a thread snapshot or even a heap dump while things are really
grinding slowly is useful: you can later share those snapshots/heap dumps
with the community or with Java profiling experts to analyze.

https://visualvm.github.io/index.html
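
If you prefer to capture those dumps without attaching a GUI profiler, here is
a minimal sketch using only the standard JDK management APIs (the class name
and the /tmp path are placeholders, and it dumps the JVM it runs in; for a
production Blazegraph you would trigger the same MBeans remotely over JMX or
use the JDK command-line tools against the server process):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpHelper {
    public static void main(String[] args) throws Exception {
        // Thread dump of the current JVM (including lock information),
        // roughly what VisualVM's "Threads" view shows.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }

        // Heap dump of live objects, to a file that can be opened later
        // in VisualVM, Java Mission Control or Eclipse MAT.
        HotSpotDiagnosticMXBean hotspot = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        hotspot.dumpHeap("/tmp/blazegraph-heap.hprof", true); // example path
    }
}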

Thad
https://www.linkedin.com/in/thadguidry/


On Thu, Nov 14, 2019 at 1:46 PM Guillaume Lederrey <gleder...@wikimedia.org>
wrote:

> Hello!
>
> Thanks for the suggestions!
>
> On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry <thadgui...@gmail.com> wrote:
>
>> Is the Write Retention Queue adequate?
>> Is the branching factor for the lexicon indices too large, resulting in a
>> non-linear slowdown in the write rate over time?
>> Did you look into Small Slot Optimization?
>> Are the Write Cache Buffers adequate?
>> Is there a lot of Heap pressure?
>> Does the MemoryManager have the maximum amount of RAM it can handle?  4TB?
>> Is the RWStore handling the recycling well?
>> Is the SAIL Buffer Capacity adequate?
>> Are you not using exact range counts where you could be using fast range
>> counts?
>>
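>> Most of those knobs live in Blazegraph's journal configuration (the
>> RWStore.properties file). As a rough sketch of the kind of settings to
>> review; the values below are illustrative only, and the exact keys should
>> be double-checked against the IOOptimization and PerformanceOptimization
>> wiki pages linked further down:
>>
>> # B+Tree write retention queue (how many dirty nodes are held before eviction)
>> com.bigdata.btree.writeRetentionQueue.capacity=4000
>> # Default branching factor for B+Tree indices (lexicon indices may override it)
>> com.bigdata.btree.BTree.branchingFactor=128
>> # SAIL buffer capacity (statements buffered before an incremental flush)
>> com.bigdata.rdf.sail.bufferCapacity=100000
>> # RWStore small slot optimization threshold, in bytes (verify the key before use)
>> com.bigdata.rwstore.RWStore.smallSlotType=1024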
>>
>> Start at the Hardware side first, however.
>> Is the disk activity for writes really low while CPU is very high?  If so,
>> you have identified a bottleneck; then discover WHY that is the case by
>> looking into any of the above.
>>
>
> These sound like good questions, but they're outside my area of expertise. I've
> created https://phabricator.wikimedia.org/T238362 to track them, and I'll
> see if someone can have a look. I know that we have already done multiple
> passes at tuning Blazegraph properties, with limited success so far.
>
>
>> and 100+ other things that should be looked at, all of which affect WRITE
>> performance during UPDATES.
>>
>> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
>> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
>>
>> I would also suggest that you start monitoring some of the internals of
>> Blazegraph (Java) while in production, with tools such as XRebel or
>> AppDynamics.
>>
>
> Both XRebel and AppDynamics are proprietary, so there is no way we'll deploy
> them in our environment. We are tracking a few JMX-based metrics, but so
> far, we don't really know what to look for.
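>
> For what it's worth, a few generic JVM health signals (GC counts and time,
> heap occupancy) can be read with purely open tooling through the standard
> platform MXBeans. A minimal sketch, which only prints to stdout and inspects
> the JVM it runs in (in practice we would export the same MBeans from the
> Blazegraph process over JMX into our metrics pipeline):
>
> import java.lang.management.GarbageCollectorMXBean;
> import java.lang.management.ManagementFactory;
> import java.lang.management.MemoryMXBean;
> import java.lang.management.MemoryUsage;
>
> public class JvmSignals {
>     public static void main(String[] args) {
>         // Cumulative GC counts and time: a steadily growing share of time
>         // spent in GC is a common sign of heap pressure.
>         for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
>             System.out.printf("%s: count=%d time=%dms%n",
>                     gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
>         }
>
>         // Current heap occupancy vs. the configured maximum.
>         MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
>         MemoryUsage heap = mem.getHeapMemoryUsage();
>         System.out.printf("heap: used=%dMB max=%dMB%n",
>                 heap.getUsed() >> 20, heap.getMax() >> 20);
>     }
> }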
>
> Thanks!
>
>   Guillaume
>
> Thad
>> https://www.linkedin.com/in/thadguidry/
>>
>>
>> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <
>> gleder...@wikimedia.org> wrote:
>>
>>> Thanks for the feedback!
>>>
>>> On Thu, Nov 14, 2019 at 11:11 AM <f...@imm.dtu.dk> wrote:
>>>
>>>>
>>>> Besides waiting for the new updater, it may be useful to tell us what
>>>> we as users can do too. It is unclear to me what the problem is. For
>>>> instance, at one point I was worried that the many parallel requests to
>>>> the SPARQL endpoint that we make in Scholia were a problem. As far as I
>>>> understand, that is not a problem at all. Another issue could be the way
>>>> that we use Magnus Manske's QuickStatements and approve bots for
>>>> high-frequency editing. Perhaps a better overview of and constraints on
>>>> large-scale editing could be discussed?
>>>>
>>>
>>> To be (again) completely honest, we don't entirely understand the issue
>>> either. There are clearly multiple related issues. In high-level terms, we
>>> have at least:
>>>
>>> * Some part of the update process on Blazegraph is CPU-bound and
>>> single-threaded. Even with low query load, if we have a high edit rate,
>>> Blazegraph can't keep up and saturates a single CPU (with plenty of
>>> available resources on other CPUs). This is a hard issue to fix, requiring
>>> either splitting the processing over multiple CPUs or sharding the data
>>> over multiple servers, neither of which Blazegraph supports (at least not
>>> in our current configuration).
>>> * There is a race for resources between edits and queries: a high query
>>> load will impact the update rate. This could to some extent be mitigated by
>>> reducing the query load: if no one is using the service, it works great!
>>> Obviously that's not much of a solution.
>>>
>>> What you can do (short term):
>>>
>>> * Keep bot usage well behaved (don't run parallel queries, provide a
>>> meaningful user agent, smooth the load over time if possible, ...). As far
>>> as I can see, most usage is already well behaved. (A minimal sketch of a
>>> well-behaved client follows after this list.)
>>> * Optimize your queries: better queries will use fewer resources, which
>>> should help. Time to completion is a good approximation of the resources
>>> used. I don't really have any more specific advice; SPARQL is not my area
>>> of expertise.
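>>>
>>> As an illustration of what "well behaved" means in practice, here is a
>>> minimal, hedged sketch of a client that sets a descriptive User-Agent and
>>> spaces out sequential requests (the bot name, contact address and query
>>> are placeholders, not recommendations of specific values):
>>>
>>> import java.net.URI;
>>> import java.net.URLEncoder;
>>> import java.net.http.HttpClient;
>>> import java.net.http.HttpRequest;
>>> import java.net.http.HttpResponse;
>>> import java.nio.charset.StandardCharsets;
>>> import java.time.Duration;
>>> import java.util.List;
>>>
>>> public class PoliteWdqsClient {
>>>     public static void main(String[] args) throws Exception {
>>>         HttpClient client = HttpClient.newBuilder()
>>>                 .connectTimeout(Duration.ofSeconds(10))
>>>                 .build();
>>>
>>>         // Placeholder query; a real bot would take these from its own workload.
>>>         List<String> queries = List.of(
>>>                 "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 10");
>>>
>>>         for (String sparql : queries) {
>>>             HttpRequest request = HttpRequest.newBuilder()
>>>                     .uri(URI.create("https://query.wikidata.org/sparql?query="
>>>                             + URLEncoder.encode(sparql, StandardCharsets.UTF_8)))
>>>                     // Meaningful user agent with contact info (placeholder address).
>>>                     .header("User-Agent", "MyResearchBot/0.1 (someone@example.org)")
>>>                     .header("Accept", "application/sparql-results+json")
>>>                     .GET()
>>>                     .build();
>>>
>>>             HttpResponse<String> response =
>>>                     client.send(request, HttpResponse.BodyHandlers.ofString());
>>>             System.out.println(response.statusCode());
>>>
>>>             // Run queries one at a time and pause between them to smooth the load.
>>>             Thread.sleep(1000);
>>>         }
>>>     }
>>> }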
>>>
>>> What you can do (longer term):
>>>
>>> * Help us think out of the box. Can we identify higher-level use cases?
>>> Could we implement some of our workflows on a higher-level API than SPARQL,
>>> which might allow for more internal optimizations?
>>> * Help us better understand the constraints. Document use cases on [1].
>>>
>>> Sadly, we don't have the bandwidth right now to engage meaningfully in
>>> this conversation. Feel free to send thoughts already, but don't expect any
>>> timely response.
>>>
>>>> Yet another thought is the large discrepancy between the Virginia and Texas
>>>> data centers that I could see on Grafana [1]. As far as I understand, the
>>>> hardware (and software) are the same. So why is there this large
>>>> difference? Rather than editing or Blazegraph, could the issue be some
>>>> form of network problem?
>>>>
>>>
>>> As pointed out by Lucas, this is expected. Due to how our GeoDNS works,
>>> we see more traffic on eqiad than on codfw.
>>>
>>> Thanks for the help!
>>>
>>>    Guillaume
>>>
>>> [1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage
>>>
>>>
>>>
>>>>
>>>>
>>>> [1]
>>>>
>>>> https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&fullscreen&orgId=1&from=now-7d&to=now
>>>>
>>>> /Finn
>>>>
>>>>
>>>>
>>>> On 14/11/2019 10:50, Guillaume Lederrey wrote:
>>>> > Hello all!
>>>> >
>>>> > As you've probably noticed, the update lag on the public WDQS
>>>> endpoint
>>>> > [1] is not doing well [2], with lag climbing to > 12h for some
>>>> servers.
>>>> > We are tracking this on phabricator [3], subscribe to that task if
>>>> you
>>>> > want to stay informed.
>>>> >
>>>> > To be perfectly honest, we don't have a good short term solution. The
>>>> > graph database that we are using at the moment (Blazegraph [4]) does
>>>> not
>>>> > easily support sharding, so even throwing hardware at the problem
>>>> isn't
>>>> > really an option.
>>>> >
>>>> > We are working on a few medium term improvements:
>>>> >
>>>> > * A dedicated updater service in Blazegraph, which should help
>>>> increase
>>>> > the update throughput [5]. Fingers crossed, this should be ready for
>>>> > initial deployment and testing by next week (no promise, we're doing
>>>> the
>>>> > best we can).
>>>> > * Some improvement in the parallelism of the updater [6]. This has
>>>> just
>>>> > been identified. While it will probably also provide some improvement
>>>> in
>>>> > throughput, we haven't actually started working on that and we don't
>>>> > have any numbers at this point.
>>>> >
>>>> > Longer term:
>>>> >
>>>> > We are hiring a new team member to work on WDQS. It will take some
>>>> time
>>>> > to get this person up to speed, but we should have more capacity to
>>>> > address the deeper issues of WDQS by January.
>>>> >
>>>> > The 2 main points we want to address are:
>>>> >
>>>> > * Finding a triple store that scales better than our current solution.
>>>> > * Better understand what the use cases on WDQS are and see if we can
>>>> > provide a technical solution that is better suited. Our intuition is
>>>> > that some of the use cases that require synchronous (or quasi
>>>> > synchronous) updates would be better implemented outside of a triple
>>>> > store. Honestly, we have no idea yet if this makes sense and what
>>>> those
>>>> > alternate solutions might be.
>>>> >
>>>> > Thanks a lot for your patience during this tough time!
>>>> >
>>>> >     Guillaume
>>>> >
>>>> >
>>>> > [1] https://query.wikidata.org/
>>>> > [2]
>>>> >
>>>> https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1571131796906&to=1573723796906&var-cluster_name=wdqs&panelId=8&fullscreen
>>>> > [3] https://phabricator.wikimedia.org/T238229
>>>> > [4] https://blazegraph.com/
>>>> > [5] https://phabricator.wikimedia.org/T212826
>>>> > [6] https://phabricator.wikimedia.org/T238045
>>>> >
>>>> > --
>>>> > Guillaume Lederrey
>>>> > Engineering Manager, Search Platform
>>>> > Wikimedia Foundation
>>>> > UTC+1 / CET
>>>> >
>>>
>>>
>>> --
>>> Guillaume Lederrey
>>> Engineering Manager, Search Platform
>>> Wikimedia Foundation
>>> UTC+1 / CET
>
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+1 / CET
>
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
