In the enterprise, most folks use either Java Mission Control or the Java VisualVM profiler. Looking at sleeping threads is often a good place to start, and taking a thread snapshot or even a heap dump when things really slow to a crawl is useful: you can share those snapshots/heap dumps later with the community or with Java profiling experts for analysis.
https://visualvm.github.io/index.html
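If you'd rather trigger those programmatically than from a GUI, something along these lines should work from inside the JVM, or pointed at a remote JMX connection instead of the local MBean server. This is only a minimal sketch using the standard platform MXBeans (the heap dump bean is HotSpot-specific); the output path and class name are just placeholders:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DumpHelper {
        public static void main(String[] args) throws Exception {
            // Heap dump of live objects only (live=true forces a GC first). HotSpot-specific bean.
            HotSpotDiagnosticMXBean hotspot = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            hotspot.dumpHeap("/tmp/blazegraph-heap.hprof", true); // placeholder path

            // Quick textual thread snapshot, e.g. to spot threads stuck in TIMED_WAITING or BLOCKED.
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                System.out.printf("%s -> %s%n", info.getThreadName(), info.getThreadState());
            }
        }
    }

Running "jcmd <pid> GC.heap_dump /tmp/blazegraph-heap.hprof" from a shell should get you much the same dump from outside the process, if attaching code is not an option. I've also put a rough sketch of the Blazegraph tuning properties I'm asking about below, after the quoted thread.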
Thad
https://www.linkedin.com/in/thadguidry/

On Thu, Nov 14, 2019 at 1:46 PM Guillaume Lederrey <gleder...@wikimedia.org> wrote:

> Hello!
>
> Thanks for the suggestions!
>
> On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry <thadgui...@gmail.com> wrote:
>
>> Is the Write Retention Queue adequate?
>> Is the branching factor for the lexicon indices too large, resulting in a non-linear slowdown in the write rate over time?
>> Did you look into Small Slot Optimization?
>> Are the Write Cache Buffers adequate?
>> Is there a lot of heap pressure?
>> Does the MemoryManager have the maximum amount of RAM it can handle (up to 4TB)?
>> Is the RWStore handling the recycling well?
>> Is the SAIL Buffer Capacity adequate?
>> Are you using exact range counts where you could be using fast range counts?
>>
>> Start at the hardware side first, however. Is the disk activity for writes really low while CPU is very high? In that case you have identified a bottleneck; dig into any of the above to discover WHY that would be.
>
> Sounds like good questions, but outside of my area of expertise. I've created https://phabricator.wikimedia.org/T238362 to track it, and I'll see if someone can have a look. I know that we did multiple passes at tuning Blazegraph properties, with limited success so far.
>
>> ...and 100+ other things that should be looked at, all of which affect WRITE performance during UPDATES.
>>
>> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
>> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
>>
>> I would also suggest you start monitoring some of the internals of Blazegraph (Java) in production with tools such as XRebel or AppDynamics.
>
> Both XRebel and AppDynamics are proprietary, so there is no way that we'll deploy them in our environment. We are tracking a few JMX based metrics, but so far we don't really know what to look for.
>
> Thanks!
>
> Guillaume
>
>> Thad
>> https://www.linkedin.com/in/thadguidry/
>>
>> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <gleder...@wikimedia.org> wrote:
>>
>>> Thanks for the feedback!
>>>
>>> On Thu, Nov 14, 2019 at 11:11 AM <f...@imm.dtu.dk> wrote:
>>>
>>>> Besides waiting for the new updater, it may be useful to tell us what we as users can do too. It is unclear to me what the problem is. For instance, at one point I was worried that the many parallel requests to the SPARQL endpoint that we make in Scholia are a problem. As far as I understand, they are not a problem at all. Another issue could be the way that we use Magnus Manske's QuickStatements and approve bots for high-frequency editing. Perhaps a better overview of, and constraints on, large-scale editing could be discussed?
>>>
>>> To be (again) completely honest, we don't entirely understand the issue either. There are clearly multiple related issues. In high-level terms, we have at least:
>>>
>>> * Some part of the update process on Blazegraph is CPU bound and single threaded. Even with low query load, if we have a high edit rate, Blazegraph can't keep up and saturates a single CPU (with plenty of available resources on other CPUs). This is a hard issue to fix, requiring either splitting the processing over multiple CPUs or sharding the data over multiple servers, neither of which Blazegraph supports (at least not in our current configuration).
>>> * There is a race for resources between edits and queries: a high query load will impact the update rate. This could to some extent be mitigated by reducing the query load: if no one is using the service, it works great! Obviously that's not much of a solution.
>>>
>>> What you can do (short term):
>>>
>>> * Keep bot usage well behaved (don't run parallel queries, provide a meaningful user agent, smooth the load over time if possible, ...). As far as I can see, most usage is already well behaved.
>>> * Optimize your queries: better queries will use fewer resources, which should help. Time to completion is a good approximation of the resources used. I don't really have any more specific advice; SPARQL is not my area of expertise.
>>>
>>> What you can do (longer term):
>>>
>>> * Help us think out of the box. Can we identify higher-level use cases? Could we implement some of our workflows on a higher-level API than SPARQL, which might allow for more internal optimizations?
>>> * Help us better understand the constraints. Document use cases on [1].
>>>
>>> Sadly, we don't have the bandwidth right now to engage meaningfully in this conversation. Feel free to send thoughts already, but don't expect any timely response.
>>>
>>>> Yet another thought is the large discrepancy between the Virginia and Texas data centers that I could see on Grafana [1]. As far as I understand, the hardware (and software) are the same. So why is there this large difference? Rather than editing or Blazegraph, could the issue be some form of network problem?
>>>
>>> As pointed out by Lucas, this is expected. Due to how our GeoDNS works, we see more traffic on eqiad than on codfw.
>>>
>>> Thanks for the help!
>>>
>>> Guillaume
>>>
>>> [1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage
>>>
>>>> [1] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&fullscreen&orgId=1&from=now-7d&to=now
>>>>
>>>> /Finn
>>>>
>>>> On 14/11/2019 10:50, Guillaume Lederrey wrote:
>>>> > Hello all!
>>>> >
>>>> > As you've probably noticed, the update lag on the public WDQS endpoint [1] is not doing well [2], with lag climbing to more than 12h for some servers. We are tracking this on Phabricator [3]; subscribe to that task if you want to stay informed.
>>>> >
>>>> > To be perfectly honest, we don't have a good short term solution. The graph database that we are using at the moment (Blazegraph [4]) does not easily support sharding, so even throwing hardware at the problem isn't really an option.
>>>> >
>>>> > We are working on a few medium term improvements:
>>>> >
>>>> > * A dedicated updater service in Blazegraph, which should help increase the update throughput [5]. Fingers crossed, this should be ready for initial deployment and testing by next week (no promises, we're doing the best we can).
>>>> > * Some improvements in the parallelism of the updater [6]. This has just been identified. While it will probably also provide some improvement in throughput, we haven't actually started working on it and we don't have any numbers at this point.
>>>> >
>>>> > Longer term:
>>>> >
>>>> > We are hiring a new team member to work on WDQS.
>>>> > It will take some time to get this person up to speed, but we should have more capacity to address the deeper issues of WDQS by January.
>>>> >
>>>> > The 2 main points we want to address are:
>>>> >
>>>> > * Finding a triple store that scales better than our current solution.
>>>> > * Better understanding what the use cases on WDQS are, and seeing if we can provide a technical solution that is better suited. Our intuition is that some of the use cases that require synchronous (or quasi-synchronous) updates would be better implemented outside of a triple store. Honestly, we have no idea yet if this makes sense or what those alternate solutions might be.
>>>> >
>>>> > Thanks a lot for your patience during this tough time!
>>>> >
>>>> > Guillaume
>>>> >
>>>> > [1] https://query.wikidata.org/
>>>> > [2] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1571131796906&to=1573723796906&var-cluster_name=wdqs&panelId=8&fullscreen
>>>> > [3] https://phabricator.wikimedia.org/T238229
>>>> > [4] https://blazegraph.com/
>>>> > [5] https://phabricator.wikimedia.org/T212826
>>>> > [6] https://phabricator.wikimedia.org/T238045
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+1 / CET
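Coming back to the tuning questions above and the IOOptimization / PerformanceOptimization pages: these are the kinds of knobs I mean. The sketch below only writes out a properties file; the key names are as I recall them from those wiki pages and the sample RWStore.properties, so please double-check each one against your deployment, and treat the values as purely illustrative rather than recommendations.

    import java.io.FileOutputStream;
    import java.util.Properties;

    public class TuningSketch {
        public static void main(String[] args) throws Exception {
            Properties p = new Properties();
            // Key names as I recall them from the Blazegraph wiki; verify before use. Values are illustrative only.
            p.setProperty("com.bigdata.journal.AbstractJournal.bufferMode", "DiskRW");
            p.setProperty("com.bigdata.btree.writeRetentionQueue.capacity", "4000");            // write retention queue
            p.setProperty("com.bigdata.btree.BTree.branchingFactor", "128");                    // index branching factor
            p.setProperty("com.bigdata.rdf.sail.bufferCapacity", "100000");                     // SAIL buffer capacity
            p.setProperty("com.bigdata.journal.AbstractJournal.writeCacheBufferCount", "2000"); // write cache buffers
            p.setProperty("com.bigdata.rwstore.RWStore.smallSlotType", "1024");                 // small slot optimization
            try (FileOutputStream out = new FileOutputStream("RWStore.properties")) {
                p.store(out, "Blazegraph tuning sketch -- verify every key before deploying");
            }
        }
    }

And for anyone wondering what "well behaved" bot usage looks like in practice (meaningful user agent, sequential rather than parallel queries, load smoothed over time), here is a minimal sketch against the public endpoint. The tool name, contact details, query, and one-second pause are placeholders; adjust them to your own bot.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.List;

    public class PoliteWdqsClient {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(10)).build();
            List<String> queries = List.of(
                    "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10");  // placeholder query
            for (String sparql : queries) {                                  // sequential, never parallel
                String url = "https://query.wikidata.org/sparql?query="
                        + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
                HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                        // A meaningful user agent with a way to contact the operator.
                        .header("User-Agent", "ExampleBot/0.1 (https://example.org/bot; mailto:bot-operator@example.org)")
                        .header("Accept", "application/sparql-results+json")
                        .build();
                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(response.statusCode());
                Thread.sleep(1000);                                          // smooth the load between requests
            }
        }
    }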
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata