Re: [translate-pootle] Super fast indexing in Pootle (with a quick appearance from Virtaal)

F Wolff Tue, 10 Mar 2009 08:06:54 -0700

On Tue, 2009-03-10 at 11:26 +0200, Wynand Winterbach wrote:
> Hi everyone
> 
> Or rather, hi programmers. This is a nice technical post for keen hackers.
> 
> Virtaal makes an appearance quite late in the show, but it's useful to 
> read the entire post to understand where I'm going with this.
>


Thanks, Wynand. I think there are some interesting ideas here.

My general feedback is that we should see where the bottlenecks are in
the really big projects where this might matter, and make sure we are
addressing the most important issues first. But that doesn't mean we
can't plan for the future.

More inline...

> **
> 
> In Pootle, you can filter information in lots of interesting ways. Here 
> are three scenarios:
> 
>    * Display all files in a goal,
>    * Go through all units which fail a particular quality check,
>    * Search for a string in the targets of all units in a directory.
> 
> Some of these can be combined, and in the future, it might be useful to 
> allow the user to combine all of these to search for data.
> 
> **
> 
> Today, we store our indexing information in three places:
> 
>    * Goals and assignments are stored in the Django database,
>    * Quality checks and stats are stored in the stats database,
>    * Text indices are stored in Xapian/Lucene.
> 
> This leads to inefficiencies:
> 
>    * When searching for a string within a goal, Pootle gets the list of 
> filename-unit pairs from the text indexer in which to search, but then, 
> for each filename, it has to hit the Django database to check whether 
> the filename is part of the current goal.
>    * When searching for all units that fail a quality check within a 
> goal, Pootle gets a list of filenames that fall within the goal, and 
> then, for each file, has to hit the stats database to check whether the 
> file contains any units that fail the current quality check.
> 
> **
> 
> How do we solve this? It depends which component we're focusing on.
> 
> DENORMALIZATION
> 
> The text indexing engine (i.e. Xapian, Lucene, etc.) doesn't need to be 
> 100% consistent with our data (we want consistency of course, but Pootle 
> won't break horribly if things are a bit out of sync).
> 
> Thus, we can duplicate stats, goal and assignment data into the text 
> indexing engine. This is very convenient from a search perspective, 
> since the user can do very complex searches which will only hit the text 
> indexing engine.

This is probably reasonably easy. We might be hampered by the fact that
lots of people are probably not installing the indexing engines yet,
because Pootle works without them. A big warning on startup might help to
change that behaviour :-)
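
To make the denormalization idea concrete, here is a minimal sketch in
plain Python. A toy in-memory index stands in for Xapian/Lucene (this is
not their API), and the class name, method names and "text:"/"goal:"/
"check:" term prefixes are made up for illustration. The point is only
that once goal and check data sit next to the text terms, a combined
query touches a single index.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index; a stand-in for a real text indexing engine."""

    def __init__(self):
        # term -> set of (filename, unit_index) pairs containing that term
        self.postings = defaultdict(set)

    def add_unit(self, filename, unit_index, target, goals=(), failed_checks=()):
        doc = (filename, unit_index)
        # Ordinary text terms from the target string.
        for word in target.lower().split():
            self.postings["text:" + word].add(doc)
        # Denormalized copies of data that would otherwise live elsewhere.
        for goal in goals:
            self.postings["goal:" + goal].add(doc)
        for check in failed_checks:
            self.postings["check:" + check].add(doc)

    def search(self, *terms):
        """Return the units that match every given term."""
        sets = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()


idx = TinyIndex()
idx.add_unit("gui.po", 0, "Open file", goals=["release"], failed_checks=["endpunc"])
idx.add_unit("gui.po", 1, "Close file", goals=["release"])
idx.add_unit("doc.po", 0, "Open file")

# String search restricted to a goal, answered by the index alone.
print(idx.search("text:open", "goal:release"))
```

The cost is that every goal or check change has to be pushed into the
index as well, which is the "a bit out of sync" risk mentioned above.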

> MERGING DATABASES
> 
> By storing stats information in the database used by Django, we can do 
> complicated stats queries directly in Django's database.
> 
> The current model where we have stats associated with individual 
> filenames breaks this model - it's not possible to do queries over 
> groups of files. It's also expensive, since we need to hit the stats 
> database multiple times to get stats information for multiple files.

We probably need to do this anyway at some stage to allow other database
systems to be used. The current system might make things slightly easier
for the admin of a small installation who also runs pocount from the
command line and might want to do that as part of scripts to update
stats, but that is probably not so important for big deployments where
we want to support MySQL, etc.
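
As a rough illustration of what "queries over groups of files" could
look like once stats live in one database, here is a sketch using
Python's sqlite3 module. The table and column names are hypothetical
and are not Pootle's actual stats schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical per-unit stats table; one row per unit.
conn.execute(
    "CREATE TABLE unit_stats ("
    "filename TEXT, unit_index INTEGER, translated INTEGER, source_words INTEGER)"
)
conn.executemany(
    "INSERT INTO unit_stats VALUES (?, ?, ?, ?)",
    [
        ("af/gui.po", 0, 1, 2),
        ("af/gui.po", 1, 0, 3),
        ("af/doc.po", 0, 1, 5),
        ("fr/gui.po", 0, 1, 2),
    ],
)


def directory_stats(conn, prefix):
    """Aggregate stats for every file under a directory in one query.

    Returns (total units, translated units, total source words).
    """
    cur = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(translated), 0), "
        "COALESCE(SUM(source_words), 0) "
        "FROM unit_stats WHERE substr(filename, 1, length(?)) = ?",
        (prefix, prefix),
    )
    return cur.fetchone()


print(directory_stats(conn, "af/"))  # (3, 2, 10)
```

One query replaces a separate lookup per file, which is the expense
described above.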


> CACHING TRANSLATION FILES IN THE DATABASE
> 
> If we go further and store parsed PO and XLIFF files in the database, we 
> can associate stats information directly with units.
> 
> Thus, we'd create a subclass of TranslationStore (in storage/base.py in 
> the Toolkit) which would store its units in the database. This would 
> allow our existing tools to operate on database-backed translation 
> stores as normal stores.
> 
> It also means that stats information for a unit will be associated using 
> foreign key relations.

I think we've discussed this on several occasions, and it will probably
happen at some stage anyway.  It probably won't be soon, though.
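
For reference, the shape of such a store might be something like the
sketch below. It is deliberately standalone: it does not subclass the
Toolkit's TranslationStore, and the class, table and method names are
all hypothetical. It only shows units living in a table, with failed
checks attached to them by a foreign key.

```python
import sqlite3


class DBStore:
    """Hypothetical database-backed store for the units of one file."""

    def __init__(self, conn, filename):
        self.conn = conn
        self.filename = filename
        conn.execute(
            "CREATE TABLE IF NOT EXISTS units ("
            "id INTEGER PRIMARY KEY, filename TEXT, source TEXT, target TEXT)"
        )
        # Each failed check points at its unit through a foreign key.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS failed_checks ("
            "unit_id INTEGER REFERENCES units(id), check_name TEXT)"
        )

    def add_unit(self, source, target, failed_checks=()):
        cur = self.conn.execute(
            "INSERT INTO units (filename, source, target) VALUES (?, ?, ?)",
            (self.filename, source, target),
        )
        unit_id = cur.lastrowid
        self.conn.executemany(
            "INSERT INTO failed_checks VALUES (?, ?)",
            [(unit_id, name) for name in failed_checks],
        )
        return unit_id

    def sources_failing(self, check_name):
        """Source strings of this file's units that fail the given check."""
        cur = self.conn.execute(
            "SELECT u.source FROM units u "
            "JOIN failed_checks c ON c.unit_id = u.id "
            "WHERE u.filename = ? AND c.check_name = ? ORDER BY u.id",
            (self.filename, check_name),
        )
        return [row[0] for row in cur]


store = DBStore(sqlite3.connect(":memory:"), "gui.po")
store.add_unit("Open", "Maak oop")
store.add_unit("Save.", "Stoor", failed_checks=["endpunc"])
print(store.sources_failing("endpunc"))  # ['Save.']
```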


> **
> 
> BINDING EVERYTHING TOGETHER NEATLY
> 
> We'd have to design a query API that is centered on the database design 
> (something that takes the above ideas into account). This API should 
> allow complicated queries including:
> 
>    * String searches,
>    * Filtering by goal and assignments,
>    * Filtering by quality checks.
> 
> The API should also provide unit update services. If a unit is updated, 
> it should update the text indexing engine with the quality check 
> information as well as the text content. If goal or assignment data is 
> changed, the text indexing engine should also be updated.
> 
> KEEP AGGREGATION IN MIND
> 
> Since the API is database-centered, it should make aggregated queries 
> easy and efficient. Thus, we need to stop thinking in terms of stats 
> that are associated with a single filename.


My idea for now is to see how we can just optimise the current queries
in the indexers (Lucene/Xapian) and see in what situation / at what
scale they actually become slow.

Doing set intersection on the two sets of files (the one obtained from
the goal, and the one obtained from the indexer) sounds simple, and
might be quick enough for many purposes - perhaps that is where we can
start.
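
The intersection itself is only a few lines. A sketch, assuming the
indexer hands back (filename, unit index) pairs and the goal is just a
collection of filenames fetched once (both shapes are assumptions, and
the function name is made up):

```python
def search_within_goal(index_hits, goal_filenames):
    """Keep only the indexer hits whose file belongs to the goal.

    index_hits: iterable of (filename, unit_index) pairs from the indexer.
    goal_filenames: iterable of filenames in the goal, fetched in one go.
    """
    index_hits = list(index_hits)
    hit_files = {filename for filename, _ in index_hits}
    # One set intersection instead of one database lookup per filename.
    wanted = hit_files & set(goal_filenames)
    return [(filename, unit) for filename, unit in index_hits if filename in wanted]


hits = [("a.po", 3), ("b.po", 1), ("a.po", 7), ("c.po", 0)]
goal = ["a.po", "c.po", "d.po"]
print(search_within_goal(hits, goal))  # [('a.po', 3), ('a.po', 7), ('c.po', 0)]
```

This still needs one query to fetch the goal's file list, but not one
per hit.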



> **
> 
> WHERE DOES VIRTAAL FIT IN?
> 
> If all of the above is implemented, Virtaal could always directly use 
> database-backed translation stores to do its work. Virtaal would 
> instantly benefit from the fast indexing code which would make Pootle 
> fast. Complex queries on big files would be very fast. And Virtaal would 
> use much less memory when dealing with large files.
> 
> This would mean excellent re-use of code between Virtaal & Pootle.

This sounds nice, but the good performance you mention comes from
pre-calculating lots of things, and that cost is paid at startup time.
We need to look carefully at our trade-offs between startup speed and
query speed. Of course, there are ways to get both, but that is just a
little bit more work.


Thanks for the nice ideas.

Friedel



--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/video-virtaals-functionality


