[Wikitech-l] null revisions

2010-12-02 Thread Dmitriy Sintsov
Hi!
From looking at the DB schema, I cannot find an efficient way of getting the
list of null revisions, or the opposite (a list without null revisions), with
LIMIT paging (for a custom API). When I GROUP, then ORDER and LIMIT, the query
is extremely slow.
It seems that I have to use a very inefficient GROUP BY rev_text_id (MySQL
also does not offer FIRST / LAST aggregate functions), and there
is no index on rev_text_id by default :-( I wish there was a field like
rev_minor_edit but for detecting null revisions, such as those
generated by XML import / export. They confuse the logic of my wiki
synchronization script. However, even if I were able to persuade the
developers to include these features in the schema, 1.15, which my customers
use, was already released some time ago anyway :-( So probably the core patch is
the only efficient way to solve my problem?
Dmitriy
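
One way to picture the detection described above without GROUP BY is a single ordered pass over the rows. The following is only a rough Python sketch over in-memory tuples, not MediaWiki code. It assumes that rows are sorted by (rev_page, rev_id), that rev_id order matches edit order (which may not hold after imports), and that a "null revision" means "same rev_text_id as the immediately preceding revision of the same page". The `Rev` type and the function name are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Rev(NamedTuple):
    """Hypothetical, reduced view of a row of the revision table."""
    rev_id: int
    rev_page: int
    rev_text_id: int


def null_revision_ids(revs: Iterable[Rev]) -> Iterator[int]:
    """Yield rev_id of every revision that reuses the text of the revision
    immediately before it on the same page.

    Assumes `revs` is already sorted by (rev_page, rev_id).
    """
    prev_page: Optional[int] = None
    prev_text: Optional[int] = None
    for rev in revs:
        if rev.rev_page == prev_page and rev.rev_text_id == prev_text:
            yield rev.rev_id
        prev_page, prev_text = rev.rev_page, rev.rev_text_id


rows = [
    Rev(1, 10, 100),
    Rev(2, 10, 100),  # same text as rev 1 on the same page
    Rev(3, 10, 101),
    Rev(4, 11, 101),  # same text id, but a different page
    Rev(5, 10, 100),  # reuses an older blob, not the immediately preceding one
]
rows.sort(key=lambda r: (r.rev_page, r.rev_id))
null_ids = list(null_revision_ids(rows))
print(null_ids)
```

Under these assumptions only revision 2 is reported. Revision 5 reuses an older text blob and is deliberately not flagged by this heuristic.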

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] null revisions

2010-12-02 Thread Bryan Tong Minh
On Thu, Dec 2, 2010 at 6:23 PM, Dmitriy Sintsov wrote:
> So probably the core patch is
> the only efficient way to solve my problem?
>
You can always supply a database patch with your extension to add
indices you need to core tables.


Bryan



Re: [Wikitech-l] null revisions

2010-12-02 Thread Dmitriy Sintsov
* Bryan Tong Minh [Thu, 2 Dec 2010 19:38:47 +0100]:
> On Thu, Dec 2, 2010 at 6:23 PM, Dmitriy Sintsov wrote:
> > So probably the core patch is
> > the only efficient way to solve my problem?
> >
> You can always supply a database patch with your extension to add
> indices you need to core tables.
>
Indices are not hard to add, that's true. However, even with indexes the
GROUP BY rev_text_id query on a large revision set is slow. I will probably
have to patch Revision::newNullRevision to set a new field value
there (existing rows can be filled in with an
UPDATE, but new null revisions will keep being created).
Dmitriy
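
The "new field plus UPDATE" idea above can be sketched as follows. This is an illustration only, run against an in-memory SQLite table from Python, not against a real MediaWiki database. The `rev_null` column, the index name and the `page_of_non_null` helper are hypothetical, and the backfill uses the simplification "same rev_text_id as the previous revision of the same page". MySQL may reject an UPDATE whose subquery reads the table being updated, so the statement would likely need a different formulation there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (
    rev_id      INTEGER PRIMARY KEY,
    rev_page    INTEGER NOT NULL,
    rev_text_id INTEGER NOT NULL,
    rev_null    INTEGER NOT NULL DEFAULT 0  -- hypothetical flag column
);
CREATE INDEX rev_page_id ON revision (rev_page, rev_id);
INSERT INTO revision (rev_id, rev_page, rev_text_id) VALUES
    (1, 10, 100), (2, 10, 100), (3, 10, 101), (4, 11, 101), (5, 10, 101);
""")

# One-off backfill: flag a row when its text id equals the text id of the
# previous revision of the same page. For the first revision of a page the
# subquery yields NULL, so the comparison is not true and the row is skipped.
conn.execute("""
UPDATE revision
SET rev_null = 1
WHERE rev_text_id = (
    SELECT p.rev_text_id
    FROM revision AS p
    WHERE p.rev_page = revision.rev_page
      AND p.rev_id < revision.rev_id
    ORDER BY p.rev_id DESC
    LIMIT 1
)
""")

flagged = [row[0] for row in conn.execute(
    "SELECT rev_id FROM revision WHERE rev_null = 1 ORDER BY rev_id")]


def page_of_non_null(db, after_rev_id, limit):
    """Return the next `limit` unflagged rev_ids after `after_rev_id`."""
    cur = db.execute(
        "SELECT rev_id FROM revision "
        "WHERE rev_null = 0 AND rev_id > ? ORDER BY rev_id LIMIT ?",
        (after_rev_id, limit),
    )
    return [row[0] for row in cur]


first_page = page_of_non_null(conn, 0, 2)
second_page = page_of_non_null(conn, first_page[-1], 2)
```

With the flag in place, LIMIT paging becomes a plain range scan on rev_id with no grouping. New null revisions would still have to set the flag at creation time, which is the part that needs the core patch.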



Re: [Wikitech-l] null revisions

2010-12-02 Thread Brion Vibber
On Thu, Dec 2, 2010 at 10:43 AM, Dmitriy Sintsov wrote:

> Indices are not hard to add, that's true. However, even with indexes the
> GROUP BY rev_text_id query on a large revision set is slow. I will probably
> have to patch Revision::newNullRevision to set a new field value
> there (existing rows can be filled in with an
> UPDATE, but new null revisions will keep being created).
>

What is it that your system actually needs to be able to do this for? Is
there an issue with loading up the previous text items, or are you trying to
optimize storage on your end by not storing text twice when it happened to
use the same text blob on the origin site?

Beware that there's not anything that really distinguishes null revisions
from their predecessors, other than that they come later than the previous
ones. Note that it's also possible for the earlier revision to get deleted
while a later revision using the same text blob still remains.

The previously referenced text blob might also have originally come in in a
much older revision, not the immediately preceding one; this may be legit
for certain kinds of reverts, for instance.

-- brion


Re: [Wikitech-l] null revisions

2010-12-02 Thread Dmitriy Sintsov
* Brion Vibber [Thu, 2 Dec 2010 12:15:18 -0800]:
> What is it that your system actually needs to be able to do this for? Is
> there an issue with loading up the previous text items, or are you trying to
> optimize storage on your end by not storing text twice when it happened to
> use the same text blob on the origin site?
>
I am trying to synchronize the "recent changes" of two wiki sites via XML
chunks (consecutive groups of 10 revisions) created by WikiExporter. It mostly
works (although I still haven't checked everything thoroughly: what will
happen if a revision with an earlier timestamp is imported over a
revision with a later timestamp?). However, ImportReporter::reportPage
also creates an extra null revision for every imported page, for
"informational purposes" ("Imported by WikiSync" in my case).
Unfortunately, at the next synchronization run, such a revision becomes
a difference between the sites, and the synchronization reports that the
sites are not equal (even though there really were no changes, except for the
informational null revision).

> Beware that there's not anything that really distinguishes null revisions
> from their predecessors, other than that they come later than the previous
> ones. Note that it's also possible for the earlier revision to get deleted
> while a later revision using the same text blob still remains.
>
That's really bad for me - I should probably patch the deletion as well,
to clear the rev_null flag field in a null revision's row when its
non-null match on rev_text_id is deleted :-( That is too many patches to the
core, and I am not even sure that I can intercept all kinds of revision
deletion (I should check that).

With GROUP BY on a large set being slow and FIRST / LAST aggregates
unavailable, it would probably be easier for me just not to call
ImportReporter from my derived WikiImporter class? Informational null
revisions simply won't be created in that case. They are nice for the end
user, that's why I have tried to keep them.

> The previously referenced text blob might also have originally come in in a
> much older revision, not the immediately preceding one; this may be legit
> for certain kinds of reverts, for instance.
>
Thanks for the explanation.
Dmitriy



Re: [Wikitech-l] null revisions

2010-12-02 Thread Brion Vibber
On Thu, Dec 2, 2010 at 8:09 PM, Dmitriy Sintsov wrote:

> * Brion Vibber [Thu, 2 Dec 2010 12:15:18 -0800]:
> > What is it that your system actually needs to be able to do this for? Is
> > there an issue with loading up the previous text items, or are you trying to
> > optimize storage on your end by not storing text twice when it happened to
> > use the same text blob on the origin site?
> >
> I am trying to synchronize the "recent changes" of two wiki sites via XML
> chunks (consecutive groups of 10 revisions) created by WikiExporter. It mostly
> works (although I still haven't checked everything thoroughly: what will
> happen if a revision with an earlier timestamp is imported over a
> revision with a later timestamp?). However, ImportReporter::reportPage
> also creates an extra null revision for every imported page, for
> "informational purposes" ("Imported by WikiSync" in my case).
> Unfortunately, at the next synchronization run, such a revision becomes
> a difference between the sites, and the synchronization reports that the
> sites are not equal (even though there really were no changes, except for the
> informational null revision).
>

It sounds to me like what you need to do is recognize and skip your tool's
edits, not null edits generally.

If these are all created by a particular user account, for instance, that
should be pretty straightforward: compare the user ID value and skip the
revision.

-- brion
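
The "compare the user ID value and skip the revision" suggestion could look roughly like the Python sketch below. It is not WikiSync code. `SYNC_USER_ID` and the function name are hypothetical, and revisions are represented as plain dicts keyed by the revision table's column names.

```python
from typing import Dict, Iterable, List

# Hypothetical user ID of the dedicated synchronization account.
SYNC_USER_ID = 42


def revisions_to_compare(revs: Iterable[Dict], sync_user_id: int) -> List[Dict]:
    """Drop revisions made by the synchronization account, so the tool's own
    informational edits never count as a difference between the sites."""
    return [rev for rev in revs if rev["rev_user"] != sync_user_id]


local_revs = [
    {"rev_id": 1, "rev_user": 7, "rev_comment": "Fix typo"},
    {"rev_id": 2, "rev_user": SYNC_USER_ID, "rev_comment": "Imported by WikiSync"},
    {"rev_id": 3, "rev_user": 9, "rev_comment": "Add section"},
]
kept_ids = [rev["rev_id"] for rev in revisions_to_compare(local_revs, SYNC_USER_ID)]
print(kept_ids)
```

The filter only helps if the account is used for nothing but synchronization, since any ordinary edit made under it would be skipped as well.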


Re: [Wikitech-l] null revisions

2010-12-02 Thread Dmitriy Sintsov
* Brion Vibber [Thu, 2 Dec 2010 20:45:16 -0800]:
> It sounds to me like what you need to do is recognize and skip your tool's
> edits, not null edits generally.
>
> If these are all created by a particular user account, for instance, that
> should be pretty straightforward: compare the user ID value and skip the
> revision.
>
A good idea, I'll make a dedicated account name mandatory for synchronization.
That should probably work. However, is there any way to disable "interactive"
edits for a particular account while still allowing it to use Import /
Export and the API in general? I'll check whether denying the 'edit' action
for the synchronization account would still allow it to perform an "automatic"
import. Otherwise, one can only hope that the synchronization bot account will
not be misused for ordinary edits (which should not be skipped during
synchronization). Anyway, I can at least put such a warning on the extension
page.
Dmitriy
