[Wikitech-l] Non-linear search in an XML dump

2018-09-03 Thread Bináris
Hi,

As far as I understand, pages in an XML dump appear in the order of their
original creation.
This does not correspond to the page ID: if a page gets a new ID after
deletion and restoration, a rename to that title, or anything similar, the
order still remains the original.
But this sort key itself is not stored. In other words, a dump is not sorted
by any key one could find in the dump, and it behaves as an unsorted
structure.

Is this true? Can I use any non-linear (e.g. binary) search in a dump?

-- 
Bináris
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Non-linear search in an XML dump

2018-09-03 Thread Jaime Crespo
Not that this is off-topic here, but you will probably find more
knowledgeable people and a quicker response on the specialized
list https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

-- 
Jaime Crespo


Re: [Wikitech-l] Non-linear search in an XML dump

2018-09-03 Thread Daniel Kinzler
If I read the code in WikiExporter.php correctly,
dumps are currently ordered by page ID.

However, I would not consider this a guarantee.
I'd recommend assuming that the contents of a dump are in no particular order,
and that the order is subject to change without notice.
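
If no ordering can be relied on, a lookup has to read the dump from start to
end, or build its own index in one pass and search that instead. The
following is only a minimal sketch of both options in Python, not code taken
from MediaWiki. The function names (iter_pages, find_page_id, build_index) and
the tiny SAMPLE dump are made up for illustration, and the sketch assumes each
<page> element has <title> and <id> as direct children.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; note that the page IDs are deliberately unordered.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>C</title><ns>0</ns><id>42</id>
    <revision><id>900</id><text>third</text></revision></page>
  <page><title>A</title><ns>0</ns><id>7</id>
    <revision><id>901</id><text>first</text></revision></page>
  <page><title>B</title><ns>0</ns><id>19</id>
    <revision><id>902</id><text>second</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_id, title) for each <page>, in whatever order the dump has."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Only direct children are inspected, so the revision's own <id>
        # is not mistaken for the page ID.
        fields = {local_name(child.tag): child.text for child in elem}
        yield int(fields["id"]), fields["title"]
        elem.clear()  # drop the finished page's children and text


def find_page_id(source, title):
    """Linear scan: return the ID of the first page with this title, or None."""
    for page_id, page_title in iter_pages(source):
        if page_title == title:
            return page_id
    return None


def build_index(source):
    """One full pass that maps title -> page ID for later constant-time lookups."""
    return {title: page_id for page_id, title in iter_pages(source)}


if __name__ == "__main__":
    print(find_page_id(io.BytesIO(SAMPLE), "B"))
    print(build_index(io.BytesIO(SAMPLE)))
```

The scan makes no assumption about ordering, so it keeps working if the order
changes. For repeated lookups, the index costs one pass up front and can then
be sorted or hashed on whatever key is needed.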

-- daniel

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
