Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-08 Thread oren bochman
-Original Message-
From: wikitech-l-boun...@lists.wikimedia.org 
[mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Brion Vibber
Sent: Thursday, March 7, 2013 9:59 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Indexing non-text content in LuceneSearch

On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler  wrote:
> 1) create a specialized XML dump that contains the text generated by
> getTextForSearchIndex() instead of actual page content.

That probably makes the most sense; alternately, make a dump that includes both 
"raw" data and "text for search". This also allows for indexing extra stuff for 
files -- such as extracted text from a PDF or DjVu or metadata from a JPEG -- 
if the dump process etc can produce appropriate indexable data.

> However, that only works
> if the dump is created using the PHP dumper. How are the regular dumps 
> currently generated on WMF infrastructure? Also, would it be feasible 
> to make an extra dump just for LuceneSearch (at least for wikidata.org)?

The dumps are indeed created via MediaWiki. I think Ariel or someone can 
comment with more detail on how it currently runs, it's been a while since I 
was in the thick of it.

> 2) We could re-implement the ContentHandler facility in Java, and 
> require extensions that define their own content types to provide a 
> Java based handler in addition to the PHP one. That seems like a 
> pretty massive undertaking of dubious value. But it would allow maximum 
> control over what is indexed how.

No don't do it :)

> 3) The indexer code (without plugins) should not know about Wikibase, 
> but it may have hard coded knowledge about JSON. It could have a 
> special indexing mode for JSON, in which the structure is deserialized 
> and traversed, and any values are added to the index (while the keys 
> used in the structure would be ignored). We may still be indexing 
> useless internals from the JSON, but at least there would be a lot fewer false 
> negatives.

Indexing structured data could be awesome -- again I think of file metadata as 
well as wikidata-style stuff. But I'm not sure how easy that'll be. Should 
probably be in addition to the text indexing, rather than replacing.

-- brion

I agree with Brion.

Here are my five shekels' worth.

To index non-MW dumps with LuceneSearch I would:
1. modify the daemon to read the custom dump format, or update the XML dump to 
support a JSON dump (it uses the MWDumper codebase to do this now).
2. add a Lucene analyzer to handle the new data type, say a JSON analyzer.
3. add a Lucene document per JSON-based Wikidata schema (a rough sketch of 
steps 2-3 follows below).
4. update the query parser to handle the new queries and the modified Lucene 
documents.
5. for bonus points, modify spelling correction and write a Wikidata ranking 
algorithm.
But this would only cover reading static dumps used to bootstrap the index; I 
would then also have to change how MWSearch periodically polls Brion's 
OAIRepository to pull in updated pages.
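
Roughly, what I have in mind for steps 2-3 is to deserialize the JSON, walk it, 
and throw the string values (not the keys) into one analyzed field. A quick, 
untested sketch using org.json and the Lucene 2.x/3.x Field API; class and 
field names are my own illustration, not actual LSearch code:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public class JsonTextExtractor {
    // Collect leaf string values, ignoring the keys themselves.
    static void collectValues(Object node, StringBuilder out) {
        if (node instanceof JSONObject) {
            JSONObject obj = (JSONObject) node;
            for (java.util.Iterator<?> it = obj.keys(); it.hasNext();) {
                collectValues(obj.opt((String) it.next()), out);
            }
        } else if (node instanceof JSONArray) {
            JSONArray arr = (JSONArray) node;
            for (int i = 0; i < arr.length(); i++) {
                collectValues(arr.opt(i), out);
            }
        } else if (node instanceof String) {
            out.append((String) node).append(' ');
        }
    }

    // Build a Lucene document whose "text" field holds the flattened JSON values.
    static Document makeDoc(String title, String jsonBlob) throws JSONException {
        StringBuilder text = new StringBuilder();
        collectValues(new JSONObject(jsonBlob), text);
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", text.toString(), Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}

Since org.json deserializes the \uXXXX escapes, the values come out as real 
characters before analysis, which also addresses the escaping problem Daniel 
mentioned.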

I've been coding some analytics from MW dumps of WMF/Wikia wikis for a research 
project, and I can say this:
1. Most big dumps (e.g. full-history dumps) inherit the issues of wikitext, 
namely unescaped tags and entities, which crash modern Java XML libraries - so 
escape your data and validate the XML!
2. The good old SAX code in MWDumper still works fine - so use it (a minimal 
SAX sketch follows below).
3. Use Lucene 2.4 with the deprecated old APIs.
4. Ariel is doing a great job (e.g. the 7z compression and the splitting of the 
dumps), but these are things MWDumper does not handle yet.
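
For reference, the kind of SAX handling I mean in point 2 is nothing more than 
this. Element names follow the MediaWiki export schema; the handler class 
itself is my own illustration, not MWDumper code:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DumpPageHandler extends DefaultHandler {
    private final StringBuilder buf = new StringBuilder();
    private String title;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        buf.setLength(0); // collect character data per element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buf.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("title".equals(qName)) {
            title = buf.toString();
        } else if ("text".equals(qName)) {
            // hand the page off to the indexer here instead of printing
            System.out.println("page: " + title + " (" + buf.length() + " chars)");
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new java.io.File(args[0]), new DumpPageHandler());
    }
}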

Finally, based on my work with the i18n team on TranslateWiki search, indexing 
JSON data with Solr + Solarium requires no search engine coding at all. You 
define the document schema, then use Solarium to push JSON and to fetch results.
I could do a demo of how to do this at an upcoming hackathon if there is any 
interest; however, when I offered to replace LuceneSearch like this last 
October, the idea was rejected out of hand.
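
Solarium is PHP, so there is no Java to show for that exact setup, but just to 
illustrate how little code the "push documents, then query" flow needs, here is 
roughly the same thing with the SolrJ 4.x client; the core URL and field names 
are made up for the example and assume a matching Solr schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrPushDemo {
    public static void main(String[] args) throws Exception {
        // Point at a core whose schema defines the fields used below.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/wikidata");

        // Push one document: an item id plus its label and alias text.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "Q42");
        doc.addField("label_en", "Douglas Adams");
        doc.addField("aliases_en", "Douglas Noel Adams");
        solr.add(doc);
        solr.commit();

        // Query it back.
        System.out.println(solr.query(new SolrQuery("label_en:adams")).getResults());
    }
}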

-- oren

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l



Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Munagala Ramanath
(1) seems like the right way to go to me too.

There may be other ways, but puppet/files/lucene/lucene.jobs.sh has a
function called import-db() which creates a dump like this:

   php $MWinstall/common/multiversion/MWScript.php dumpBackup.php $dbname --current > $dumpfile

Ram


On Thu, Mar 7, 2013 at 1:05 PM, Daniel Kinzler  wrote:

> On 07.03.2013 20:58, Brion Vibber wrote:
> >> 3) The indexer code (without plugins) should not know about Wikibase,
> but it may
> >> have hard coded knowledge about JSON. It could have a special indexing
> mode for
> >> JSON, in which the structure is deserialized and traversed, and any
> values are
> >> added to the index (while the keys used in the structure would be
> ignored). We
> >> may still be indexing useless internals from the JSON, but at least there
> would be
> >> a lot fewer false negatives.
> >
> > Indexing structured data could be awesome -- again I think of file
> > metadata as well as wikidata-style stuff. But I'm not sure how easy
> > that'll be. Should probably be in addition to the text indexing,
> > rather than replacing.
>
> Indeed, but option 3 is about *blindly* indexing *JSON*. We definitely want
> indexed structured data; the question is just how to get that into the
> LSearch
> infrastructure.
>
> -- daniel
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Daniel Kinzler
On 07.03.2013 20:58, Brion Vibber wrote:
>> 3) The indexer code (without plugins) should not know about Wikibase, but it 
>> may
>> have hard coded knowledge about JSON. It could have a special indexing mode 
>> for
>> JSON, in which the structure is deserialized and traversed, and any values 
>> are
>> added to the index (while the keys used in the structure would be ignored). 
>> We
>> may still be indexing useless internals from the JSON, but at least there 
>> would be
>> a lot fewer false negatives.
> 
> Indexing structured data could be awesome -- again I think of file
> metadata as well as wikidata-style stuff. But I'm not sure how easy
> that'll be. Should probably be in addition to the text indexing,
> rather than replacing.

Indeed, but option 3 is about *blindly* indexing *JSON*. We definitely want
indexed structured data; the question is just how to get that into the LSearch
infrastructure.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Brion Vibber
On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler  wrote:
> 1) create a specialized XML dump that contains the text generated by
> getTextForSearchIndex() instead of actual page content.

That probably makes the most sense; alternately, make a dump that
includes both "raw" data and "text for search". This also allows for
indexing extra stuff for files -- such as extracted text from a PDF or
DjVu or metadata from a JPEG -- if the dump process etc can produce
appropriate indexable data.

> However, that only works
> if the dump is created using the PHP dumper. How are the regular dumps 
> currently
> generated on WMF infrastructure? Also, would it be feasible to make an extra
> dump just for LuceneSearch (at least for wikidata.org)?

The dumps are indeed created via MediaWiki. I think Ariel or someone
can comment with more detail on how it currently runs, it's been a
while since I was in the thick of it.

> 2) We could re-implement the ContentHandler facility in Java, and require
> extensions that define their own content types to provide a Java based handler
> in addition to the PHP one. That seems like a pretty massive undertaking of
> dubious value. But it would allow maximum control over what is indexed how.

No don't do it :)

> 3) The indexer code (without plugins) should not know about Wikibase, but it 
> may
> have hard coded knowledge about JSON. It could have a special indexing mode 
> for
> JSON, in which the structure is deserialized and traversed, and any values are
> added to the index (while the keys used in the structure would be ignored). We
may still be indexing useless internals from the JSON, but at least there would 
> be
> a lot fewer false negatives.

Indexing structured data could be awesome -- again I think of file
metadata as well as wikidata-style stuff. But I'm not sure how easy
that'll be. Should probably be in addition to the text indexing,
rather than replacing.


-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Daniel Kinzler
Hi all!

I would like to ask for your input on the question of how non-wikitext content can
be indexed by LuceneSearch.

Background is the fact that full text search (Special:Search) is nearly useless
on wikidata.org at the moment, see
.

The reason for the problem appears to be that when rebuilding a Lucene index
from scratch, using an XML dump of wikidata.org, the raw JSON structure used by
Wikibase gets indexed. The indexer is blind: it just takes whatever "text" it
finds in the dump. Indexing JSON does not work at all for full-text search,
especially not when non-ASCII characters are represented as Unicode escape
sequences.

Inside MediaWiki, in PHP, this works like this:

* wikidata.org (or rather, the Wikibase extension) stores non-text content in
wiki pages, using a ContentHandler that manages a JSON structure.
* Wikibase's EntityContent class implements Content::getTextForSearchIndex() so
it returns the labels and aliases of an entity. Data items thus get indexed by
their labels and aliases.
* getTextForSearchIndex() is used by the default MySQL search to build an index.
It's also (ab)used by things that can only operate on flat text, like the
AbuseFilter extension.
* The LuceneSearch index gets updated live using the OAI extension, which in
turn knows to use getTextForSearchIndex() to get the text for indexing.

So, for anything indexed live, this works, but for rebuilding the search index
from a dump, it doesn't - because the Java indexer knows nothing about content
types, and has no interface for an extension to register additional content 
types.


To improve this, I can think of a few options:

1) create a specialized XML dump that contains the text generated by
getTextForSearchIndex() instead of actual page content. However, that only works
if the dump is created using the PHP dumper. How are the regular dumps currently
generated on WMF infrastructure? Also, would it be feasible to make an extra
dump just for LuceneSearch (at least for wikidata.org)?

2) We could re-implement the ContentHandler facility in Java, and require
extensions that define their own content types to provide a Java based handler
in addition to the PHP one. That seems like a pretty massive undertaking of
dubious value. But it would allow maximum control over what is indexed how.

3) The indexer code (without plugins) should not know about Wikibase, but it may
have hard coded knowledge about JSON. It could have a special indexing mode for
JSON, in which the structure is deserialized and traversed, and any values are
added to the index (while the keys used in the structure would be ignored). We
may still be indexing useless internals from the JSON, but at least there would be
a lot fewer false negatives.


I personally would prefer 1) if dumps are created with PHP, and 3) otherwise. 2)
looks nice, but it would be hard to keep the Java and the PHP versions from diverging.

So, how would you fix this?

thanks
daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l