On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler <dan...@brightbyte.de> wrote:
> 1) create a specialized XML dump that contains the text generated by
> getTextForSearchIndex() instead of actual page content.

That probably makes the most sense; alternately, make a dump that
includes both "raw" data and "text for search". This also allows for
indexing extra stuff for files -- such as extracted text from a PDF or
DjVu or metadata from a JPEG -- if the dump process etc can produce
appropriate indexable data.

> However, that only works
> if the dump is created using the PHP dumper. How are the regular dumps currently
> generated on WMF infrastructure? Also, would it be feasible to make an extra
> dump just for LuceneSearch (at least for wikidata.org)?

The dumps are indeed created via MediaWiki. I think Ariel or someone
can comment with more detail on how it currently runs; it's been a
while since I was in the thick of it.

> 2) We could re-implement the ContentHandler facility in Java, and require
> extensions that define their own content types to provide a Java based handler
> in addition to the PHP one. That seems like a pretty massive undertaking of
> dubious value. But it would allow maximum control over what is indexed how.

Nooooo don't do it :)

> 3) The indexer code (without plugins) should not know about Wikibase, but it may
> have hard coded knowledge about JSON. It could have a special indexing mode for
> JSON, in which the structure is deserialized and traversed, and any values are
> added to the index (while the keys used in the structure would be ignored). We
> may still be indexing useless interna from the JSON, but at least there would be
> a lot fewer false negatives.

Indexing structured data could be awesome -- again I think of file
metadata as well as wikidata-style stuff. But I'm not sure how easy
that'll be. Should probably be in addition to the text indexing,
rather than replacing.
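
For illustration, the value-only traversal described in option 3 could
look roughly like the sketch below. It is in Python for brevity, not the
Java of the actual indexer, and the function names and the sample blob
are hypothetical; the blob is not the real Wikibase serialization.

```python
import json


def iter_values(node):
    """Yield scalar leaf values from decoded JSON; dict keys are never yielded."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_values(item)
    elif isinstance(node, bool) or node is None:
        # Assumption: booleans and nulls carry no searchable text, so skip them.
        return
    elif isinstance(node, (str, int, float)):
        yield node


def text_for_index(raw_json):
    """Flatten a JSON document into one whitespace-joined string of its values."""
    return " ".join(str(value) for value in iter_values(json.loads(raw_json)))


# Hypothetical sample data, made up for this sketch.
SAMPLE = """
{
  "label": {"en": "Berlin", "de": "Berlin"},
  "aliases": {"en": ["Berlin, Germany"]},
  "population": 3500000,
  "deleted": false
}
"""

if __name__ == "__main__":
    print(text_for_index(SAMPLE))
```

The flattened string would then be tokenized like any other page text,
which keeps structural keys such as "label" or "aliases" out of the
index while still making the values searchable.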


-- brion

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
