On Thu, Mar 7, 2013 at 11:45 AM, Daniel Kinzler <dan...@brightbyte.de> wrote:
> 1) create a specialized XML dump that contains the text generated by
> getTextForSearchIndex() instead of actual page content.

That probably makes the most sense; alternatively, make a dump that includes
both "raw" data and "text for search". This also allows for indexing extra
stuff for files -- such as extracted text from a PDF or DjVu, or metadata from
a JPEG -- if the dump process etc. can produce appropriate indexable data.

> However, that only works
> if the dump is created using the PHP dumper. How are the regular dumps
> currently generated on WMF infrastructure? Also, would it be feasible to make
> an extra dump just for LuceneSearch (at least for wikidata.org)?

The dumps are indeed created via MediaWiki. I think Ariel or someone can
comment with more detail on how it currently runs; it's been a while since I
was in the thick of it.

> 2) We could re-implement the ContentHandler facility in Java, and require
> extensions that define their own content types to provide a Java based handler
> in addition to the PHP one. That seems like a pretty massive undertaking of
> dubious value. But it would allow maximum control over what is indexed how.

Nooooo don't do it :)

> 3) The indexer code (without plugins) should not know about Wikibase, but it
> may have hard-coded knowledge about JSON. It could have a special indexing
> mode for JSON, in which the structure is deserialized and traversed, and any
> values are added to the index (while the keys used in the structure would be
> ignored). We may still be indexing useless internals from the JSON, but at
> least there would be a lot fewer false negatives.

Indexing structured data could be awesome -- again I think of file metadata as
well as wikidata-style stuff. But I'm not sure how easy that'll be. It should
probably be in addition to the text indexing, rather than replacing it.

-- brion

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
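
P.S. To make option 3 above a bit more concrete (deserialize the JSON, walk the structure, index the values, ignore the keys), here is a rough sketch. It is only an illustration in Python, not the actual LuceneSearch indexer code; the function name extract_json_values and the sample data are made up for the example.

```python
import json


def extract_json_values(node):
    """Yield every scalar value in a deserialized JSON structure, skipping keys."""
    if isinstance(node, dict):
        # Keys are structural only, so descend into the values and drop the keys.
        for value in node.values():
            yield from extract_json_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from extract_json_values(item)
    elif node is not None:
        # Scalars (strings, numbers, booleans) become indexable text.
        yield str(node)


# Hypothetical sample, loosely shaped like a serialized entity (an assumption).
raw = '{"label": {"en": "Berlin"}, "aliases": {"en": ["Berlin, Germany"]}, "id": 64}'
text_for_index = " ".join(extract_json_values(json.loads(raw)))
```

The resulting string could be fed to the indexer alongside the regular text rather than instead of it. Note that internal values such as the numeric id still end up in the index, which is the "useless internals" trade-off mentioned in the quoted text.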