Hi, I'm looking for some direction on where I should focus my attention with regard to the Solr codebase and documentation. Rather than write a ton of stuff no one wants to read, I'll just start with a use case. For context, the data originates from Nutch crawls and is indexed into Solr.
Imagine a web page has the following content (4 occurrences of "Johnson" are bolded):

--content_--
Lorem ipsum dolor *Johnson* sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum *johnson* in at tortor. Nulla eu nulla magna, nec sodales est. Sed *johnSon* sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada *Johnsons* mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem.
--_content--

*First*; I would like the entire "content" block to be indexed within Solr. This is done and is definitely not an issue.

*Second* (+); during the injection of crawl data into Solr, I would like to grab every occurrence of a specific word or phrase, with "Johnson" being my example for the above. I want to take every such occurrence (without collision), along with its unique context, and index it as its own, separate Solr document. For example, the "content" block above, having been indexed in its entirety, would also be the source of 4 additional documents. In each document, "Johnson" would appear only once. All of the text before and after "Johnson" would be bounded by any other occurrence of "Johnson." e.g.:

--index1_--
Lorem ipsum dolor *Johnson* sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum
--_index1--

--index2_--
sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla dictum *johnson* in at tortor. Nulla eu nulla magna, nec sodales est. Sed
--_index2--

--index3_--
in at tortor. Nulla eu nulla magna, nec sodales est. Sed *johnSon* sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada
--_index3--

--index4_--
sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada *Johnsons* mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem.
--_index4--

Q: How much of this is feasible in present-day Solr, and how much of it do I need to produce in a patch of my own?
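For what it's worth, Solr won't split a field into per-occurrence documents out of the box; the usual hook for custom index-time logic is a custom UpdateRequestProcessor wired into an update chain in solrconfig.xml (or, since the data comes from Nutch, a Nutch IndexingFilter). The bounding rule itself is straightforward to express in plain Java. Below is a standalone sketch, not Solr code; the regex (the term plus optional trailing letters, so "Johnsons" counts as a hit, as in the example) is my assumption from the sample text:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextSplitter {

    /**
     * Returns one snippet per occurrence of term. Each snippet contains exactly
     * one occurrence and runs from the end of the previous occurrence (or the
     * start of the text) to the start of the next occurrence (or the end of
     * the text) -- the bounding rule described in the email.
     */
    public static List<String> split(String text, String term) {
        // \w* is an assumption: it lets "Johnsons" count as a hit for "Johnson".
        Pattern p = Pattern.compile(Pattern.quote(term) + "\\w*",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);

        List<int[]> hits = new ArrayList<>();   // {start, end} of each occurrence
        while (m.find()) {
            hits.add(new int[] { m.start(), m.end() });
        }

        List<String> snippets = new ArrayList<>();
        for (int i = 0; i < hits.size(); i++) {
            int from = (i == 0) ? 0 : hits.get(i - 1)[1];            // end of previous hit
            int to = (i == hits.size() - 1) ? text.length()
                                            : hits.get(i + 1)[0];    // start of next hit
            snippets.add(text.substring(from, to).trim());
        }
        return snippets;
    }

    public static void main(String[] args) {
        String content = "Lorem ipsum dolor Johnson sit amet, consectetur adipiscing elit. "
                + "Aenean id urna et justo fringilla dictum johnson in at tortor. "
                + "Nulla eu nulla magna, nec sodales est. Sed johnSon sed elit non lorem "
                + "sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel malesuada "
                + "Johnsons mi. Morbi eget ligula nisi. Ut fringilla ullamcorper sem.";
        for (String s : split(content, "Johnson")) {
            System.out.println("--- " + s);
        }
    }
}
```

Run against the sample content, this produces the 4 bounded snippets shown above; in an UpdateRequestProcessor you would turn each one into a SolrInputDocument and pass it down the chain.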
Can anyone give me some direction on where I should look in approaching this problem (i.e., libs / classes / confs)? I sincerely appreciate it.

*Third*; I would later like to go through the above child documents and dismiss any that appear within a given context. For example, I may deem "ipsum dolor *Johnson* sit amet" as not being useful, and I'd want to delete any documents matching that particular phrase-context. The deletion itself is trivial and, with the 2nd item resolved, this becomes pretty much a non-issue.

Q: The question, more or less, comes from the fact that my source data comes from a web crawler. When a site is recrawled, I need to repeat the process of dismissing phrase-contexts that are not relevant to me. Where is the best place to perform this work? I could easily perform queries after indexing my crawl, but that seems needlessly intensive. I think the answer will be "wherever I implement #2", but assumptions can be painfully expensive.

Thank you for reading my bloated e-mail. Again, I'm mostly just looking to be pointed at various pieces of the Lucene / Solr codebase, and am trolling for any insight that people might share.

Scott Gonyea
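Your hunch is probably right: the cheapest place to handle recrawls is inside whatever implements #2, by filtering snippets against the set of dismissed phrase-contexts before they are ever indexed, rather than indexing and then deleting. A minimal sketch of that idea follows; how the blacklist is persisted (a file, a database, a separate Solr core you query) is left open, and a plain Set stands in for it here:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SnippetFilter {

    /**
     * Drops any snippet that contains a dismissed phrase-context, so a
     * recrawl never re-creates documents that were already ruled out.
     * Matching here is simple substring containment; a real implementation
     * might normalize whitespace and case first.
     */
    public static List<String> keepRelevant(List<String> snippets,
                                            Set<String> dismissed) {
        return snippets.stream()
                .filter(s -> dismissed.stream().noneMatch(s::contains))
                .collect(Collectors.toList());
    }
}
```

For child documents that are already in the index, the trivial deletion you mention maps onto SolrJ's SolrClient.deleteByQuery with a phrase query on the snippet field, e.g. content:"ipsum dolor Johnson sit amet" (field name assumed).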