Hi,

I'm looking for some direction on where I should focus my attention with
regard to the Solr codebase and documentation.  Rather than write a ton of
stuff no one wants to read, I'll just start with a use-case.  For context,
the data originates from Nutch crawls and is indexed into Solr.

Imagine a web page has the following content (4 occurrences of "Johnson"
are bolded):

--content_--
Lorem ipsum dolor *Johnson* sit amet, consectetur adipiscing elit. Aenean id
urna et justo fringilla dictum *johnson* in at tortor. Nulla eu nulla magna,
nec sodales est. Sed *johnSon* sed elit non lorem sagittis fermentum. Mauris
a arcu et sem sagittis rhoncus vel malesuada *Johnsons* mi. Morbi eget
ligula nisi. Ut fringilla ullamcorper sem.
--_content--

*First*; I would like to have the entire "content" block be indexed within
Solr.  This is done and definitely not an issue.

*Second* (+); during the injection of crawl data into Solr, I would like to
grab every occurrence of a specific word or phrase, with "Johnson" being my
example for the above.  I want to take every such occurrence (without
collision), as well as its unique context, and inject that into its own,
separate Solr index.  For example, the above "content" block, having been
indexed in its entirety, would also be the source of 4 additional indexes.
In each index, "Johnson" would only appear once.  All of the text before and
after "Johnson" would be *bounded by* any other occurrence of "Johnson," e.g.:

--index1_--
Lorem ipsum dolor *Johnson* sit amet, consectetur adipiscing elit. Aenean id
urna et justo fringilla dictum
--_index1-- --index2_--
sit amet, consectetur adipiscing elit. Aenean id urna et justo fringilla
dictum *johnson* in at tortor. Nulla eu nulla magna, nec sodales est. Sed
--_index2-- --index3_--
in at tortor. Nulla eu nulla magna, nec sodales est. Sed *johnSon* sed elit
non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus vel
malesuada
--_index3-- --index4_--
sed elit non lorem sagittis fermentum. Mauris a arcu et sem sagittis rhoncus
vel malesuada *Johnsons* mi. Morbi eget ligula nisi. Ut fringilla
ullamcorper sem.
--_index4--
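To make #2 concrete, here is a rough sketch (plain Python, entirely outside
Solr) of the windowing I have in mind.  The function name is made up, and
the matching is naively case-insensitive; it's only meant to illustrate the
"bounded by adjacent occurrences" rule from the example above:

```python
import re

def context_windows(content, term):
    """Split content into one window per occurrence of term (matched
    case-insensitively).  Each window runs from the end of the previous
    match to the start of the next match, so adjacent occurrences bound
    one another and the term appears exactly once per window.  This is
    a sketch of the idea, not anything Solr provides out of the box."""
    matches = list(re.finditer(term, content, re.IGNORECASE))
    windows = []
    for i, m in enumerate(matches):
        start = matches[i - 1].end() if i > 0 else 0
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        windows.append(content[start:end].strip())
    return windows
```

My guess is the real work would live wherever the Nutch-to-Solr indexing
happens, or in something like an update request processor, which is part of
what I'm asking about.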

Q:
How much of this is feasible in "present-day Solr," and how much of it do I
need to produce in a patch of my own?  Can anyone give me some direction on
where I should look in approaching this problem (i.e., libs / classes /
confs)?  I sincerely appreciate it.

*Third*; I would later like to go through the above child indexes and
dismiss any that appear within a given context.  For example, I may deem
"ipsum dolor *Johnson* sit amet" as not being useful, and I'd want to delete
any indexes matching that particular phrase-context.  The deletion itself is
trivial and, with the 2nd item resolved, this becomes a non-issue.
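For what it's worth, the deletion side of #3 is just Solr's stock
delete-by-query, which is why I call it trivial.  A sketch of building the
payload (the "context" field name is hypothetical; the result gets POSTed
to the core's /update handler, followed by a commit):

```python
from xml.sax.saxutils import escape

def delete_by_phrase(field, phrase):
    """Build a Solr XML delete-by-query payload that removes every
    document whose field matches the quoted phrase.  The field name
    here is my invention; escaping is minimal (a phrase containing a
    double-quote would need additional handling)."""
    return '<delete><query>%s:"%s"</query></delete>' % (field, escape(phrase))
```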

Q:
The question, more or less, comes from the fact that my source data is from
a web crawler.  When I recrawl, I need to repeat the process of dismissing
phrase-contexts that are not relevant to me.  Where is the best place to
perform this work?  I could easily perform queries after indexing my crawl,
but that seems needlessly intensive.  I think the answer will be "wherever I
implement #2," but assumptions can be painfully expensive.


Thank you for reading my bloated e-mail.  Again, I'm mostly just looking to
be pointed at various pieces of the Lucene / Solr codebase, and am trolling
for any insight that people might share.

Scott Gonyea
