1.  While experimenting with
sending in an XML document inside an XML CDATA section,
with termVectors="true",
 
I noticed the following behavior:
<person>peter</person>
collapses to the term
personpeterperson
instead of
person
and 
peter separately.
 
I realize I could do a search and replace of characters like
<>"=  with a space so that the default parser/indexer preserves element
names.
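
To make that workaround concrete, here is a rough sketch of what I mean,
done as plain Java preprocessing before the document is sent to Solr.
The class and method names are made up for illustration, and adding "/"
to the character list is my own assumption so that closing tags come out
clean; none of this is Solr or Lucene API.

```java
public class MarkupToSpaces {

    // Replace each of < > " = / with a space, then collapse runs of
    // whitespace so the result splits cleanly on single spaces.
    // (The "/" is an assumption; the other characters are the ones listed above.)
    public static String stripMarkupChars(String xml) {
        return xml.replaceAll("[<>\"=/]", " ")
                  .trim()
                  .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Element name and text now come out as separate whitespace-delimited words.
        System.out.println(stripMarkupChars("<person>peter</person>"));
    }
}
```

With this, the element name and its text reach the indexer as separate
words, at the cost of no longer telling opening tags from closing tags.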
However, I'm wondering if someone could point me to where one might do
this within
the Solr or Apache Lucene code as a proper plug-in, with maybe an example
that I could use
as a template.  Also, what in the solrconfig.xml file would I want to
change to reference the new parser?
 
2.  My other question is whether this technique would work for XML-type
messages embedded
in Microsoft Excel or PowerPoint documents, where I would like to
preserve the XML element term frequencies
and would try to leverage the component that automatically indexes
Microsoft documents.
Would I need to modify and customize that component?
 
-Peter
 
 
