Fwd: Strip html

Michael Della Bitta Thu, 31 May 2012 12:34:27 -0700

If I'm not mistaken, that's TEI, and I suggest you consult with the
TEI community for strategies for document indexing, as there are a lot
of branching-style tags in TEI. My guess is that you'll hear that it's
best to perform some sort of term expansion on the document as a
preprocessing step.
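
As a rough illustration of what such a preprocessing step could look like, here is a minimal Python sketch. It is an assumption-laden example, not a TEI community recommendation: it assumes the elements carry no XML namespace, it omits the `dbp:hand` attribute from the sample because that prefix is not declared in the fragment, and the names `flatten` and `index_text` are made up for this example. For each `<choice>` it keeps a single reading (`<reg>` by default) and discards the whitespace inside `<choice>`, so that "c" and "astors" end up joined.

```python
import re
import xml.etree.ElementTree as ET

# Simplified copy of the fragment from the original message
# (the dbp:hand attribute is left out because its prefix is undeclared).
SAMPLE = """<item type="fragment" n="3">
  <cit>
    si les <hi rend="underline">ruches d’<term>abeilles</term>
    </hi> prouvent la monarchie, les fourmillières, les troupes d’éléphants ou de <lb/>
    <choice>
      <orig>C</orig>
      <reg>c</reg>
    </choice>astors prouvent la république.
    <bibl xml:id="b-7468-3"/>
  </cit>
</item>"""


def flatten(elem, prefer="reg"):
    """Concatenate the text under elem, keeping one reading per <choice>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            # Keep only the preferred reading; fall back to the first child.
            chosen = child.find(prefer)
            if chosen is None and len(child):
                chosen = child[0]
            if chosen is not None:
                parts.append(flatten(chosen, prefer))
        else:
            parts.append(flatten(child, prefer))
        # Text that follows the child's closing tag belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


def index_text(xml_source, prefer="reg"):
    """Return whitespace-normalized plain text suitable for indexing."""
    root = ET.fromstring(xml_source)
    return re.sub(r"\s+", " ", flatten(root, prefer)).strip()


if __name__ == "__main__":
    print(index_text(SAMPLE))
```

Calling `index_text(SAMPLE, prefer="orig")` yields the original spelling instead, so both readings could be indexed (for example in two fields) if expansion is wanted.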


Michael Della Bitta

------------------------------------------------
Appinions, Inc. -- Where Influence Isn’t a Game.
http://www.appinions.com





-----Original Message-----
From: Tigunn
Sent: Thursday, May 31, 2012 11:30 AM
To: solr-user@lucene.apache.org
Subject: Strip html


Hello,
I have a full-text index built from XML files.
Example:
---------------------------------------
<item type="fragment" n="3">
                          <cit dbp:hand="GF-encre">

si les <hi rend="underline">ruches d’<term>abeilles</term>
                                   </hi> prouvent la
                 monarchie, les fourmillières, les troupes d’éléphants ou
de <lb/>
                                   <choice>
                                       <orig>C</orig>
                                       <reg>c</reg>
                                   </choice>astors prouvent la
république.

                              <bibl xml:id="b-7468-3"/>
                          </cit>
                      </item>
---------------------------------------
I use Solr 1.4.1 for full-text search from PHP. When I search for "castor",
this document is not found. But if I search for "c astor", it is found. That
is the problem.
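
The split can be reproduced outside Solr with a small Python sketch. This is only a simulation under stated assumptions, not Solr's actual HTMLStripCharFilterFactory or StandardTokenizerFactory: `strip_tags` and `tokens` are hypothetical helpers that remove anything tag-shaped and split on non-word characters. Even when tags are removed without inserting any space, the line break between `</reg>` and `</choice>` already separates "c" from "astors".

```python
import re

# The relevant part of the fragment, with the markup still in place.
FRAGMENT = """de <lb/>
<choice>
    <orig>C</orig>
    <reg>c</reg>
</choice>astors prouvent la république."""


def strip_tags(markup):
    """Remove anything that looks like a tag, leaving the other text as is."""
    return re.sub(r"<[^>]+>", "", markup)


def tokens(text):
    """Lowercase and split on non-word characters (a crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


if __name__ == "__main__":
    print(tokens(strip_tags(FRAGMENT)))
    # ['de', 'c', 'c', 'astors', 'prouvent', 'la', 'république']
```

No token "castors" is produced, which is consistent with "c astor" matching while "castor" does not. Separately, matching the singular "castor" against "castors" would need stemming, which the field type shown in this message does not configure.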

I apply an XSLT transformation, which returns:
---------------------------------------
si les ruches d’abeilles prouvent la
                monarchie, les fourmillières, les troupes d’éléphants ou
de castors prouvent la république.
---------------------------------------
I put this output into Solr:  $doc->addField('body_strip_html', $body_norm);

In schema.xml:
<fieldType name="text_strip_html" class="solr.TextField"
positionIncrementGap="100">
      <analyzer>
              <charFilter class="solr.HTMLStripCharFilterFactory"/>
              <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
  </fieldType>

AND

 <field name="body_strip_html" type="text_strip_html" indexed="true"
stored="true"/>


But this doesn't work!
I want this XML file (see the example) to be returned when I search for "castor".

Can you help me, please?
Thanks.


--
View this message in context:
http://lucene.472066.n3.nabble.com/Strip-html-tp3987051.html
Sent from the Solr - User mailing list archive at Nabble.com.
