Are you going to use the values stored in Solr to display the data in HTML? For 
searching purposes I suggest stripping all the HTML tags and indexing only the 
plain text. For this you could use the HTMLStripCharFilterFactory char filter, 
which will "clean" your content at analysis time and pass on only the actual 
text, which is in the end what you're going to search on. 
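
As a rough sketch of what I mean, a field type along these lines would strip 
the markup before tokenizing. The name "text_html_stripped" and the choice of 
tokenizer and lowercase filter are only assumptions for illustration, so adapt 
them to your own schema.xml: 

```xml
<!-- Hypothetical field type: the name and the tokenizer/filter choices are assumptions -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Removes HTML markup such as <br /> before the text reaches the tokenizer -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercasing lets a query for HeLLO match HELLO -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keep in mind that a char filter only changes what gets indexed. The stored 
value that Solr returns in results is still whatever you sent in. 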

If you are going to use the Solr result to display the content in an HTML page, 
then I would suggest keeping your index clean and indexing only the actual 
searchable text, with no HTML. I actually use the recommended filter to strip 
HTML out of crawled HTML pages. Also, what does a Solr document mean to you? Is 
an entire conversation modeled as one Solr document? Have you considered 
putting each conversation interaction in a separate document? 
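
To illustrate the one-document-per-interaction idea, the first rows of your 
transcript could be sent roughly like this. The id scheme (filename plus row 
number) and the extra "file" field are my own assumptions, and I am assuming 
that t, st, et and p are declared as numeric field types in the schema so that 
range queries work on them: 

```xml
<?xml version="1.0"?>
<add>
  <!-- One document per transcript row; the ids and the "file" field are hypothetical -->
  <doc>
    <field name="id">01.cn_0</field>
    <field name="file">01.cn</field>
    <field name="t">0</field>
    <field name="st">0.00</field>
    <field name="et">1.54</field>
    <field name="w">_SILENCE_</field>
    <field name="p">0.000000</field>
    <field name="c">0</field>
  </doc>
  <doc>
    <field name="id">01.cn_1</field>
    <field name="file">01.cn</field>
    <field name="t">1</field>
    <field name="st">1.54</field>
    <field name="et">1.54</field>
    <!-- The literal <s> token has to be escaped inside XML -->
    <field name="w">&lt;s&gt;</field>
    <field name="p">1</field>
    <field name="c">0</field>
  </doc>
  <doc>
    <field name="id">01.cn_2</field>
    <field name="file">01.cn</field>
    <field name="t">2</field>
    <field name="st">1.54</field>
    <field name="et">1.57</field>
    <field name="w">HELLO</field>
    <field name="p">1</field>
    <field name="c">0</field>
  </doc>
</add>
```

With documents shaped like this, a query such as w:HELLO AND et:[0 TO 1.57] 
only matches rows where the word and the time belong together. If you need 
files rather than rows in the result, you could group on the "file" field 
(group=true&group.field=file). 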


----- Original Message -----
From: "tomas.kalas" <kala...@email.cz>
To: solr-user@lucene.apache.org
Sent: Thursday, October 30, 2014 10:27:50 AM
Subject: Design optimal Solr Schema

Hello, I have a problem with designing a schema in Solr. I have a transcript of a
telephone conversation in this format. I parse it into individual fields. I
have this schema:

<?xml version="1.0"?>
<add>
<doc>
<field name="id">01.cn</field>
<field name="t">0<br /> 1<br /> 2<br /> 2 <br /> 3 <br /> ....</field>
<field name="st">0.00<br /> 1.54<br /> 1.54<br /> 1.54 <br /> 1.57 <br />
....</field>
<field name="et">1.54<br /> 1.54<br /> 1.57<br /> 1.57 <br /> 1.7 <br />
....</field>
<field name="w">_SILENCE_<br /> <s><br /> HELLO<br /> HALLO <br /> _DELETE_
<br /> ....</field>
<field name="p">0.000000<br /> 1<br /> 1<br /> 2.06115e-009 <br /> 1 <br />
....</field>
<field name="c">0<br /> 0<br /> 0<br /> 0 <br /> 0 <br /> ....</field>
</doc>
</add>

I display it in an HTML document, and therefore I used the <br />.

This is the original document:

T=0 ST=0.00 ET=1.54 W=_SILENCE_ P=0.000000 C=0
T=1 ST=1.54 ET=1.54 W=<s> P=1 C=0
T=2 ST=1.54 ET=1.57 W=HELLO P=1 C=0
T=2 ST=1.54 ET=1.57 W=HALLO P=2.06115e-009 C=0
T=3 ST=1.57 ET=1.70 W=_DELETE_ P=1 C=0
T=3 ST=1.57 ET=1.70 W=NO P=2.06115e-009 C=0
T=4 ST=1.70 ET=2.12 W=HOW P=1 C=0
T=5 ST=2.12 ET=2.18 W=ARE_ P=0.25 C=0
T=5 ST=2.12 ET=2.18 W=_DELETE_ P=0.25 C=0
..........................................
..........................................

Id - filename
T = Segment
ST = Start time
ET = End time
W = Word
P = Probability
C = Channel

I want to search, for example, for a word that occurs up to time 1.57:
(w:HeLLO) AND (t:[0 TO 1.57]). But if I have all the data for each field
(t, st, et ...) in one value, then it doesn't work. It also finds all the files
where "hello" occurs at a later time than 1.57.

Do you have any ideas how to do this? Thanks a lot for your help.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Design-optimal-Solr-Schema-tp4166632.html
Sent from the Solr - User mailing list archive at Nabble.com.
