Re: Highlighting Performance On Large Documents

Serdar Sahin Sat, 08 May 2010 17:05:43 -0700

Hi,

Thanks. However, as I said before, termOffsets/termPositions/termVectors had
very little effect on performance, and I don't know why. I have done
exactly what you suggest, but highlighting 10 documents of 200-400 A4 pages
each still takes around 2 seconds, depending on the query. I will experiment
with it more.


I actually want to highlight (and search) four fields, not one, if
possible. That is why I added those four fields to the highlighting field.
However, I want to store only a limited number of characters from the
plainText field, and I will use that copyField as an alternate text field as
well. If no highlighting match is found because of the character limit on
plainText, I can fall back to description; if that is not available, to
plainText; and if that is not available either (for example, for scanned
documents), to tags or title as a snippet/description to insert into the web
page. That is why I need a multivalued text field on both sides, for indexing
and for storing. Only the stored one will be a little different.
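
As a rough sketch of the fallback idea above (not a tested configuration):
Solr's hl.alternateField parameter names a stored field to return when no
snippet can be generated, and hl.maxAlternateFieldLength caps its length. It
takes a single field, so the longer description/plainText/tags/title chain
would presumably still have to be handled in client code. The handler name
"search" and the choice of short_text are assumptions for illustration.

```xml
<!-- solrconfig.xml: hypothetical handler name; all values are illustrative -->
<requestHandler name="search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <!-- field to build snippets from -->
    <str name="hl.fl">short_text</str>
    <str name="hl.snippets">2</str>
    <!-- returned instead when no snippet can be generated -->
    <str name="hl.alternateField">short_text</str>
    <str name="hl.maxAlternateFieldLength">256</str>
  </lst>
</requestHandler>
```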

I have also tried duplicating these four fields, using one copy of each for
indexing and the other for storing, to avoid indexing both copyFields (see my
previous email). The problem with the two copyFields was this:

       <field name="all_text" type="text" indexed="true" stored="false"
multiValued="true" />
       <field name="short_text" type="text" indexed="true" stored="true"
multiValued="true" termVectors="true" termPositions="true"
termOffsets="true" />

I had to index short_text even though I use it only for storing; otherwise
Solr gave an error. So I duplicated these four fields:

mydata.xml
        <field column="plainText" name="plain_text"/>
        <field column="plainText" name="plain_text_ind"/>

schema.xml
        <field name="plain_text_ind" type="text" indexed="true" stored="false"/>
        <field name="plain_text" type="text" indexed="false" stored="true"/>

I did this for all four fields and created copyFields for them:

       <field name="all_text" type="text" indexed="true" stored="false"
multiValued="true" />
       <field name="short_text" type="text" indexed="false" stored="true"
multiValued="true" termVectors="true" termPositions="true"
termOffsets="true" />

but it still did not work and gave the same error:

Caused by: java.lang.RuntimeException: SchemaField: short_text conflicting
indexed field options:

So I disabled the term* attributes just for testing and indexed the data
successfully, but there was no short_text field in the Solr index. Maybe
duplicating the fields does not work, or I have done something wrong.
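
A possible explanation, offered as an assumption rather than something
confirmed in this thread: termVectors, termPositions and termOffsets describe
indexed terms, so Solr rejects them on a field declared with indexed="false",
which would match the error above. Under that assumption, either of the two
alternative definitions sketched here (use only one) should load without the
conflict:

```xml
<!-- Alternative A: stored only, with no term* attributes.
     The highlighter then has to re-analyze the stored text at query time. -->
<field name="short_text" type="text" indexed="false" stored="true"
       multiValued="true" />

<!-- Alternative B: indexed and stored, so the term* attributes are allowed. -->
<field name="short_text" type="text" indexed="true" stored="true"
       multiValued="true" termVectors="true" termPositions="true"
       termOffsets="true" />
```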

So I guess my only option is to index the short_text field as well, without
duplicating anything:

       <field name="all_text" type="text" indexed="true" stored="false"
multiValued="true" />
       <field name="short_text" type="text" indexed="true" stored="true"
multiValued="true" termVectors="true" termPositions="true"
termOffsets="true" />
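
For completeness, a sketch of the copyField directives that would go with
these two fields, taken from the earlier message quoted below. The maxChars
value of 20000 is the one tested there, not a recommendation. Because maxChars
truncates the value copied into short_text, terms beyond that limit remain
searchable through all_text but cannot be highlighted from short_text.

```xml
<!-- Full text of all four sources goes to the search-only field. -->
<copyField source="title" dest="all_text" />
<copyField source="tags" dest="all_text" />
<copyField source="plainText" dest="all_text" />
<copyField source="description" dest="all_text" />

<!-- The highlighting field gets the same sources, with plainText truncated. -->
<copyField source="title" dest="short_text" />
<copyField source="tags" dest="short_text" />
<copyField source="plainText" dest="short_text" maxChars="20000" />
<copyField source="description" dest="short_text" />
```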

Thanks again,

Serdar

On Sun, May 9, 2010 at 9:24 AM, Lance Norskog <goks...@gmail.com> wrote:

> If you want to highlight field X, doing the
> termOffsets/termPositions/termVectors will make highlighting that
> field faster. You should make a separate field and apply these options
> to that field.
>
> Now: doing a copyfield adds a "value" to a multiValued field. For a
> text field, you get a multi-valued text field. You should only copy
> one value to the highlighted field, so just copyField the document to
> your special field. To enforce this, I would add multiValued="false"
> to that field, just to avoid mistakes.
>
> So, all_text should be indexed without the term* attributes, and
> should not be stored. Then your document stored in a separate field
> that you use for highlighting and has the term* attributes.
>
> In general, highlighting has been a problem area all along and there
> are little edge cases that I don't know how to solve.
>
> On Sat, May 8, 2010 at 7:23 AM, Serdar Sahin <anlamar...@gmail.com> wrote:
> > Hi,
> >
> > Thanks a lot for the replies. I had a chance to test them today.
> >
> > First of all, termVectors/termPositions/termOffsets did not help; they had
> > very little effect. I tried a workaround, but it is not as efficient as I
> > thought.
> >
> > From these fields;
> >
> >        <field name="title" type="text" indexed="true" stored="true"
> > required="true" omitNorms="true"/>
> >        <field name="description" type="text" indexed="true" stored="true"
> > />
> >        <field name="tags" type="text" indexed="true" stored="true"
> > omitNorms="true" />
> >        <field name="plainText" type="text" indexed="true"
> > stored="false"/>
> >
> > I tried to create a copyField:
> >        <field name="all_text" type="text" indexed="true" stored="true"
> > multiValued="true" termVectors="true" termPositions="true"
> > termOffsets="true" />
> >
> >        <copyField source="title" dest="all_text" />
> >        <copyField source="tags" dest="all_text" />
> >        <copyField source="plainText" dest="all_text"  maxChars="20000"/>
> >        <copyField source="description" dest="all_text" />
> >
> > And I have indexed 1000 documents that have more than 200 pages.
> >
> > However, the maxChars directive also limited the number of characters in
> > the indexed field. The query "Institute of Information Systems" gave 12
> > results. I also took unique words from the bottom of the text files and
> > searched for them; they gave no results. So I only wanted to limit the
> > size of the stored field, but it did not work. Then I tried to create two
> > copyFields:
> >
> >       <field name="all_text" type="text" indexed="true" stored="false"
> > multiValued="true" />
> >       <field name="short_text" type="text" indexed="true" stored="true"
> > multiValued="true" termVectors="true" termPositions="true"
> > termOffsets="true" />
> >
> >        <copyField source="title" dest="all_text" />
> >        <copyField source="tags" dest="all_text" />
> >        <copyField source="plainText" dest="all_text" />
> >        <copyField source="description" dest="all_text" />
> >
> >        <copyField source="title" dest="short_text" />
> >        <copyField source="tags" dest="short_text" />
> >        <copyField source="plainText" dest="short_text" maxChars="20000"/>
> >        <copyField source="description" dest="short_text" />
> >
> > It gave 168 results, as I expected, and highlighting also worked
> > reasonably fast. However, I don't know Solr/Lucene internals, but I guess
> > I am storing the same indexed content twice, which should affect
> > performance and storage. I tried to set it to false, but then I got this
> > error:
> >
> > Caused by: java.lang.RuntimeException: SchemaField: all_text conflicting
> > indexed field options:
> >
> > So, it was not possible.
> >
> > Then I tried hl.useFastVectorHighlighter with the latest version (and yes,
> > I have turned on termVectors, termPositions, termOffsets), but the result
> > was 2-2.5x slower, which was very strange. Do you have any idea why this
> > might have happened?
> >
> > > 1. Provide another field for highlighting and use copyField
> > > to copy plainText to the highlighting field. When using copyField,
> > > specify maxChars attribute to limit the length of the copy of plainText.
> > > This should work on Solr 1.4.
> >
> >
> > Do you mean I need to duplicate these four fields and use one of them for
> > storing and the other for indexing? Then I guess I can do:
> >
> >      <field name="all_text" type="text" indexed="true" stored="false"
> > multiValued="true" />
> >       <field name="short_text" type="text" indexed="false" stored="true"
> > multiValued="true" termVectors="true" termPositions="true"
> > termOffsets="true" />  (indexing false)
> >
> > Is this the only solution? What are the effects on index size and
> > performance? Could you give me some advice?
> >
> > Thanks,
> >
> > Serdar
> >
> >
> >
> > On Sat, May 8, 2010 at 1:00 PM, Lance Norskog <goks...@gmail.com> wrote:
> >
> >> Do you have these options turned on when you index the text field:
> >> termVectors/termPositions/termOffsets ?
> >>
> >> Highlighting needs the information created by these analysis options.
> >> If they are not turned on, Solr has to load the document text and run the
> >> analyzer again with these options on, use that data to create the
> >> highlighting, then throw away the reanalyzed data. Without these
> >> options, you are basically re-indexing the document when you highlight
> >> it.
> >>
> >>
> >> http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FFieldOptionsByUseCase
> >>
> >> On Wed, May 5, 2010 at 5:01 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
> >> > (10/05/05 22:08), Serdar Sahin wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Currently, there are similar topics active on the mailing list, but I
> >> >> did not want to hijack them.
> >> >>
> >> >> I have currently indexed 100,000 documents. They are Microsoft Office,
> >> >> PDF, etc. documents that I convert to TXT files before indexing. The
> >> >> files are between 1 and 500 pages. When I search for something, filter
> >> >> to retrieve documents that have more than 100 pages, and activate
> >> >> highlighting, it takes 0.8-3 seconds, depending on the query (10
> >> >> results per page). If I retrieve documents that have 1-5 pages, it
> >> >> drops to 0.1 seconds.
> >> >>
> >> >> If I disable highlighting, it drops to 0.1-0.2 seconds, even on the
> >> >> large documents, which is more than enough. This problem mostly
> >> >> happens on the first query, when nothing is cached. I use this
> >> >> configuration for highlighting:
> >> >>
> >> >>
> >> >>     $query->addHighlightField('description')->addHighlightField('plainText');
> >> >>     $query->setHighlightSimplePre('<strong>');
> >> >>     $query->setHighlightSimplePost('</strong>');
> >> >>     $query->setHighlightHighlightMultiTerm(TRUE);
> >> >>     $query->setHighlightMaxAnalyzedChars(10000);
> >> >>     $query->setHighlightSnippets(2);
> >> >>
> >> >> Do you have any suggestions to improve response time while
> >> >> highlighting is active? I have read a couple of the articles you
> >> >> previously provided, but they did not help.
> >> >>
> >> >> And for the second question, I retrieve these fields:
> >> >>
> >> >>
> >> >>     $query->addField('title')->addField('cat')->addField('thumbs_up')->
> >> >>             addField('thumbs_down')->addField('lang')->addField('id')->
> >> >>             addField('username')->addField('view_count')->addField('pages')->
> >> >>             addField('no_img')->addField('date');
> >> >>
> >> >> If I can't solve the highlighting problem on large documents, I can
> >> >> simply disable it and retrieve the first x characters from the
> >> >> plainText (full text) field, but is it possible to retrieve the first
> >> >> x characters without using the highlighting feature? When I use this:
> >> >>     $query->setHighlight(TRUE);
> >> >>     $query->setHighlightAlternateField('plainText');
> >> >>     $query->setHighlightMaxAnalyzedChars(0);
> >> >>     $query->setHighlightMaxAlternateFieldLength(256);
> >> >>
> >> >> It still takes 2 seconds if I retrieve 10 rows that have 200-300
> >> >> pages. Highlighting is still active, so it might be the source of the
> >> >> problem. I want to disable it completely and retrieve only the first
> >> >> 256 characters of the plainText field. Is that possible? It may remove
> >> >> some overhead and give better performance.
> >> >>
> >> >> I personally prefer the highlighting solution, but I would also like
> >> >> to hear the solution to this problem. For the same query, if I disable
> >> >> highlighting and do not retrieve (but still search) the plainText
> >> >> field, it drops to 0.0094 seconds. So I think if I can get the first
> >> >> 256 characters without using highlighting, I will get better
> >> >> performance.
> >> >>
> >> >> Any suggestions regarding these two problems will be highly
> >> >> appreciated.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Serdar Sahin
> >> >>
> >> >>
> >> >
> >> > Hi Serdar,
> >> >
> >> > There are a few things I can think of that you can try.
> >> >
> >> > 1. Provide another field for highlighting and use copyField
> >> > to copy plainText to the highlighting field. When using copyField,
> >> > specify maxChars attribute to limit the length of the copy of plainText.
> >> > This should work on Solr 1.4.
> >> >
> >> > 2. If you can use branch_3x version of Solr, try FastVectorHighlighter.
> >> >
> >> > Koji
> >> >
> >> > --
> >> > http://www.rondhuit.com/en/
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goks...@gmail.com
> >>
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
