Re: Highlighting Performance On Large Documents

Serdar Sahin Sat, 08 May 2010 17:34:54 -0700

Hi,

Sorry for the second e-mail. The duplication problem was my mistake; it
works now, and the query time has dropped to 0.1 seconds, which is perfect.
However, if I still use


       <field name="short_text" type="text" indexed="false" stored="true"
              multiValued="true" termVectors="true" termPositions="true"
              termOffsets="true" />

term* directives, it gives the same error, so I must either index the
short_text field as well or drop the term* directives. The results are still
perfect without them, so I am not using them.
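
A minimal sketch of the schema this describes, assuming the field names and copyField lines quoted further down in this thread are otherwise unchanged. The exact final configuration is not shown in the thread, so the copyField sources in particular are an assumption:

```xml
<!-- Searched but not stored: receives the full text of the source fields. -->
<field name="all_text" type="text" indexed="true" stored="false"
       multiValued="true" />

<!-- Stored but not indexed: the truncated copy used for snippets.
     The term* attributes are left out, because combining them with
     indexed="false" raised "conflicting indexed field options". -->
<field name="short_text" type="text" indexed="false" stored="true"
       multiValued="true" />

<!-- Assumption: same copyField layout as in the earlier message. -->
<copyField source="title" dest="all_text" />
<copyField source="tags" dest="all_text" />
<copyField source="plainText" dest="all_text" />
<copyField source="description" dest="all_text" />

<copyField source="title" dest="short_text" />
<copyField source="tags" dest="short_text" />
<copyField source="plainText" dest="short_text" maxChars="20000" />
<copyField source="description" dest="short_text" />
```

With a layout like this, searching would presumably run against all_text while short_text supplies the limited stored text for highlighting.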

Thanks, everyone. I hope this will be useful for others as well.

Serdar

On Sun, May 9, 2010 at 10:05 AM, Serdar Sahin <anlamar...@gmail.com> wrote:

> Hi,
>
> Thanks. However as I said before, termOffsets/termPositions/termVectors
> had very little effect on the performance and I don't know why. I have done
> exactly what you are saying but highlighting 10 documents that have 200-400
> A4 pages still takes around 2 seconds, depending on the query. I will play
> with it more.
>
> I actually want to highlight (and search) four fields, not one, if
> possible. That is why I added those four fields to the highlighting.
> However, I want to store only a limited number of characters from the
> plainText field, and I'll use that copyField as an alternate text field as
> well. If highlighting finds no match because of the character limit on
> plainText, I can fall back to description; if that is not available, to
> plainText; and if that is not available either (for example, for scanned
> documents), to tags or title as a snippet/description to insert into the
> web page. That is why I need a multivalued text field on both sides, for
> indexing and for storing. Only the stored one will be a little different.
>
> I have also tried to duplicate these four fields and use each one of them
> separately for indexing and storing to avoid indexing both copyfields (my
> previous email) . The problem was for both copyfields;
>
>        <field name="all_text" type="text" indexed="true" stored="false"
> multiValued="true" />
>        <field name="short_text" type="text" indexed="true" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true" />
>
> I had to index both of them, even the one I use only for storing;
> otherwise it gave an error. So I duplicated these four fields:
>
> mydata.xml
>                                 <field column="plainText"
> name="plain_text"/>
>                                 <field column="plainText"
> name="plain_text_ind"/>
>
> schema.xml
>         <field name="plain_text_ind" type="text" indexed="true"
> stored="false"/>
>         <field name="plain_text" type="text" indexed="false"
> stored="true"/>
>
> I have done it for these four fields and created copyfields for them.
>
>        <field name="all_text" type="text" indexed="true" stored="false"
> multiValued="true" />
>        <field name="short_text" type="text" indexed="false" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true" />
>
> but still it did not work and gave the same error;
>
> Caused by: java.lang.RuntimeException: SchemaField: short_text conflicting
> indexed field options:
>
> So, I have disabled term* directives just for testing
> and successfully indexed the data, but there was no short_text column in the
> solr index. Maybe duplicating does not work or I have done something wrong.
>
> So I guess my only way is to index short_text field as well without
> duplicating anything.
>
>      <field name="all_text" type="text" indexed="true" stored="false"
> multiValued="true" />
>        <field name="short_text" type="text" indexed="true" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true" />
>
> Thanks again,
>
> Serdar
>
> On  Sun, May 9, 2010 at 9:24 AM, Lance Norskog <goks...@gmail.com> wrote:
>
>> If you want to highlight field X, doing the
>> termOffsets/termPositions/termVectors will make highlighting that
>> field faster. You should make a separate field and apply these options
>> to that field.
>>
>> Now: doing a copyfield adds a "value" to a multiValued field. For a
>> text field, you get a multi-valued text field. You should only copy
>> one value to the highlighted field, so just copyField the document to
>> your special field. To enforce this, I would add multiValued="false"
>> to that field, just to avoid mistakes.
>>
>> So, all_text should be indexed without the term* attributes, and
>> should not be stored. Then your document stored in a separate field
>> that you use for highlighting and has the term* attributes.
>>
>> In general, highlighting has been a problem area all along and there
>> are little edge cases that I don't know how to solve.
>>
>> On Sat, May 8, 2010 at 7:23 AM, Serdar Sahin <anlamar...@gmail.com> wrote:
>> > Hi,
>> >
>> > Thanks a lot for the replies. I had a chance to test them today.
>> >
>> > First of all, termVectors/termPositions/termOffsets did not help; they had
>> > very little effect. I tried a workaround, but it is not as efficient as I
>> > thought.
>> >
>> > From these fields;
>> >
>> >        <field name="title" type="text" indexed="true" stored="true"
>> > required="true" omitNorms="true"/>
>> >        <field name="description" type="text" indexed="true" stored="true" />
>> >        <field name="tags" type="text" indexed="true" stored="true"
>> > omitNorms="true" />
>> >        <field name="plainText" type="text" indexed="true" stored="false"/>
>> >
>> > I tried to create copyfield
>> >        <field name="all_text" type="text" indexed="true" stored="true"
>> > multiValued="true" termVectors="true" termPositions="true"
>> > termOffsets="true" />
>> >
>> >        <copyField source="title" dest="all_text" />
>> >        <copyField source="tags" dest="all_text" />
>> >        <copyField source="plainText" dest="all_text"  maxChars="20000"/>
>> >        <copyField source="description" dest="all_text" />
>> >
>> > And I have indexed 1000 documents that have more than 200 pages.
>> >
>> > However, the maxChars directive also limited the characters in the indexed
>> > field. The query "Institute of Information Systems" gave 12 results. I also
>> > took unique words from the bottom of the text files and searched for them;
>> > they did not give any results. So, I just wanted to limit the size of the
>> > stored field, but that did not work. Then I tried to create two copyFields:
>> >
>> >       <field name="all_text" type="text" indexed="true" stored="false"
>> > multiValued="true" />
>> >       <field name="short_text" type="text" indexed="true" stored="true"
>> > multiValued="true" termVectors="true" termPositions="true"
>> > termOffsets="true" />
>> >
>> >        <copyField source="title" dest="all_text" />
>> >        <copyField source="tags" dest="all_text" />
>> >        <copyField source="plainText" dest="all_text" />
>> >        <copyField source="description" dest="all_text" />
>> >
>> >        <copyField source="title" dest="short_text" />
>> >        <copyField source="tags" dest="short_text" />
>> >        <copyField source="plainText" dest="short_text" maxChars="20000"/>
>> >        <copyField source="description" dest="short_text" />
>> >
>> > It gave 168 results, as I expected, and highlighting also worked reasonably
>> > fast. However, I don't know Solr/Lucene internals, but I guess I store the
>> > same indexed field twice, which should affect performance and storage. I
>> > tried to make it false, but then it gave this error:
>> >
>> > Caused by: java.lang.RuntimeException: SchemaField: all_text conflicting
>> > indexed field options:
>> >
>> > So, it was not possible.
>> >
>> > Then I tried hl.useFastVectorHighlighter with the latest version (and yes, I
>> > have turned on termVectors, termPositions, termOffsets), but the result was
>> > 2-2.5x slower, which was very strange. Do you have any guess why this might
>> > have happened?
>> >
>> > 1. Provide another field for highlighting and use copyField
>> >
>> > to copy plainText to the highlighting field. When using copyField,
>> >
>> > specify maxChars attribute to limit the length of the copy of plainText.
>> >
>> > This should work on Solr 1.4.
>> >
>> >
>> > Do you mean I need to duplicate these four fields and use one of them for
>> > storing and the other one for indexing? Then I guess I can do:
>> >
>> >      <field name="all_text" type="text" indexed="true" stored="false"
>> > multiValued="true" />
>> >       <field name="short_text" type="text" indexed="false" stored="true"
>> > multiValued="true" termVectors="true" termPositions="true"
>> > termOffsets="true" />  (indexing false)
>> >
>> > Is this the only solution? What are the effects on index size and
>> > performance? Could you give me some advice?
>> >
>> > Thanks,
>> >
>> > Serdar
>> >
>> >
>> >
>> > On Sat, May 8, 2010 at 1:00 PM, Lance Norskog <goks...@gmail.com> wrote:
>> >
>> >> Do you have these options turned on when you index the text field:
>> >> termVectors/termPositions/termOffsets ?
>> >>
>> >> Highlighting needs the information created by these analysis options.
>> >> If they are not turned on, Solr has to load the document text and run the
>> >> analyzer again with these options on, use that data to create the
>> >> highlighting, then throw away the reanalyzed data. Without these
>> >> options, you are basically re-indexing the document when you highlight
>> >> it.
>> >>
>> >>
>> >>
>> >> http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FFieldOptionsByUseCase
>> >>
>> >> On Wed, May 5, 2010 at 5:01 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>> >> > (10/05/05 22:08), Serdar Sahin wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Currently, there are similar topics active in the mailing list, but I
>> >> >> did not want to steal the topic.
>> >> >>
>> >> >> I have currently indexed 100.000 documents; they are Microsoft Office/PDF
>> >> >> etc. documents that I convert to TXT files before indexing. Files are
>> >> >> between 1-500 pages. When I search for something, filter the results to
>> >> >> documents that have more than 100 pages, and activate highlighting, it
>> >> >> takes 0.8-3 seconds, depending on the query (10 results per page). If I
>> >> >> retrieve documents that have 1-5 pages, it drops to 0.1 seconds.
>> >> >>
>> >> >> If I disable highlighting, it drops to 0.1-0.2 seconds, even on the large
>> >> >> documents, which is more than enough. This problem mostly happens where
>> >> >> there are no caches, on the first query. I use this configuration for
>> >> >> highlighting:
>> >> >>
>> >> >>
>> >> >>     $query->addHighlightField('description')->addHighlightField('plainText');
>> >> >>     $query->setHighlightSimplePre('<strong>');
>> >> >>     $query->setHighlightSimplePost('</strong>');
>> >> >>     $query->setHighlightHighlightMultiTerm(TRUE);
>> >> >>     $query->setHighlightMaxAnalyzedChars(10000);
>> >> >>     $query->setHighlightSnippets(2);
>> >> >>
>> >> >> Do you have any suggestions to improve response time while highlighting is
>> >> >> active? I have read a couple of the articles you previously provided, but
>> >> >> they did not help.
>> >> >>
>> >> >> And for the second question, I retrieve these fields:
>> >> >>
>> >> >>     $query->addField('title')->addField('cat')->addField('thumbs_up')->
>> >> >>             addField('thumbs_down')->addField('lang')->addField('id')->
>> >> >>             addField('username')->addField('view_count')->addField('pages')->
>> >> >>             addField('no_img')->addField('date');
>> >> >>
>> >> >> If I can't solve the highlighting problem on large documents, I can simply
>> >> >> disable it and retrieve the first x characters from the plainText (full
>> >> >> text) field, but is it possible to retrieve the first x characters without
>> >> >> using the highlighting feature? When I use this;
>> >> >>     $query->setHighlight(TRUE);
>> >> >>     $query->setHighlightAlternateField('plainText');
>> >> >>     $query->setHighlightMaxAnalyzedChars(0);
>> >> >>     $query->setHighlightMaxAlternateFieldLength(256);
>> >> >>
>> >> >> It still takes 2 seconds if I retrieve 10 rows that have 200-300 pages. The
>> >> >> highlighting still works, so it might be the source of the problem. I want
>> >> >> to completely disable it and retrieve only the first 256 characters of the
>> >> >> plainText field. Is it possible? It may remove some overhead and give
>> >> >> better performance.
>> >> >>
>> >> >> I personally prefer the highlighting solution, but I would also like to
>> >> >> hear the solution for this problem. For the same query, if I disable
>> >> >> highlighting and do not retrieve (but still search) the plainText field, it
>> >> >> drops to 0.0094 seconds. So I think if I can get the first 256 characters
>> >> >> without using the highlighting, I will get better performance.
>> >> >>
>> >> >> Any suggestions regarding these two problems will be highly appreciated.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Serdar Sahin
>> >> >>
>> >> >>
>> >> >
>> >> > Hi Serdar,
>> >> >
>> >> > There are a few things I can think of that you can try.
>> >> >
>> >> > 1. Provide another field for highlighting and use copyField
>> >> > to copy plainText to the highlighting field. When using copyField,
>> >> > specify maxChars attribute to limit the length of the copy of plainText.
>> >> > This should work on Solr 1.4.
>> >> >
>> >> > 2. If you can use branch_3x version of Solr, try FastVectorHighlighter.
>> >> >
>> >> > Koji
>> >> >
>> >> > --
>> >> > http://www.rondhuit.com/en/
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lance Norskog
>> >> goks...@gmail.com
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>
>
