Hi,

Sorry for the second e-mail. The duplication problem was my mistake; it
works now, and the query time has dropped to 0.1 seconds, which is perfect.
However, if I still use the term* directives on a field that is not indexed,

<field name="short_text" type="text" indexed="false" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />

it gives the same error, so I either have to index the short_text field as
well or drop the term* directives. It still gives perfect results without
them, so I am not using them.

Thanks everyone, I hope this will be useful for others as well.

Serdar

On Sun, May 9, 2010 at 10:05 AM, Serdar Sahin <anlamar...@gmail.com> wrote:

> Hi,
>
> Thanks. However, as I said before, termOffsets/termPositions/termVectors
> had very little effect on the performance and I don't know why. I have
> done exactly what you are saying, but highlighting 10 documents that have
> 200-400 A4 pages still takes around 2 seconds, depending on the query. I
> will play with it more.
>
> I actually want to highlight (and search) 4 fields, not one field, if
> possible. That's why I have added those four fields to the highlighting.
> However, what I want is to store only a limited number of characters from
> the plainText field, and I'll use that copyfield as an alternate text
> field as well. So if it cannot find any matches for highlighting due to
> the limited character size from the plainText field, I can bring
> description, and if it is not available, then I can bring plainText, and
> if that is not available either (for example for scanned documents), then
> I can bring tags or title as a snippet/description and insert it into the
> web page. That's why I need a multivalued text field for both sides, for
> indexing and storing. Just storing will be a little different.
>
> I have also tried to duplicate these four fields and use each one of them
> separately for indexing and storing, to avoid indexing both copyfields (my
> previous email).
> The problem was for both copyfields;
>
> <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
> <field name="short_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>
> I had to index them even if I use it just for storing. It was giving an
> error. So I have duplicated these four fields;
>
> mydata.xml
> <field column="plainText" name="plain_text"/>
> <field column="plainText" name="plain_text_ind"/>
>
> schema.xml
> <field name="plain_text_ind" type="text" indexed="true" stored="false"/>
> <field name="plain_text" type="text" indexed="false" stored="true"/>
>
> I have done it for these four fields and created copyfields for them.
>
> <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
> <field name="short_text" type="text" indexed="false" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>
> but still it did not work and gave the same error;
>
> Caused by: java.lang.RuntimeException: SchemaField: short_text conflicting
> indexed field options:
>
> So, I have disabled term* directives just for testing and successfully
> indexed the data, but there was no short_text column in the solr index.
> Maybe duplicating does not work or I have done something wrong.
>
> So I guess my only way is to index short_text field as well without
> duplicating anything.
>
> <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
> <field name="short_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>
> Thanks again,
>
> Serdar
>
> On Sun, May 9, 2010 at 9:24 AM, Lance Norskog <goks...@gmail.com> wrote:
>
>> If you want to highlight field X, doing the
>> termOffsets/termPositions/termVectors will make highlighting that
>> field faster.
>> You should make a separate field and apply these options
>> to that field.
>>
>> Now: doing a copyfield adds a "value" to a multiValued field. For a
>> text field, you get a multi-valued text field. You should only copy
>> one value to the highlighted field, so just copyField the document to
>> your special field. To enforce this, I would add multiValued="false"
>> to that field, just to avoid mistakes.
>>
>> So, all_text should be indexed without the term* attributes, and
>> should not be stored. Then your document is stored in a separate field
>> that you use for highlighting and that has the term* attributes.
>>
>> In general, highlighting has been a problem area all along and there
>> are little edge cases that I don't know how to solve.
>>
>> On Sat, May 8, 2010 at 7:23 AM, Serdar Sahin <anlamar...@gmail.com> wrote:
>> > Hi,
>> >
>> > Thanks a lot for the replies, I had a chance to test them today.
>> >
>> > First of all, termVectors/termPositions/termOffsets did not help; it
>> > has very little effect. I tried a workaround, however it is not as
>> > efficient as I thought.
>> >
>> > From these fields;
>> >
>> > <field name="title" type="text" indexed="true" stored="true" required="true" omitNorms="true"/>
>> > <field name="description" type="text" indexed="true" stored="true" />
>> > <field name="tags" type="text" indexed="true" stored="true" omitNorms="true" />
>> > <field name="plainText" type="text" indexed="true" stored="false"/>
>> >
>> > I tried to create copyfield
>> > <field name="all_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>> >
>> > <copyField source="title" dest="all_text" />
>> > <copyField source="tags" dest="all_text" />
>> > <copyField source="plainText" dest="all_text" maxChars="20000"/>
>> > <copyField source="description" dest="all_text" />
>> >
>> > And I have indexed 1000 documents that have more than 200 pages.
>> >
>> > However, the maxChars directive also limited the number of characters
>> > in the indexed field. For the query "Institute of Information
>> > Systems", it gave 12 results. I also tried to take unique words from
>> > the bottom of the text files and search for them; they did not give
>> > any result. So, I just wanted to limit the size of the stored field,
>> > but it did not work. Then I tried to create two copyfields
>> >
>> > <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
>> > <field name="short_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>> >
>> > <copyField source="title" dest="all_text" />
>> > <copyField source="tags" dest="all_text" />
>> > <copyField source="plainText" dest="all_text" />
>> > <copyField source="description" dest="all_text" />
>> >
>> > <copyField source="title" dest="short_text" />
>> > <copyField source="tags" dest="short_text" />
>> > <copyField source="plainText" dest="short_text" maxChars="20000"/>
>> > <copyField source="description" dest="short_text" />
>> >
>> > It gave 168 results, as I expected, and highlighting also worked
>> > reasonably fast. However, I don't know Solr/Lucene internals, but I
>> > guess I store the same indexed field twice, and it should have an
>> > effect on the performance and storage. I tried to make it false but
>> > then it gave this error:
>> >
>> > Caused by: java.lang.RuntimeException: SchemaField: all_text conflicting
>> > indexed field options:
>> >
>> > So, it was not possible.
>> >
>> > Then I tried hl.useFastVectorHighlighter with the latest version (and
>> > yes, I have turned on termVectors, termPositions, termOffsets) but the
>> > result was 2-2.5x slower, which was very strange. Do you have any
>> > guess why this might have happened?
>> >
>> > 1. Provide another field for highlighting and use copyField
>> > to copy plainText to the highlighting field.
>> > When using copyField,
>> > specify maxChars attribute to limit the length of the copy of plainText.
>> > This should work on Solr 1.4.
>> >
>> > Do you mean I need to duplicate these four fields and use one of them
>> > for storing and the other one for indexing? Then I guess I can do;
>> >
>> > <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
>> > <field name="short_text" type="text" indexed="false" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" /> (indexing false)
>> >
>> > Is this the only solution? What are the effects on index size and
>> > performance? Could you give me some advice?
>> >
>> > Thanks,
>> >
>> > Serdar
>> >
>> > On Sat, May 8, 2010 at 1:00 PM, Lance Norskog <goks...@gmail.com> wrote:
>> >
>> >> Do you have these options turned on when you index the text field:
>> >> termVectors/termPositions/termOffsets ?
>> >>
>> >> Highlighting needs the information created by these analysis options.
>> >> If they are not turned on, Solr has to load the document text and run
>> >> the analyzer again with these options on, uses that data to create the
>> >> highlighting, then throws away the reanalyzed data. Without these
>> >> options, you are basically re-indexing the document when you highlight
>> >> it.
>> >>
>> >> http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FFieldOptionsByUseCase
>> >>
>> >> On Wed, May 5, 2010 at 5:01 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>> >> > (10/05/05 22:08), Serdar Sahin wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Currently, there are similar topics active in the mailing list, but
>> >> >> I did not want to steal the topic.
>> >> >>
>> >> >> I have currently indexed 100,000 documents. They are Microsoft
>> >> >> Office/PDF etc. documents that I convert to TXT files before
>> >> >> indexing. Files are between 1-500 pages.
>> >> >> When I search for something and filter it to retrieve documents
>> >> >> that have more than 100 pages, and activate highlighting, it takes
>> >> >> 0.8-3 seconds, depending on the query (10 results per page). If I
>> >> >> retrieve documents that have 1-5 pages, it drops to 0.1 seconds.
>> >> >>
>> >> >> If I disable highlighting, it drops to 0.1-0.2 seconds, even on the
>> >> >> large documents, which is more than enough. This problem mostly
>> >> >> happens where there are no caches, on the first query. I use this
>> >> >> configuration for highlighting:
>> >> >>
>> >> >> $query->addHighlightField('description')->addHighlightField('plainText');
>> >> >> $query->setHighlightSimplePre('<strong>');
>> >> >> $query->setHighlightSimplePost('</strong>');
>> >> >> $query->setHighlightHighlightMultiTerm(TRUE);
>> >> >> $query->setHighlightMaxAnalyzedChars(10000);
>> >> >> $query->setHighlightSnippets(2);
>> >> >>
>> >> >> Do you have any suggestions to improve response time while
>> >> >> highlighting is active? I have read a couple of articles you have
>> >> >> previously provided but they did not help.
>> >> >>
>> >> >> And for the second question, I retrieve these fields:
>> >> >>
>> >> >> $query->addField('title')->addField('cat')->addField('thumbs_up')->
>> >> >> addField('thumbs_down')->addField('lang')->addField('id')->
>> >> >> addField('username')->addField('view_count')->addField('pages')->
>> >> >> addField('no_img')->addField('date');
>> >> >>
>> >> >> If I can't solve the highlighting problem on large documents, I can
>> >> >> simply disable it and retrieve the first x characters from the
>> >> >> plainText (full text) field, but is it possible to retrieve the
>> >> >> first x characters without using the highlighting feature?
>> >> >> When I use this;
>> >> >> $query->setHighlight(TRUE);
>> >> >> $query->setHighlightAlternateField('plainText');
>> >> >> $query->setHighlightMaxAnalyzedChars(0);
>> >> >> $query->setHighlightMaxAlternateFieldLength(256);
>> >> >>
>> >> >> It still takes 2 seconds if I retrieve 10 rows that have 200-300
>> >> >> pages. The highlighting still works, so it might be the source of
>> >> >> the problem. I want to completely disable it and retrieve only the
>> >> >> first 256 characters of the plainText field. Is it possible? It may
>> >> >> remove some overhead and give better performance.
>> >> >>
>> >> >> I personally prefer the highlighting solution but I also would like
>> >> >> to hear the solution for this problem. For the same query, if I
>> >> >> disable highlighting and do not retrieve (but still search) the
>> >> >> plainText field, it drops to 0.0094 seconds. So I think if I can get
>> >> >> the first 256 characters without using the highlighting, I will get
>> >> >> better performance.
>> >> >>
>> >> >> Any suggestions regarding these two problems will be highly
>> >> >> appreciated.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Serdar Sahin
>> >> >>
>> >> >
>> >> > Hi Serdar,
>> >> >
>> >> > There are a few things I can think of that you can try.
>> >> >
>> >> > 1. Provide another field for highlighting and use copyField
>> >> > to copy plainText to the highlighting field. When using copyField,
>> >> > specify maxChars attribute to limit the length of the copy of
>> >> > plainText. This should work on Solr 1.4.
>> >> >
>> >> > 2. If you can use branch_3x version of Solr, try
>> >> > FastVectorHighlighter.
>> >> >
>> >> > Koji
>> >> >
>> >> > --
>> >> > http://www.rondhuit.com/en/
>> >> >
>> >>
>> >> --
>> >> Lance Norskog
>> >> goks...@gmail.com
>> >
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
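
For anyone reading this thread later, the two schema variants it ends up with can be sketched roughly as below. This is only an illustration pieced together from the messages above, not a tested configuration. The field names and maxChars value are the ones used in the thread, but showing only the plainText copyField lines (and using plainText rather than the duplicated plain_text / plain_text_ind fields as the source) is an assumption. The "conflicting indexed field options" error appears because termVectors, termPositions and termOffsets are index-time options, so Solr rejects them on a field declared with indexed="false".

```xml
<!-- Searched but never returned: indexed only. -->
<field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />

<!-- Variant A (what the final message reports using): stored only, no term* attributes. -->
<field name="short_text" type="text" indexed="false" stored="true" multiValued="true" />

<!-- Variant B: keep the term* attributes, which requires indexed="true".
<field name="short_text" type="text" indexed="true" stored="true" multiValued="true"
       termVectors="true" termPositions="true" termOffsets="true" />
-->

<!-- Assumed copyField lines: the full text feeds the search field,
     and only the first 20000 characters are kept in the stored copy. -->
<copyField source="plainText" dest="all_text" />
<copyField source="plainText" dest="short_text" maxChars="20000" />
```

Variant A keeps the index smaller; Variant B is the one Lance's advice points to if term vectors are wanted for highlighting, at the cost of indexing the short copy as well.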