Re: Clustering question: searching two diferent indexes
Thanks Otis, but can I merge two indexes with different fields?

My big index has these fields: code, title, content, language and date. I add new documents incrementally. The clustering index contains only the fields code and cluster.

Will merging the big index with the clustering one preserve the order of the big one? For example, if I have the following indexes:

Big index
    code_1, title_1, content_1, language_1, date_1
    code_2, title_2, content_2, language_2, date_2
    ...

Clustering index
    code_1, cluster_1
    code_2, cluster_2
    ...

then the new merged index would be:

Merged index
    code_1, title_1, content_1, language_1, date_1, cluster_1
    code_2, title_2, content_2, language_2, date_2, cluster_2
    ...

If I can do that, fine, but I think the merging process uses the Lucene internal ID to match the documents. I want to use the code field for that matching. Is that possible? I cannot be sure the Lucene internal IDs are the same for the same codes in both indexes.

Thanks again,
Albert

Otis Gospodnetic wrote:
> (re-directing to lucene-user list)
>
> Albert,
>
> If I understand your question correctly... You could run a query like the one you gave on both indices, but if one of them contains documents that have only one of those fields (cluster), then there will never be any matches in the second index.
>
> However, why not leave your big index alone, add documents to a new, smaller index, and then merge them periodically?
>
> I may be off with this; it sounds like this is what you want to do, but I'm not certain I understood you fully.
>
> Otis
>
> --- Albert Vila [EMAIL PROTECTED] wrote:
>> Hi all,
>>
>> I was wondering if I can search using the MultiSearcher over two different indexes at the same time (with different fields).
>>
>> I've got one big index with the code, title, content, language, etc. fields (new documents are added incrementally). Now I have to introduce a clustering field. The problem is that I have to update the whole index each time the clusters change, and I do not have enough time to do it (I want to check for new clusters every 10 minutes, and reindexing the whole index takes 25 minutes).
>>
>> A query example could be: language:0 and title:java and cluster:0
>>
>> Can I leave the big index without any changes, create a new index with only the fields code and cluster, and perform the searches using these two indexes?
>>
>> I think I cannot do that without changing the code. It would need a postprocess, matching all returned codes from index 1 with index 2.
>>
>> Does anyone have a solution for this problem? I would appreciate that.
>>
>> --
>> Albert Vila
>> R&D Project Director
>> http://www.imente.com
>> 902 933 242
>> [iMente La información con más beneficios]

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustering question: searching two diferent indexes
Albert,

--- Albert Vila [EMAIL PROTECTED] wrote:
> Thanks Otis, but can I merge two indexes with different fields?

Yes. Documents with different Fields can be stored in the same index. Not every Document has to have all Fields, and a Document can even have a completely different set of Fields.

> My big index has these fields: code, title, content, language and date. I add new documents incrementally. The clustering index contains only the fields code and cluster. Will merging the big index with the clustering one preserve the order of the big one?

I don't fully understand what you mean by 'order'. If you are asking whether internal document Ids will remain the same, the answer is no. If you have deleted some documents, there will be gaps in the document Id sequence, which Lucene will fill, thus re-assigning internal document Ids.

> For example, if I have the following indexes: [...]
> If I can do that, fine, but I think the merging process uses the Lucene internal ID to match the documents. I want to use the code field for that matching. Is that possible? I cannot be sure the Lucene internal IDs are the same for the same codes in both indexes.

Are you storing the internal Lucene Document Id in the 'code' field? If you are, I suggest you change your application to use its own set of unique Ids to serve as 'primary keys' in your indices.

Otis
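The renumbering described above can be pictured with a minimal sketch in plain Java. This is a toy model and not Lucene code: the class `DocIdCompaction` and its `compact` method are hypothetical names, list positions stand in for internal document Ids, and `null` marks a deleted document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (NOT the Lucene API): the list position plays the role of the
// internal document Id, and null marks a deleted document.
class DocIdCompaction {

    // Drop deleted slots, the way gaps are squeezed out of the Id sequence;
    // every surviving document after a gap ends up with a smaller Id.
    static List<String> compact(List<String> slots) {
        List<String> packed = new ArrayList<>();
        for (String code : slots) {
            if (code != null) {
                packed.add(code);
            }
        }
        return packed;
    }

    public static void main(String[] args) {
        List<String> index = new ArrayList<>(Arrays.asList("code_1", "code_2", "code_3"));
        index.set(1, null); // delete code_2, leaving a gap at Id 1
        System.out.println("Id of code_3 before: " + index.indexOf("code_3"));
        List<String> packed = compact(index);
        System.out.println("Id of code_3 after:  " + packed.indexOf("code_3"));
    }
}
```

In this model the document with code_3 moves from position 2 to position 1 once the gap is filled, which is why an application-level key such as the code field is a safer way to identify a document than its internal Id.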
Re: Clustering question: searching two diferent indexes
By 'order', I mean that I add the documents to the big index sorted by date (in order to speed up sorting). I want to preserve this ordering after the merge.

I'm not storing the internal Lucene ID in the code field. The code field contains my own IDs. I was asking whether I can do the merge using my own IDs (the code field) rather than the Lucene internal IDs. For example:

Big index
    luceneID_0, code_x, title_x, content_x, language_x, date_x
    luceneID_1, code_y, title_y, content_y, language_y, date_y

Clustering index
    luceneID_0, code_y, cluster_y
    luceneID_1, code_x, cluster_x

Will the index structure above produce an inconsistent merged index? I want to achieve the following merged index:

    luceneID_0, code_x, title_x, content_x, language_x, date_x, cluster_x
    luceneID_1, code_y, title_y, content_y, language_y, date_y, cluster_y

Thanks
Re: Clustering question: searching two diferent indexes
Aha, now I see what you mean. You didn't mention 'date' before. :)

So, dates will be preserved, and you will be able to keep using them for sorting. However, Lucene will not automatically recognize your 'PK fields' and merge fields from two Documents with the same PK into a single Document. You can think of 'merge' as 'add' (well, the method name is addIndexes, actually :)), so Lucene will simply make a cumulative index from your two separate indices:

    luceneID_0, code_x, title_x, content_x, language_x, date_x
    luceneID_1, code_y, title_y, content_y, language_y, date_y
    luceneID_0, code_y, cluster_y
    luceneID_1, code_x, cluster_x

Otis
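The "merge as add" behaviour can be sketched in plain Java. This is a toy model, not the Lucene API: `CumulativeMerge`, `doc`, `concatenate` and `withTitleAndCluster` are hypothetical names, a document is modelled as a map from field name to value, and an index as a list of documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (NOT the Lucene API): a document is a map of field -> value,
// and an index is a list of documents.
class CumulativeMerge {

    // Build a document from alternating field names and values.
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new LinkedHashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    // "Merge" in the cumulative sense: every document of both indexes is
    // copied over unchanged; nothing is joined on the code field.
    static List<Map<String, String>> concatenate(List<Map<String, String>> a,
                                                 List<Map<String, String>> b) {
        List<Map<String, String>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Count the documents that carry both a title and a cluster field.
    static long withTitleAndCluster(List<Map<String, String>> index) {
        return index.stream()
                .filter(d -> d.containsKey("title") && d.containsKey("cluster"))
                .count();
    }

    public static void main(String[] args) {
        List<Map<String, String>> big = Arrays.asList(
                doc("code", "x", "title", "java"),
                doc("code", "y", "title", "lucene"));
        List<Map<String, String>> clusters = Arrays.asList(
                doc("code", "y", "cluster", "0"),
                doc("code", "x", "cluster", "1"));
        List<Map<String, String>> merged = concatenate(big, clusters);
        System.out.println(merged.size() + " documents, "
                + withTitleAndCluster(merged) + " with both title and cluster");
    }
}
```

In this model the merged index holds four documents and none of them has both a title and a cluster field, so a query that requires both fields on the same document cannot match anything.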
Re: Storing data in Lucene or Xindice
(redirecting to lucene-user list)

Hello Rob,

I think you will end up with a simpler final result if you try saving everything in a single data source. I have not used Xindice, so I cannot comment on its features, performance, etc., but judging from your description, you could simply use Lucene to index the textual information from XML feeds or HTML.

For XML parsing and indexing, you can see the article I wrote for IBM developerWorks: http://www-106.ibm.com/developerworks/java/library/j-lucene/

If you will be doing a lot of parsing, you will want to use something faster than Digester, though. Maybe the Electric XML parser. For HTML you can use NekoHTML, JTidy, htmlparser (sf.net), or Brian Goetz's HTMLParser.

Now that I think about it, I seem to recall that Xindice uses Lucene under the hood. I can't find any information that confirms this now. Maybe I'm mixing something up.

Otis

--- Rob Clews [EMAIL PROTECTED] wrote:
> Hi,
>
> I'm currently looking at using Lucene to index some XML feeds we receive for content. However, some of the feeds contain the article contents and some don't. The feeds that do contain the contents are in XML; for the others we must retrieve the contents in HTML.
>
> I was originally going to store the XML contents from the feed in Xindice and retrieve them for each result from a Lucene query, but I guess I could store them in Lucene.
>
> We expect to build up a lot of content from shortish articles on the web, and our main focus is speed, so would I be best storing the contents in Lucene or Xindice? Would storing more (non-indexable) data in Lucene slow it down on queries?
>
> Thanks,
> Rob Clews
Re: ANN: Luke v. 0.5 released
Vladimir Yuryev wrote:
> Hi Andrzej!
> I congratulate you on the successful version. RussianAnalyzer works with my indexes, but there are problems with some words. These problem words are found only by the WildCard method.

I don't quite understand what you are saying... Do you suspect there is a bug in Luke somewhere on the Search tab? If that's the case, please provide an example.

> Besides, AnalyzerTool works with these words without problems.
> There is one more small discrepancy on the webpage http://www.getopt.org/luke/
> - Remember to put both JARs on your classpath, e.g.: java -classpath luke.jar;lucene.jar org.getopt.luke.Luke
> + Remember to put both JARs on your classpath, e.g.: java -classpath luke.jar:lucene.jar org.getopt.luke.Luke

Well, both versions are correct - just the platform is different :-). I'll make a clarification. Thank you!

--
Best regards,
Andrzej Bialecki
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (http://www.freebsd.org)
Re: Clustering question: searching two diferent indexes
OK, but with this solution I cannot perform queries like:

    get all codes that match title:java and language:english and cluster:0

Albert
Re: Clustering question: searching two diferent indexes
Correct, that is what I meant when I said your application will have to handle your particular merge. Instead of using the addIndexes method, your application will have to go through all Documents in the smaller index (the one with cluster fields), get the PK of each Doc, look up that Doc by PK in the big index, delete it from there if it exists, and re-add it to the big index.

Otis
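The application-level merge described above can be sketched in plain Java. This is a toy model under stated assumptions, not the Lucene API: `KeyedMerge`, `doc`, `indexOfKey` and `mergeByKey` are hypothetical names, a document is a map from field name to value, and an index is a mutable list. The sketch also assumes that the re-added document carries the cluster field copied from the small index, and that small-index entries with no counterpart in the big index are skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (NOT the Lucene API): a document is a map of field -> value,
// and an index is a mutable list of documents.
class KeyedMerge {

    // Build a document from alternating field names and values.
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new LinkedHashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    // Position of the first document whose key field equals value, or -1.
    static int indexOfKey(List<Map<String, String>> index, String keyField, String value) {
        for (int i = 0; i < index.size(); i++) {
            if (value.equals(index.get(i).get(keyField))) {
                return i;
            }
        }
        return -1;
    }

    // For each document of the small index: look it up by key in the big
    // index, delete the old copy, and re-add it with the extra fields.
    static void mergeByKey(List<Map<String, String>> big,
                           List<Map<String, String>> small,
                           String keyField) {
        for (Map<String, String> extra : small) {
            String pk = extra.get(keyField);
            if (pk == null) {
                continue; // no key, nothing to match on
            }
            int pos = indexOfKey(big, keyField, pk);
            if (pos < 0) {
                continue; // assumption: unknown keys are skipped
            }
            Map<String, String> updated = new LinkedHashMap<>(big.remove(pos)); // delete
            updated.putAll(extra);
            big.add(updated); // re-add: the document lands at the end
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> big = new ArrayList<>();
        big.add(doc("code", "x", "title", "java", "date", "1"));
        big.add(doc("code", "y", "title", "lucene", "date", "2"));
        List<Map<String, String>> clusters = new ArrayList<>();
        clusters.add(doc("code", "y", "cluster", "0"));
        clusters.add(doc("code", "x", "cluster", "1"));
        mergeByKey(big, clusters, "code");
        for (Map<String, String> d : big) {
            System.out.println(d);
        }
    }
}
```

Because each updated document is deleted and appended again, the documents end up in the order in which the small index was processed, not in their original insertion order. If the date-sorted insertion order of the big index matters, that side effect needs to be taken into account.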
RE: using boost factor
Hi guys,

It seems that to really customise the scoring in Lucene, one has to go into the Lucene source. I spent a fair bit of time looking into this, and it seems to me that the scoring API is not fully exposed.

The formula documented on the Similarity class seems to explain how a term is scored, but not, for example, how the final score of a Boolean query is computed from its individual components. (Please correct me if I'm wrong.) Normalisation is another part where the API is not exposed.

Anson

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 23, 2004 3:51 AM
To: Lucene Users List
Subject: Re: using boost factor

Hello Anson,

I would look at IndexSearcher's explain method:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query,%20int)

This should give you insight into what's contributing to the high/low scores, thus telling you what you can tweak. Perhaps it's just the boost, perhaps some other similarity factors. Using explain should provide you with information such as this, for example:
http://www.mozdex.com/explain.jsp?idx=2&id=2067257&query=goober

I hope this helps. Somebody else will probably be able to give more information, but this should get you started while you wait.

Otis

--- Anson Lau [EMAIL PROTECTED] wrote:
> Hi guys,
>
> Let's say I want to search for the term "hello world" over 3 fields with different boosts:
>
>     ((field1:hello field1:world)^0.001 (field2:hello field2:world)^100 (field3:hello field3:world)^2)
>
> Note I've given field1 a really low boost, a heavy boost to field2 and a REALLY heavy boost to field3.
>
> What is happening to me is that a document that matches both field1 and field2 will have a higher score than a document that matches field3 only, even though field3's boost is WAY higher.
>
> Can I change this behaviour such that the match in field3 only will actually have a higher score because of the boost?
>
> Thanks,
> Anson
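As a rough illustration of how clause scores might combine in a Boolean query, here is a minimal sketch in plain Java. It is a deliberately simplified model and not Lucene's actual scoring code: it assumes the final score is the sum of the boosted scores of the matching clauses, multiplied by a coordination factor equal to the fraction of clauses that matched, and it ignores tf, idf, field norms and query normalisation. The class `BooleanScoreSketch`, its `score` method, and the raw per-clause score of 1.0 are hypothetical.

```java
// Simplified model (NOT Lucene's implementation): score = coord * sum of
// boosted scores of the matching clauses, where coord is the fraction of
// clauses that matched. tf, idf, norms and queryNorm are ignored.
class BooleanScoreSketch {

    // boosts[i] is the boost of clause i; raw[i] is the unboosted score of
    // clause i for this document, or 0 if the clause did not match.
    static double score(double[] boosts, double[] raw) {
        int matched = 0;
        double sum = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            if (raw[i] > 0) {
                matched++;
                sum += boosts[i] * raw[i];
            }
        }
        double coord = (double) matched / boosts.length;
        return coord * sum;
    }

    public static void main(String[] args) {
        double[] boosts = {0.001, 100, 2};     // field1, field2, field3 as written in the query
        double[] field1And2 = {1.0, 1.0, 0.0}; // document matching field1 and field2
        double[] field3Only = {0.0, 0.0, 1.0}; // document matching field3 only
        System.out.printf("field1+field2: %.3f%n", score(boosts, field1And2));
        System.out.printf("field3 only:   %.3f%n", score(boosts, field3Only));
    }
}
```

Under these assumptions the document matching field1 and field2 scores roughly 66.7 and the field3-only document roughly 0.67. Two things push in the same direction: the boosts as written give field2 (100) a larger weight than field3 (2), and the coordination factor rewards documents that match more clauses. The explain output is the way to see which factors actually dominate in a real index.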
Delete Indexed from Merged Document
Guys,

Has somebody out there tried deleting or updating indexed files in a merged index? If so, please explain how to do this.

With regards,
Karthik

-----Original Message-----
From: Karthik N S [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 23, 2004 9:24 AM
To: Lucene Users List
Subject: RE: Delete Indexed from Merged Document

Hi Otis,

The link you specified shows how to update an indexed file (deleting the old one and then adding the new one).

But my question, to be more specific, is: when we have merged more than 2 indexes (using writer.addIndexes(luceneDirs)), how do we delete one of the indexed files from the merged index in order to insert a new, updated one?

Please provide a sample code snippet in this regard.

With regards,
Karthik

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 22, 2004 12:52 PM
To: Lucene Users List
Subject: Re: Delete Indexed from Merged Document

Hello Karthik,

Here is the answer: http://www.jguru.com/faq/view.jsp?EID=492423

Otis

--- Karthik N S [EMAIL PROTECTED] wrote:
> Dev guys,
>
> Apologies, please. How do I delete an indexed document from a merged index file? Can somebody write me some code snippets on this, please?
>
> With regards,
> Karthik
RE: Delete Indexed from Merged Document
Hi Mr Wolf,

What is this:

    // remove the document from index
    int docID = hits.id(0);

and can I increment the 0 in the brackets for deletion?

Thanks in advance,
Karthik

-Original Message-
From: [EMAIL PROTECTED]
Sent: Wednesday, June 23, 2004 5:33 PM
To: [EMAIL PROTECTED]
Subject: AW: Delete Indexed from Merged Document

Hello,

Karthik N S wrote:

> Has Somebody out there tried DELETING/UPDATION of INDEXED Files from a MERGED Index Format, If HowTo do this Please Explain

Of course you can delete or update a document in a merged index. It works the same way as for any other index. You need a unique key (e.g. the file name or URI), indexed for searching, to find the right document, because the internal document numbers change after merging indexes, or after deleting documents and optimizing an index. Using this key you can search for the document and remove it. It doesn't matter whether your index was created by merging several indexes or not.

Example:

    /* Create index: */
    Document document = new Document();
    // this must be unique for each document!
    document.add(Field.Keyword("filename", file_name));
    document.add(Field.Text("content", file_content));
    writer.addDocument(document);
    /* ... */
    writer.close();

    /* Update or remove document:
       use the file name to find the original document and remove it from the index */
    FSDirectory indexDirectory = FSDirectory.getDirectory(indexPath, false);
    IndexReader indexReader = IndexReader.open(indexDirectory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // create query and search for document using its filename
    TermQuery query = new TermQuery(new Term("filename", file_name));
    Hits hits = indexSearcher.search(query);
    if ( hits.length() > 0 ) {
        // remove the document from index
        int docID = hits.id(0);
        indexReader.delete( docID );
    }
    // else: this is a new file or already removed, so we can simply add it.
    indexSearcher.close();
    indexReader.close();
    indexDirectory.close();
    // now open an IndexWriter for the same index and add the updated file
    // as new document
    /* done */

Hope it helps.

Regards,
Wolf-Dietrich Materna
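The point about internal document numbers can be illustrated without Lucene at all. The sketch below is a toy model, not Lucene code: `ToyIndex` and every method on it are hypothetical names invented for this illustration, and a document's "internal number" is assumed to be nothing more than its position in a list. It only shows why a number obtained before a merge or a compacting delete cannot be trusted afterwards, while a lookup by unique key keeps working.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an index (hypothetical): internal number == list position. */
class ToyIndex {
    /** One document: a unique key (e.g. a file name) plus its content. */
    private static final class Doc {
        final String key;
        final String content;
        Doc(String key, String content) { this.key = key; this.content = content; }
    }

    private final List<Doc> docs = new ArrayList<>();

    void add(String key, String content) { docs.add(new Doc(key, content)); }

    /** Appends every document of another index, the way a merge would. */
    void addIndex(ToyIndex other) { docs.addAll(other.docs); }

    /** Current internal number of the document with this key, or -1 if absent. */
    int idOf(String key) {
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).key.equals(key)) {
                return i;
            }
        }
        return -1;
    }

    /** Deletes by key; later documents shift down, as after an optimize. */
    boolean deleteByKey(String key) {
        int id = idOf(key);
        if (id < 0) {
            return false; // new or already removed
        }
        docs.remove(id);
        return true;
    }

    /** Update = delete the old version (if any), then add the new one. */
    void update(String key, String content) {
        deleteByKey(key);
        add(key, content);
    }

    String contentOf(String key) {
        int id = idOf(key);
        return id < 0 ? null : docs.get(id).content;
    }

    int size() { return docs.size(); }

    public static void main(String[] args) {
        ToyIndex a = new ToyIndex();
        a.add("a.txt", "alpha");
        ToyIndex b = new ToyIndex();
        b.add("b.txt", "beta");
        b.add("c.txt", "gamma");

        ToyIndex merged = new ToyIndex();
        merged.addIndex(a);
        merged.addIndex(b);
        System.out.println(b.idOf("b.txt"));      // 0 in its own index
        System.out.println(merged.idOf("b.txt")); // 1 after the merge

        merged.deleteByKey("a.txt");
        System.out.println(merged.idOf("c.txt")); // 1, it was 2 before the delete
    }
}
```

In this model, as in the advice above, the key is the only stable handle on a document: numbers are looked up immediately before use and never stored.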
AW: Delete Indexed from Merged Document
Hello,

Karthik N S wrote:

> Hi Mr Wolf

Wolf-Dietrich is my first name, so leave out "Mr." or use my family name (which is uncommon here).

> What is this
>     // remove the document from index
>     int docID = hits.id(0);
> and can I increment the 0 factor in the bracket ...for deletion

Yes, but there is no reason to do so in this case. You search for documents using their file name (including the full path!) and get a result, which is a kind of list. Please read the Javadoc for the Hits class. hits.id(0) returns the internal ID of the first hit in your result. This is the document you want to remove (using indexReader.delete(...)). There are no further documents in your result unless your key is not unique, so hits.length() returns 0 or 1.

Regards,
Wolf-Dietrich Materna
Categorization
Hi, How can I do a categorization of the results ? Is it possible with the Lucene API ? Thanks, William.
How to use Highlighter concretly ?
Hi!

I'd like to use the Highlighter class to show a highlighted summary after a search, but I don't know how to use the class correctly. I found this piece of code, which works well:

    public class TestHighlighter {
        public static void main(String[] args) {
            try {
                Analyzer a = new StandardAnalyzer();
                Query q = QueryParser.parse("jennifer lopez", "cached", a);
                String s = "the unofficial home page Britney Spears Elizabeth Hurley Kirsten Dunst "
                    + "Anna Kournikova Katie Holmes Katherine Heigl Jessica Alba Alyson Hannigan Jennifer "
                    + "Lopez Sarah Michelle Gellar";
                Highlighter highlighter = new Highlighter(new QueryScorer(q));
                TokenStream tokenstream = a.tokenStream("cached", new java.io.StringReader(s));
                String summary = highlighter.getBestFragments(tokenstream, s, 2, "...");
                System.out.println("summary : " + summary);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

But I don't know how to adapt it. I have run a search and obtained a Hits instance, and now I want to show a highlighted summary for each document in the hits. So it should look something like this:

    Highlighter highlighter;
    TokenStream tokenstream;
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        String contents = I DON'T KNOW HOW TO GET THE CONTENTS OF MY DOC
        highlighter = new Highlighter(new QueryScorer(query));
        tokenstream = analyzer.tokenStream("contents", new java.io.StringReader(contents));
        String summary = highlighter.getBestFragments(tokenstream, contents, 2, "...");
        System.out.println("summary : " + summary);
    }

Here are my questions. First, is this the right way to get a highlighted summary? And if it is, what is the best way to get the contents of my document (the same way I used when indexing it, or another way)? (To be more precise, I use Lucene to index PDF, DOC and TXT files. These documents can be about 5 MB in size.)

Thanks.
Re: How to use Highlighter concretly ?
Hello Olivier,

You already have your Document instance from Hits. Use Document's get(String) method to get the content of the given field. Note that the Field MUST be stored, not just indexed. For example:

    String text = doc.get("myStoredTextField");

See: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html#get(java.lang.String)

I suggest you comment out all Highlighter-related code in your application and first make sure you know how to, and really can, get the contents of the fields you later want to highlight. Add the Highlighter code only after you get this first step working.

Otis
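To make the dependency on the stored text concrete, here is a minimal sketch of what a fragment-based summary does once the field's text is available as a String. It is a toy stand-in written for this thread, not Lucene's Highlighter: `ToyHighlighter`, its constructor and `bestFragments` are hypothetical names, fragments are assumed to be fixed-size runs of whitespace-separated words, and the `<B>` markup is chosen only for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy summary builder (hypothetical): same idea as a highlighter, in miniature. */
class ToyHighlighter {
    private final Set<String> terms = new HashSet<>();
    private final int wordsPerFragment;

    ToyHighlighter(String query, int wordsPerFragment) {
        for (String t : query.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t.toLowerCase(Locale.ROOT));
            }
        }
        this.wordsPerFragment = wordsPerFragment;
    }

    /** Returns up to maxFragments matching fragments, best first, joined by separator. */
    String bestFragments(String text, int maxFragments, String separator) {
        String[] words = text.trim().split("\\s+");
        int count = (words.length + wordsPerFragment - 1) / wordsPerFragment;
        String[] fragments = new String[count];
        int[] scores = new int[count];

        // Cut the text into fragments, marking and counting query-term matches.
        for (int f = 0; f < count; f++) {
            int start = f * wordsPerFragment;
            int end = Math.min(start + wordsPerFragment, words.length);
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < end; i++) {
                if (i > start) {
                    sb.append(' ');
                }
                if (terms.contains(words[i].toLowerCase(Locale.ROOT))) {
                    sb.append("<B>").append(words[i]).append("</B>");
                    scores[f]++;
                } else {
                    sb.append(words[i]);
                }
            }
            fragments[f] = sb.toString();
        }

        // Repeatedly pick the highest-scoring unused fragment that has a match.
        boolean[] used = new boolean[count];
        StringBuilder out = new StringBuilder();
        for (int picked = 0; picked < maxFragments; picked++) {
            int best = -1;
            for (int f = 0; f < count; f++) {
                if (!used[f] && scores[f] > 0 && (best < 0 || scores[f] > scores[best])) {
                    best = f;
                }
            }
            if (best < 0) {
                break; // nothing left that matches
            }
            used[best] = true;
            if (picked > 0) {
                out.append(separator);
            }
            out.append(fragments[best]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ToyHighlighter h = new ToyHighlighter("beta", 3);
        System.out.println(h.bestFragments("alpha beta gamma delta beta epsilon", 2, "..."));
    }
}
```

The sketch takes the full text as input, which is why the field has to be stored (or the text re-read from the original file) before any summary can be produced.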
JavaOne and Lucene
I'm presenting Lucene in Action Tuesday morning next week at JavaOne (TS-2994). Any other Luceners going to JavaOne? Erik
carrot2 - Re: Categorization
Otis Gospodnetic wrote:

> Hello William,
> Lucene does not have a categorization engine, but you may want to look at Carrot2 (http://sourceforge.net/projects/carrot2/)
> Otis

This may be getting off topic, but maybe not: I can't find an example of how to use Carrot2. It builds easily enough, but there's no obvious example of what it takes as input (documents?) and what it returns as output (some list of clustered docs?). I want to use the local interface to it and hook it into Lucene.

thx,
Dave

--- William W [EMAIL PROTECTED] wrote:

> Hi, How can I do a categorization of the results ? Is it possible with the Lucene API ? Thanks, William.
RE: carrot2 - Re: Categorization
Hi,

Carrot seems to be very interesting, but I didn't find a simple example :( I will try to use it! :)

Thx,
William.

From: David Spencer [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Subject: carrot2 - Re: Categorization
Date: Wed, 23 Jun 2004 11:50:22 -0700

> May be getting off topic - but maybe not..I can't find an example of how to use Carrot2. It builds easy enough, but there's no obvious example what it takes as input (documents?) and what it returns as output (some list of clustered docs?). I want to use the local interface to it and hook it into Lucene.
> thx, Dave
Re: carrot2 - Re: Categorization
William W wrote:

> Hi, Carrot seems to be very interesting but I didn't find a simple example :( I will try to use it ! :)

I can't find an example either, but after going through their source I think the heart of it is com.dawidweiss.carrot.filter.stc.algorithm.STCEngine, and com.dawidweiss.carrot.filter.stc.Processor is a class that drives it.

As for the Lucene hook: I'm trying to integrate the two. I think this is how it would be done: get the search results from Lucene, then set up STCEngine the way Processor does.

> Thx, william.
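The integration shape described above (search results in, labelled groups out) can be sketched independently of either library. The code below is only a naive stand-in written for this thread: it is not Carrot2's API and not suffix tree clustering, `NaiveResultGrouper` and `group` are hypothetical names, and the snippets are assumed to be plain strings already pulled from the hits. It groups result positions under every word shared by at least `minDocs` snippets, which is enough to show what the input and output of such a step might look like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Naive stand-in for a result-clustering step (hypothetical, not Carrot2). */
class NaiveResultGrouper {

    /**
     * Maps each word that occurs in at least minDocs snippets to the
     * positions of the snippets containing it. Labels come out sorted.
     */
    static Map<String, List<Integer>> group(List<String> snippets, int minDocs) {
        Map<String, List<Integer>> byWord = new TreeMap<>();
        for (int i = 0; i < snippets.size(); i++) {
            String[] words = snippets.get(i).toLowerCase(Locale.ROOT).split("\\W+");
            for (String w : words) {
                if (w.isEmpty()) {
                    continue; // split() can yield a leading empty token
                }
                List<Integer> ids = byWord.computeIfAbsent(w, k -> new ArrayList<>());
                // Record each snippet once per word, even if the word repeats.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != i) {
                    ids.add(i);
                }
            }
        }
        // Keep only words shared by enough snippets to act as a group label.
        byWord.values().removeIf(ids -> ids.size() < minDocs);
        return byWord;
    }

    public static void main(String[] args) {
        List<String> snippets = new ArrayList<>();
        snippets.add("Java search engine");
        snippets.add("Lucene search library");
        snippets.add("Java clustering");
        System.out.println(group(snippets, 2)); // {java=[0, 2], search=[0, 1]}
    }
}
```

A real clustering engine would work on phrases and would rank and merge the groups; the sketch only fixes the data flow, with the list positions standing in for hit numbers.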