Solr tika and extracting formatting info
Hi all, I am using Solr with Tika to index various file formats. I use ExtractingRequestHandler to extract the data and render it in a GUI written in VB.NET. My new requirement is to render each file as it is, with all its formatting (tables, for example), or at least close to the look of the original. So I need to receive all the formatting information of the file posted to Tika, not only the text. Is that possible with Tika, or do I need to use another module? I would like to hear your suggestions. -- Yours, S.Selvam
Solr tika and posting .pst files
Hi, I am using Solr with Tika to post various files. When I try to post a .pst file (Outlook), the file is posted but the indexed document does not contain any data. I could not find anything useful after googling. My Solr schema has two fields: 1) id 2) content (the default field). Do I need to configure Tika to handle the .pst format? I would like to hear your suggestions. Note: 1) I use VB.NET as the front-end tool. 2) The contents of other file types are mapped to the content field properly. -- Yours, S.Selvam
solr-duplicate post management
Hi, I have 6 fields in my Solr schema: 1) id (unique key) 2) urlid 3) url, and so on up to 6). We post 3 to 4 lakh .xml files per day, about 50% of which are duplicate posts. What I need is to log the existing urlid and the new urlid (the two will of course differ) whenever an .xml file with an already-indexed id (the unique field) is posted. I want to do this by modifying the Solr source. Which file do I need to modify to get these details into the log? I tried DirectUpdateHandler2.java (which removes the duplicate entries), but my efforts were in vain. -- Yours, S.Selvam
Re: solr-duplicate post management
On Thu, Jan 22, 2009 at 7:12 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: what I need is to log the existing urlid and new urlid (of course both will
: not be the same) when a .xml file with the same id (unique field) is posted.
:
: I want to do this by modifying the Solr source. Which file do I need to
: modify so that I can get the above details in the log?
:
: I tried with DirectUpdateHandler2.java (which removes the duplicate
: entries), but my efforts were in vain.

DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's IndexWriter.updateDocument method when you have a uniqueKey and you aren't allowing duplicates -- this method doesn't give you any way to access the old document(s) that had that existing key.

The easiest way to make a change like what you are interested in might be an UpdateProcessor that does a lookup/search for the uniqueKey of each document about to be added to see if it already exists. That's probably about as efficient as you can get, and would be nicely encapsulated.

You might also want to take a look at SOLR-799, where some work is being done to create UpdateProcessors that can do near duplicate detection...

http://wiki.apache.org/solr/Deduplication
https://issues.apache.org/jira/browse/SOLR-799

-Hoss

Thank you for your response. I will try it out.

-- Yours, S.Selvam
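The lookup-before-add idea suggested in this reply can be sketched in plain Java. This is only an illustration under stated assumptions, not Solr's UpdateProcessor API: the class names, the processAdd method and the in-memory map that stands in for the index are all hypothetical. In Solr, the map lookup would be a search on the uniqueKey field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an update processor. The map plays the role of
// the index (uniqueKey id -> stored urlid); it is not a Solr or Lucene class.
class DuplicateLoggingProcessor {
    private final Map<String, String> urlidById = new HashMap<>();
    private final List<String> duplicateLog = new ArrayList<>();

    /** Adds or overwrites a document, logging old and new urlid if the id already exists. */
    void processAdd(String id, String urlid) {
        String existing = urlidById.get(id);   // lookup before the add
        if (existing != null) {
            duplicateLog.add(id + "," + existing + "," + urlid);
        }
        urlidById.put(id, urlid);              // overwrite, as an update on a uniqueKey would
    }

    List<String> duplicateLog() {
        return duplicateLog;
    }
}

class DuplicateLoggingDemo {
    public static void main(String[] args) {
        DuplicateLoggingProcessor processor = new DuplicateLoggingProcessor();
        processor.processAdd("doc-1", "u100");
        processor.processAdd("doc-2", "u200");
        processor.processAdd("doc-1", "u300"); // same id, different urlid
        for (String line : processor.duplicateLog()) {
            System.out.println(line);          // prints: doc-1,u100,u300
        }
    }
}
```

Keeping the check in its own class, separate from the code that writes to the index, is what makes this approach "nicely encapsulated": the logging can be removed or changed without touching the update handler.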
Re: solr-duplicate post management
On Thu, Jan 22, 2009 at 2:33 PM, S.Selvam Siva s.selvams...@gmail.com wrote:

On Thu, Jan 22, 2009 at 7:12 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: what I need is to log the existing urlid and new urlid (of course both will
: not be the same) when a .xml file with the same id (unique field) is posted.
:
: I want to do this by modifying the Solr source. Which file do I need to
: modify so that I can get the above details in the log?
:
: I tried with DirectUpdateHandler2.java (which removes the duplicate
: entries), but my efforts were in vain.

DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's IndexWriter.updateDocument method when you have a uniqueKey and you aren't allowing duplicates -- this method doesn't give you any way to access the old document(s) that had that existing key.

The easiest way to make a change like what you are interested in might be an UpdateProcessor that does a lookup/search for the uniqueKey of each document about to be added to see if it already exists. That's probably about as efficient as you can get, and would be nicely encapsulated.

You might also want to take a look at SOLR-799, where some work is being done to create UpdateProcessors that can do near duplicate detection...

http://wiki.apache.org/solr/Deduplication
https://issues.apache.org/jira/browse/SOLR-799

-Hoss

Hi, I added some code to *DirectUpdateHandler2.java's doDeletions()* (Solr 1.2.0) and got the solution I wanted: logging each duplicate post, i.e. the old and new field values of the duplicated document.

    Document d1 = searcher.doc(prev);          // existing doc to be deleted
    Document d2 = searcher.doc(tdocs.doc());   // new doc
    String oldname = d1.get("name");
    String id1 = d1.get("id");
    String newname = d2.get("name");
    String id2 = d2.get("id");
    out3.write(id1 + "," + oldname + "," + newname + "\n");

But I don't know whether the performance of Solr will be affected by this. Any comments on the performance of the above solution are welcome...

-- Yours, S.Selvam
Re: solr-duplicate post management
On Tue, Jan 27, 2009 at 5:03 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Hi, I added some code to *DirectUpdateHandler2.java's doDeletions()* (Solr
: 1.2.0) and got the solution I wanted (logging each duplicate post, i.e. the old
: field and new field of the duplicate post).
:
:     Document d1 = searcher.doc(prev);          // existing doc to be deleted
:     Document d2 = searcher.doc(tdocs.doc());   // new doc
:     String oldname = d1.get("name");
:     String id1 = d1.get("id");
:     String newname = d2.get("name");
:     String id2 = d2.get("id");
:     out3.write(id1 + "," + oldname + "," + newname + "\n");
:
: But I don't know whether the performance of Solr will be affected by this.
: Any comments on the performance of the above solution are welcome...

It's probably going to be painfully slow -- you're probably going to be a lot better off avoiding searcher.doc and sticking with the FieldCache instead, but there are trade-offs there as well; it's largely going to depend on how often you're doing adds vs. commits.

BTW: as I mentioned before, it probably makes more sense to implement this in an UpdateProcessor instead of hacking DirectUpdateHandler2 ... that way you'll be able to upgrade Solr without worrying about losing/redoing your changes.

-Hoss

Thanks a lot, Chris Hostetter. I realize I should move this to an UpdateProcessor for best performance, but I am new to Solr (I started working on it a few months ago) and found modifying DirectUpdateHandler2 a bit easier. Also, given how important finding duplicate posts currently is, I made the above modification to DirectUpdateHandler2 for now.

Note: for your information, we commit after every 1000 posts.

-- Yours, S.Selvam
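The adds-versus-commits trade-off mentioned in this exchange can be illustrated with a toy model. It assumes that a field cache is tied to one view of the committed index and has to be rebuilt after each commit; the class and method names below are hypothetical and are not Lucene's FieldCache API. Lookups between commits are cheap map reads, while each commit triggers one full rebuild on the next lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: plain maps stand in for the committed index and for a
// per-commit field cache. None of these names are Lucene or Solr APIs.
class PerCommitFieldCache {
    private final Map<String, String> committed = new HashMap<>(); // id -> urlid visible to lookups
    private final Map<String, String> pending = new HashMap<>();   // adds since the last commit
    private Map<String, String> urlidCache = null;                  // built lazily after each commit
    private int rebuilds = 0;

    void add(String id, String urlid) {
        pending.put(id, urlid);
    }

    /** Makes pending adds visible and invalidates the cache. */
    void commit() {
        committed.putAll(pending);
        pending.clear();
        urlidCache = null;
    }

    /** Returns the committed urlid for id, or null; rebuilds the cache at most once per commit. */
    String cachedUrlid(String id) {
        if (urlidCache == null) {
            urlidCache = new HashMap<>(committed); // the expensive step: one pass over the whole index
            rebuilds++;
        }
        return urlidCache.get(id);
    }

    int rebuilds() {
        return rebuilds;
    }
}

class PerCommitFieldCacheDemo {
    public static void main(String[] args) {
        PerCommitFieldCache index = new PerCommitFieldCache();
        for (int i = 0; i < 3000; i++) {
            String id = "doc-" + (i % 1500);   // the second half re-posts existing ids
            index.cachedUrlid(id);             // duplicate check before each add
            index.add(id, "u" + i);
            if ((i + 1) % 1000 == 0) {
                index.commit();                // commit after every 1000 posts
            }
        }
        System.out.println("cache rebuilds: " + index.rebuilds()); // prints: cache rebuilds: 3
    }
}
```

In this model, 3000 posts with a commit after every 1000 cost three cache rebuilds, so the cache pays off when there are many lookups per commit and becomes expensive when commits are frequent relative to adds. Note also that a document added but not yet committed is invisible to the cache in this model.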