Solr tika and extracting formatting info

2009-07-11 Thread S.Selvam
Hi all,

I am using Solr with Tika to index various file formats. I have used the
ExtractingRequestHandler to get the data and render it in a GUI built with
VB.NET. Now my requirement is to render the file as it is (with all its
formatting, e.g. tables) or with a look close to the original file. So I need
to receive all the formatting information of the file posted to Tika, not only
the text content. Is that possible with Tika, or do I need to use some other
module?

I would like to get your suggestions regarding this.
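
One way to see how much structure Tika can return is the extract-only mode of
the ExtractingRequestHandler: with extractOnly=true the handler returns Tika's
XHTML rendition of the file (which keeps structural markup such as headings,
lists and tables, though not the full visual formatting of the original)
instead of indexing it. A rough, untested sketch follows; the host, path and
content type are placeholders, and the parameter name is the one used by Solr
1.4's ExtractingRequestHandler, which may differ on older builds:

    // Rough sketch only: post a local file to the ExtractingRequestHandler in
    // extract-only mode and print the response, which embeds Tika's XHTML
    // rendition of the document.  Host, port and Content-Type are placeholders.
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ExtractOnlyDemo {
        public static void main(String[] args) throws IOException {
            URL url = new URL("http://localhost:8983/solr/update/extract?extractOnly=true");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/msword"); // type of the posted file

            // Stream the local file (first command-line argument) as the request body.
            OutputStream out = conn.getOutputStream();
            InputStream in = new FileInputStream(args[0]);
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
            in.close();
            out.close();

            // Print the XML response; the extracted XHTML (with table, list and
            // heading elements where Tika found them) is embedded in it as a string.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }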


-- 
Yours,
S.Selvam


Solr tika and posting .pst files

2009-07-20 Thread S.Selvam
Hi,

I am using Solr with Tika to post various files. When I try to post a .pst
file (Microsoft Outlook), the file is posted but the indexed document does not
contain any data. I could not find anything useful after googling.

Regarding the Solr schema, I use:

  1) id
  2) content (this is the default field)

Do I need to configure Tika to be able to handle the .pst format? I would like
to hear your suggestions.

Note: 1) I use VB.NET as a front-end tool.
      2) Other files' contents are properly mapped to the content field.

-- 
Yours,
S.Selvam


solr-duplicate post management

2009-01-11 Thread S.Selvam Siva
Hi,

I have six fields in my Solr schema:
   1) id (unique key)
   2) urlid
   3) url
and so on up to 6).

We have been posting 3 to 4 lakh (300,000 to 400,000) .xml files per day, of
which about 50% are duplicate posts.

What I need is to log the existing urlid and the new urlid (of course the two
will not be the same) whenever an .xml file with the same id (unique field) is
posted.

I want to do this by modifying the Solr source. Which file do I need to modify
so that I can get the above details into the log?

I tried DirectUpdateHandler2.java (which removes the duplicate entries), but my
efforts were in vain.


-- 
Yours,
S.Selvam


Re: solr-duplicate post management

2009-01-22 Thread S.Selvam Siva
On Thu, Jan 22, 2009 at 7:12 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : What I need is to log the existing urlid and the new urlid (of course the
 : two will not be the same) whenever an .xml file with the same id (unique
 : field) is posted.
 :
 : I want to do this by modifying the Solr source. Which file do I need to
 : modify so that I can get the above details into the log?
 :
 : I tried DirectUpdateHandler2.java (which removes the duplicate entries),
 : but my efforts were in vain.

 DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's
 IndexWriter.updateDocument method when you have a uniqueKey and you aren't
 allowing duplicates -- this method doesn't give you any way to access the
 old document(s) that had that existing key.

 The easiest way to make a change like what you are interested in might be
 an UpdateProcessor that does a lookup/search for the uniqueKey of each
 document about to be added to see if it already exists.  that's probably
 about as efficient as you can get, and would be nicely encapsulated.

 You might also want to take a look at SOLR-799, where some work is being
 done to create UpdateProcessors that can do near duplicate detection...

 http://wiki.apache.org/solr/Deduplication
 https://issues.apache.org/jira/browse/SOLR-799






 -Hoss
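
A rough sketch of the UpdateProcessor approach described above, assuming the
Solr 1.x UpdateRequestProcessor API; the class name, logger, and field names
are illustrative:

    // Sketch only: before each add, look up the incoming unique key and, if a
    // document with that key already exists, log the old and new urlid values.
    import java.io.IOException;
    import java.util.logging.Logger;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class LogDuplicateUrlidProcessorFactory extends UpdateRequestProcessorFactory {

        public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                  SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new LogDuplicateUrlidProcessor(req, next);
        }

        static class LogDuplicateUrlidProcessor extends UpdateRequestProcessor {
            private static final Logger log =
                    Logger.getLogger(LogDuplicateUrlidProcessor.class.getName());
            private final SolrQueryRequest req;

            LogDuplicateUrlidProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
                super(next);
                this.req = req;
            }

            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument newDoc = cmd.getSolrInputDocument();
                String id = (String) newDoc.getFieldValue("id");
                SolrIndexSearcher searcher = req.getSearcher();
                int docid = searcher.getFirstMatch(new Term("id", id));
                if (docid >= 0) {   // a committed document with this key already exists
                    Document oldDoc = searcher.doc(docid);
                    log.info("duplicate id=" + id
                            + " old urlid=" + oldDoc.get("urlid")
                            + " new urlid=" + newDoc.getFieldValue("urlid"));
                }
                super.processAdd(cmd);   // continue with the normal add/overwrite
            }
        }
    }

The factory would be wired into an update processor chain in solrconfig.xml.
Note that the lookup only sees documents visible to the last committed
searcher, so duplicates added since the previous commit will not be caught.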


Thank you for your response. I will try it out.



-- 
Yours,
S.Selvam


Re: solr-duplicate post management

2009-01-24 Thread S.Selvam Siva
On Thu, Jan 22, 2009 at 2:33 PM, S.Selvam Siva s.selvams...@gmail.com wrote:



 On Thu, Jan 22, 2009 at 7:12 AM, Chris Hostetter hossman_luc...@fucit.org
  wrote:


 : What I need is to log the existing urlid and the new urlid (of course the
 : two will not be the same) whenever an .xml file with the same id (unique
 : field) is posted.
 :
 : I want to do this by modifying the Solr source. Which file do I need to
 : modify so that I can get the above details into the log?
 :
 : I tried DirectUpdateHandler2.java (which removes the duplicate entries),
 : but my efforts were in vain.

 DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's
 IndexWriter.updateDocument method when you have a uniqueKey and you aren't
 allowing duplicates -- this method doesn't give you any way to access the
 old document(s) that had that existing key.

 The easiest way to make a change like what you are interested in might be
 an UpdateProcessor that does a lookup/search for the uniqueKey of each
 document about to be added to see if it already exists.  that's probably
 about as efficient as you can get, and would be nicely encapsulated.

 You might also want to take a look at SOLR-799, where some work is being
 done to create UpdateProcessors that can do near duplicate detection...

 http://wiki.apache.org/solr/Deduplication
 https://issues.apache.org/jira/browse/SOLR-799






 -Hoss




Hi, I added some code to *DirectUpdateHandler2.java's doDeletions()* (Solr
1.2.0) and got the solution I wanted (logging the duplicate post entry, i.e.
the old field and the new field of the duplicate post):


   Document d1 = searcher.doc(prev);        // existing doc, about to be deleted
   Document d2 = searcher.doc(tdocs.doc()); // new doc replacing it
   String oldname = d1.get("name");
   String id1 = d1.get("id");
   String newname = d2.get("name");
   String id2 = d2.get("id");               // equals id1, since the unique key matches
   out3.write(id1 + "," + oldname + "," + newname + "\n");

But I don't know whether the performance of Solr will be affected by this.
Any comments on the performance of the above solution are welcome...
-- 
Yours,
S.Selvam


Re: solr-duplicate post management

2009-01-26 Thread S.Selvam Siva
On Tue, Jan 27, 2009 at 5:03 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : Hi, I added some code to *DirectUpdateHandler2.java's doDeletions()* (Solr
 : 1.2.0) and got the solution I wanted (logging the duplicate post entry,
 : i.e. the old field and the new field of the duplicate post):
 :
 :    Document d1 = searcher.doc(prev);        // existing doc, about to be deleted
 :    Document d2 = searcher.doc(tdocs.doc()); // new doc replacing it
 :    String oldname = d1.get("name");
 :    String id1 = d1.get("id");
 :    String newname = d2.get("name");
 :    String id2 = d2.get("id");               // equals id1, since the unique key matches
 :    out3.write(id1 + "," + oldname + "," + newname + "\n");
 :
 : But I don't know whether the performance of Solr will be affected by this.
 : Any comments on the performance of the above solution are welcome...

 It's probably going to be painfully slow -- you're probably going to be a
 lot better off avoiding the use of searcher.doc and instead sticking with
 the FieldCache, but there are trade-offs there as well; it's largely
 going to depend on how often you're doing adds vs. commits.

 BTW: as I mentioned before, it probably makes more sense to implement this
 in an UpdateProcessor instead of hacking DirectUpdateHandler2 ... that way
 you'll be able to upgrade Solr without worrying about losing/redoing
 your changes.




 -Hoss
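
For reference, a rough sketch of the FieldCache variant suggested above. It
assumes the searcher in doDeletions() exposes its IndexReader (getIndexReader()
on a plain Lucene IndexSearcher) and that id and name are indexed,
single-valued fields, which FieldCache.getStrings() requires:

    // Sketch only: read both documents' values from the FieldCache instead of
    // loading stored fields with searcher.doc().  The arrays are cached per
    // IndexReader, so repeated lookups are cheap until the reader is reopened
    // (i.e. after a commit), which is the adds-vs-commits trade-off above.
    import java.io.IOException;
    import java.io.Writer;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    class DuplicateLogger {
        // prev is the docid of the existing document and newDocId the docid of
        // the incoming one, as in doDeletions(); out corresponds to out3 above.
        static void logDuplicate(IndexReader reader, Writer out, int prev, int newDocId)
                throws IOException {
            String[] ids   = FieldCache.DEFAULT.getStrings(reader, "id");
            String[] names = FieldCache.DEFAULT.getStrings(reader, "name");
            out.write(ids[prev] + "," + names[prev] + "," + names[newDocId] + "\n");
        }
    }

It could then be called from doDeletions() as
DuplicateLogger.logDuplicate(searcher.getIndexReader(), out3, prev, tdocs.doc()).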



Thanks a lot, Chris Hostetter.

I realize I should move this into an UpdateProcessor for the best performance,
but I am new to Solr (I started working on it a few months back) and found
modifying DirectUpdateHandler2 a bit easier. So, given the current importance
of catching duplicate posts, I made the above modification to
DirectUpdateHandler2 for now.

Note: for your information, we are committing after every 1000 posts.



-- 
Yours,
S.Selvam