Re: Solr Cell and Deduplication - Get ID of doc

2010-03-02 Thread Bill Engle
Thanks for the responses.  This is exactly what I had to resort to.  I will
definitely put in a feature request to get the generated ID back from the
extract request.

I am doing this with PHP cURL for extraction and the PECL Solr extension
for querying.  I then save the unique id and dupe hash in a MySQL table,
which I check against after the doc is indexed in Solr.  If it is a dupe I
delete the Solr record and discard the file.  My problem now is that the
dupe hash sometimes comes back NULL from Solr, although it is there when I
check through Solr Admin.  I am working to isolate this now.

I had to set Solr to ALLOW duplicates because I have to somehow know that
the file is a dupe and then remove the duplicate files on my filesystem.
Based on the extract response I have no way of knowing this if duplicates
are disallowed.
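
For anyone following along, the bookkeeping described above might look
roughly like the sketch below. It is not Bill's actual code: it uses Python
with an in-memory SQLite database in place of PHP and MySQL, and the table
and function names (indexed_docs, open_registry, record_or_find_dupe) are
made up for illustration. It also refuses an empty signature, so the NULL
case mentioned above is never mistaken for a match.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Create (if needed) the table mapping a content signature to a Solr id."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexed_docs ("
        " signature TEXT PRIMARY KEY,"
        " doc_id TEXT NOT NULL)"
    )
    return conn


def record_or_find_dupe(conn, doc_id, signature):
    """Return the id of an earlier doc with this signature, or None if new."""
    if not signature:
        # A missing hash must not be treated as "no duplicate found".
        raise ValueError("no signature returned for doc %s" % doc_id)
    row = conn.execute(
        "SELECT doc_id FROM indexed_docs WHERE signature = ?", (signature,)
    ).fetchone()
    if row is not None:
        # Duplicate: the caller deletes doc_id from Solr and discards the file.
        return row[0]
    conn.execute(
        "INSERT INTO indexed_docs (signature, doc_id) VALUES (?, ?)",
        (signature, doc_id),
    )
    conn.commit()
    return None
```

A return value of None means the document is new and has been recorded; any
other value is the id of the earlier document with the same content.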

-Bill






Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter

: You could create your own unique ID and pass it in with the
: literal.field=value feature.

By which Lance means you could specify a unique value in a different
field from your uniqueKey field, and then query on that field:value pair
to get the doc after it's been added -- but that query will only work
until some other version of the doc (with some other value) overwrites it.
So you'd essentially have to query for the field:value to look up the
uniqueKey.
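
A rough sketch of that two-step lookup, in Python with only the standard
library. The field name upload_token_s is hypothetical (it would have to
exist in your schema), the base URL is the one from the wiki example, and
the uniqueKey field is assumed to be called "id". The code only builds the
two request URLs and parses a JSON (wt=json) select response; actually
sending the requests is left out.

```python
import json
import uuid
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # base URL as in the wiki example
TOKEN_FIELD = "upload_token_s"       # hypothetical secondary field, not the uniqueKey


def new_token():
    """Generate a one-off value to tag this upload with."""
    return uuid.uuid4().hex


def extract_url(token, base=SOLR):
    """URL for the extract request, passing the token as a literal field."""
    params = [("literal." + TOKEN_FIELD, token), ("commit", "true")]
    return base + "/update/extract?" + urlencode(params)


def lookup_url(token, base=SOLR):
    """URL for the follow-up query that finds the doc by its token."""
    params = [("q", '%s:"%s"' % (TOKEN_FIELD, token)), ("fl", "id"), ("wt", "json")]
    return base + "/select?" + urlencode(params)


def unique_key_from(response_text):
    """Pull the uniqueKey value out of a wt=json select response, if any."""
    docs = json.loads(response_text)["response"]["docs"]
    return docs[0]["id"] if docs else None
```

As noted above, the lookup only works until another version of the doc
(carrying a different token) overwrites the one you added.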

It seems like it should definitely be feasible for the
Update RequestHandlers to return the uniqueKeyField values for all the
added docs (regardless of whether the key was included in the request or
added by an UpdateProcessor) -- but I'm not sure how that would fit in with
the SolrJ API.

Would you mind opening a feature request in Jira?



-Hoss



Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Lance Norskog
To quote from the wiki,
http://wiki.apache.org/solr/ExtractingRequestHandler

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' \
  -F "myfile=@tutorial.html"

This runs the extractor on your input file (in this case an HTML
file). It then stores the generated document with the id field (the
uniqueKey declared in schema.xml) set to 'doc1'. This way, you do not
rely on the ExtractingRequestHandler to create a unique key for you.
This command throws away that generated key.






-- 
Lance Norskog
goks...@gmail.com


Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter


: To quote from the wiki,
...
That's all true ... but Bill explicitly said he wanted to use 
SignatureUpdateProcessorFactory to generate a uniqueKey from the content 
field post-extraction so he could dedup documents with the same content 
... his question was how to get that key after adding a doc.

Using a unique literal.field value will work -- but only as the value of 
a secondary field that he can then query on to get the uniqueKeyField 
value.





-Hoss



Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Bill Engle
Any thoughts on this? I would like to get the id back in the response after
indexing.  My initial thought was to search for the doc id based on the
attr_stream_name after indexing, but on rereading my message I see I had
mentioned that the attr_stream_name (file name) may be different, so that is
unreliable.  My only option is to somehow return the id in the XML
response.  Any guidance is greatly appreciated.

-Bill




Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Lance Norskog
You could create your own unique ID and pass it in with the
literal.field=value feature.

http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters






-- 
Lance Norskog
goks...@gmail.com


Solr Cell and Deduplication - Get ID of doc

2010-02-24 Thread Bill Engle
Hi -

New Solr user here.  I am using Solr Cell to index files (PDF, doc, docx,
txt, htm, etc.) and there is a good chance that a new file will have
duplicate content but not necessarily the same file name.  To avoid this I
am using the deduplication feature of Solr.

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">id</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">attr_content</str>
      <str name="signatureClass">org.apache.solr.update.processor.</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
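
To illustrate what a chain like this is for: the id is derived from the
content field, so two files with the same extracted text get the same id
whatever they are named. The Python sketch below shows only that idea. It
is not Solr's implementation; MD5 is chosen here purely for illustration,
and the signature class Solr actually uses may normalize or hash the text
differently.

```python
import hashlib


def content_signature(fields):
    """Hex MD5 digest over the given field values, taken in order."""
    digest = hashlib.md5()
    for value in fields:
        digest.update(value.encode("utf-8"))
    return digest.hexdigest()


# Same extracted text gives the same signature, regardless of file name.
same_a = content_signature(["Quarterly report text"])
same_b = content_signature(["Quarterly report text"])
other = content_signature(["A different document"])
```

Because the value is computed inside Solr after extraction, the client
cannot know it in advance, which is what the question below is about.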

How do I get the id value after Solr processing?  Is there some way to
modify the curl response so that the id is returned?  I need this id because
I would like to rename the file to the id value.  I could probably do a Solr
search after the fact to get the id field based on the attr_stream_name, but
I would like to do only one request.

curl 'http://localhost:8080/solr/update/extract?uprefix=attr_&fmap.content=attr_content&commit=true' \
  -F "myfile=@myfile.pdf"

Thanks,
Bill