Re: Solr Cell and Deduplication - Get ID of doc

2010-03-02 Thread Bill Engle
Thanks for the responses. This is exactly what I had to resort to. I will definitely put in a feature request to get the generated ID back from the extract request. I am doing this with PHP cURL for extraction and pecl php solr for querying. I am then saving the unique id and dupe hash in a MyS

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter
: To quote from the wiki, ... That's all true ... but Bill explicitly said he wanted to use SignatureUpdateProcessorFactory to generate a uniqueKey from the content field post-extraction so he could dedup documents with the same content ... his question was how to get that key after ad

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Lance Norskog
To quote from the wiki, http://wiki.apache.org/solr/ExtractingRequestHandler curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html" This runs the extractor on your input file (in this case an HTML file). It then stores the generated document with t

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter
: You could create your own unique ID and pass it in with the : literal.field=value feature. By which Lance means you could specify a unique value in a different field from your uniqueKey field, and then query on that field:value pair to get the doc after it's been added -- but that query will
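
A minimal sketch of the lookup step described in this message, assuming the token was stored in a field named upload_token_s and that the uniqueKey field is id (both names are hypothetical, as is the build_lookup_url helper). It only builds the select URL; the document has to be committed before any search can see it.

```python
from urllib.parse import urlencode


def build_lookup_url(base_url, token_field, token, key_field="id"):
    """Build a Solr select URL that finds a document by a client-supplied token.

    token_field and key_field are assumptions about the schema, not names
    taken from the thread.
    """
    # Escape backslashes and double quotes, then quote the value so the
    # query parser treats it as one literal term.
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'{token_field}:"{escaped}"',
        "fl": key_field,  # return only the uniqueKey field
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


lookup_url = build_lookup_url("http://localhost:8983/solr", "upload_token_s", "abc123")
print(lookup_url)
```

The id value in the response would then be the signature that SignatureUpdateProcessorFactory generated, assuming the chain is configured to write the signature into the uniqueKey field.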

Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Lance Norskog
You could create your own unique ID and pass it in with the literal.field=value feature. http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters On Fri, Feb 26, 2010 at 7:56 AM, Bill Engle wrote: > Any thoughts on this? I would like to get the id back in the request after > indexin
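
As a rough illustration of the literal.field=value suggestion, the sketch below builds an extract URL that carries a client-generated token. The field name upload_token_s and the build_extract_url helper are hypothetical; only the /update/extract path and the literal.* and commit parameters come from the wiki page linked above.

```python
import uuid
from urllib.parse import urlencode


def build_extract_url(base_url, token_field, token=None):
    """Return (url, token) for an ExtractingRequestHandler request.

    The token is sent as literal.<token_field>=<token>, so it is stored on
    the document alongside the extracted content. token_field is an assumed
    schema field, not one named in the thread.
    """
    if token is None:
        token = uuid.uuid4().hex  # random 32-character hex string
    params = {f"literal.{token_field}": token, "commit": "true"}
    return f"{base_url}/update/extract?{urlencode(params)}", token


extract_url, token = build_extract_url(
    "http://localhost:8983/solr", "upload_token_s", token="abc123"
)
print(extract_url)
```

The file itself would still be posted to this URL as a multipart upload, as in the curl example from the wiki; the caller keeps the returned token to find the document afterwards.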

Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Bill Engle
Any thoughts on this? I would like to get the id back in the request after indexing. My initial thoughts were to do a search to get the docid based on the attr_stream_name after indexing but now that I reread my message I mentioned the attr_stream_name (file_name) may be different so that is unre

Solr Cell and Deduplication - Get ID of doc

2010-02-24 Thread Bill Engle
Hi - New Solr user here. I am using Solr Cell to index files (PDF, doc, docx, txt, htm, etc.) and there is a good chance that a new file will have duplicate content but not necessarily the same file name. To avoid this I am using the deduplication feature of Solr. true id