Re: Solr Cell and Deduplication - Get ID of doc
Thanks for the responses. This is exactly what I had to resort to. I will definitely put in a feature request to get the generated ID back from the extract request.

I am doing this with PHP cURL for extraction and pecl php solr for querying. I then save the unique id and dupe hash in a MySQL table, which I check against after the doc is indexed in Solr. If it is a dupe, I delete the Solr record and discard the file.

My problem now is that the dupe hash sometimes comes back NULL from Solr, although when I check it through Solr Admin it is there. I am working through this now to isolate it.

I had to set Solr to ALLOW duplicates because I have to somehow know that the file is a dupe and then remove the duplicate files on my filesystem. Based on the extract response, I have no way of knowing this if duplicates are disallowed.

-Bill

On Tue, Mar 2, 2010 at 2:11 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: : To quote from the wiki, ...
:
: That's all true ... but Bill explicitly said he wanted to use
: SignatureUpdateProcessorFactory to generate a uniqueKey from the content
: field post-extraction so he could dedup documents with the same content
: ... his question was how to get that key after adding a doc.
:
: Using a unique literal.field value will work -- but only as the value of
: a secondary field that he can then query on to get the uniqueKeyField
: value.
:
: -Hoss
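The bookkeeping Bill describes above (save the id and dupe hash, check the table after indexing, discard on a match) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Bill's actual code: it uses Python with SQLite standing in for PHP and MySQL, and the table and function names (`indexed_docs`, `record_or_flag_duplicate`) are hypothetical. Raising on a missing hash is a design choice made here because a NULL hash, like the one Bill mentions, cannot be compared against anything.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Create the bookkeeping table (SQLite stands in for MySQL here)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexed_docs ("
        " solr_id TEXT PRIMARY KEY,"
        " dupe_hash TEXT NOT NULL)"
    )
    return conn


def record_or_flag_duplicate(conn, solr_id, dupe_hash):
    """Return True if dupe_hash is already recorded, else store it and return False.

    A missing hash raises instead of being treated as "not a duplicate".
    """
    if not dupe_hash:
        raise ValueError("no dupe hash returned for %s" % solr_id)
    row = conn.execute(
        "SELECT solr_id FROM indexed_docs WHERE dupe_hash = ?", (dupe_hash,)
    ).fetchone()
    if row is not None:
        # Caller would delete the Solr record and discard the file.
        return True
    conn.execute(
        "INSERT INTO indexed_docs (solr_id, dupe_hash) VALUES (?, ?)",
        (solr_id, dupe_hash),
    )
    conn.commit()
    return False
```

The first document with a given hash is recorded; any later document with the same hash is reported as a duplicate and is not stored.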
Re: Solr Cell and Deduplication - Get ID of doc
: You could create your own unique ID and pass it in with the
: literal.field=value feature.

By which Lance means you could specify a unique value in a different field from your uniqueKey field, and then query on that field:value pair to get the doc after it's been added -- but that query will only work until some other version of the doc (with some other value) overwrites it. So you'd essentially have to query for the field:value to look up the uniqueKey.

It seems like it should definitely be feasible for the Update RequestHandlers to return the uniqueKeyField values for all the added docs (regardless of whether the key was included in the request, or added by an UpdateProcessor) -- but I'm not sure how that would fit in with the SolrJ API.

Would you mind opening a feature request in Jira?

-Hoss
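The two-step lookup described above might look something like this in Python. It is only a sketch: the host, port, `uprefix`, `fmap.content` and `commit` parameters are copied from Bill's curl command, while the secondary field name `upload_token_s` is a hypothetical field that would have to exist in the schema (for example through a dynamic field). The code only builds the URLs and parses a response body; it does not contact Solr.

```python
import json
from urllib.parse import urlencode

SOLR = "http://localhost:8080/solr"  # host and port as used in the thread
TOKEN_FIELD = "upload_token_s"       # hypothetical secondary field name


def extract_url(token):
    """URL for the extract request, tagging the doc with a caller-chosen token."""
    params = [
        ("uprefix", "attr_"),
        ("fmap.content", "attr_content"),
        ("literal." + TOKEN_FIELD, token),
        ("commit", "true"),
    ]
    return SOLR + "/update/extract?" + urlencode(params)


def lookup_url(token):
    """URL for the follow-up query that fetches only the uniqueKey (id) field."""
    params = [
        ("q", '%s:"%s"' % (TOKEN_FIELD, token)),
        ("fl", "id"),
        ("wt", "json"),
    ]
    return SOLR + "/select?" + urlencode(params)


def id_from_response(body):
    """Pull the id out of a JSON select response, or None if nothing matched."""
    docs = json.loads(body)["response"]["docs"]
    return docs[0]["id"] if docs else None
```

As noted above, the lookup only finds the doc until another version with a different token value overwrites it, so the query should run right after the extract request.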
Re: Solr Cell and Deduplication - Get ID of doc
To quote from the wiki, http://wiki.apache.org/solr/ExtractingRequestHandler

  curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F myfi...@tutorial.html

This runs the extractor on your input file (in this case an HTML file). It then stores the generated document with the id field (the uniqueKey declared in schema.xml) set to 'doc1'. This way, you do not rely on the ExtractingRequestHandler to create a unique key for you. This command throws away that generated key.

On Mon, Mar 1, 2010 at 4:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: : You could create your own unique ID and pass it in with the
: : literal.field=value feature.
:
: By which Lance means you could specify a unique value in a different
: field from your uniqueKey field, and then query on that field:value pair
: to get the doc after it's been added -- but that query will only work
: until some other version of the doc (with some other value) overwrites it.
: So you'd essentially have to query for the field:value to look up the
: uniqueKey.
:
: It seems like it should definitely be feasible for the
: Update RequestHandlers to return the uniqueKeyField values for all the
: added docs (regardless of whether the key was included in the request, or
: added by an UpdateProcessor) -- but I'm not sure how that would fit in with
: the SolrJ API.
:
: Would you mind opening a feature request in Jira?
:
: -Hoss

--
Lance Norskog
goks...@gmail.com
Re: Solr Cell and Deduplication - Get ID of doc
: To quote from the wiki, ...

That's all true ... but Bill explicitly said he wanted to use SignatureUpdateProcessorFactory to generate a uniqueKey from the content field post-extraction so he could dedup documents with the same content ... his question was how to get that key after adding a doc.

Using a unique literal.field value will work -- but only as the value of a secondary field that he can then query on to get the uniqueKeyField value.

-Hoss
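To illustrate why a content-derived key is useful for deduplication, here is a small Python sketch of the general idea: hash the extracted content, so files with the same text get the same key whatever their file names. This is not Solr's implementation. The signatureClass in Bill's config is truncated in the thread, and MD5 is an arbitrary choice made for the example; the function name is hypothetical.

```python
import hashlib


def content_signature(field_values):
    """Hash the given field values, in order, into a hex digest.

    Illustration only: MD5 is an assumption made for this sketch, not
    necessarily the signature class configured in the thread.
    """
    digest = hashlib.md5()
    for value in field_values:
        digest.update(value.encode("utf-8"))
    return digest.hexdigest()


# Two uploads with different file names but identical extracted text.
upload_a = {"attr_stream_name": "report.pdf", "attr_content": "Quarterly results"}
upload_b = {"attr_stream_name": "copy of report.pdf", "attr_content": "Quarterly results"}

sig_a = content_signature([upload_a["attr_content"]])
sig_b = content_signature([upload_b["attr_content"]])
same_key = sig_a == sig_b  # only attr_content feeds the hash, so the keys match
```

Because only attr_content is hashed, the differing attr_stream_name values do not affect the key. The same property is why the key cannot be known before the extraction has run.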
Re: Solr Cell and Deduplication - Get ID of doc
Any thoughts on this? I would like to get the id back in the request after indexing. My initial thought was to do a search to get the docid based on the attr_stream_name after indexing, but now that I reread my message, I mentioned that the attr_stream_name (file_name) may be different, so that is unreliable. My only option is to somehow return the id in the XML response.

Any guidance is greatly appreciated.

-Bill

On Wed, Feb 24, 2010 at 12:06 PM, Bill Engle billengle...@gmail.com wrote:

: Hi -
:
: New Solr user here. I am using Solr Cell to index files (PDF, doc, docx,
: txt, htm, etc.) and there is a good chance that a new file will have
: duplicate content but not necessarily the same file name. To avoid this I
: am using the deduplication feature of Solr.
:
:   <updateRequestProcessorChain name="dedupe">
:     <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
:       <bool name="enabled">true</bool>
:       <str name="signatureField">id</str>
:       <bool name="overwriteDupes">true</bool>
:       <str name="fields">attr_content</str>
:       <str name="signatureClass">org.apache.solr.update.processor.</str>
:     </processor>
:     <processor class="solr.LogUpdateProcessorFactory" />
:     <processor class="solr.RunUpdateProcessorFactory" />
:   </updateRequestProcessorChain>
:
: How do I get the id value post Solr processing? Is there some way to
: modify the curl response so that the id is returned? I need this id because
: I would like to rename the file to the id value. I could probably do a
: Solr search after the fact to get the id field based on the
: attr_stream_name, but I would like to do only one request.
:
:   curl 'http://localhost:8080/solr/update/extract?uprefix=attr_&fmap.content=attr_content&commit=true' -F myfi...@myfile.pdf
:
: Thanks,
: Bill
Re: Solr Cell and Deduplication - Get ID of doc
You could create your own unique ID and pass it in with the literal.field=value feature.

http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters

On Fri, Feb 26, 2010 at 7:56 AM, Bill Engle billengle...@gmail.com wrote:

: Any thoughts on this? I would like to get the id back in the request
: after indexing. My initial thought was to do a search to get the docid
: based on the attr_stream_name after indexing, but now that I reread my
: message, I mentioned that the attr_stream_name (file_name) may be
: different, so that is unreliable. My only option is to somehow return the
: id in the XML response.
:
: Any guidance is greatly appreciated.
:
: -Bill

--
Lance Norskog
goks...@gmail.com
Solr Cell and Deduplication - Get ID of doc
Hi -

New Solr user here. I am using Solr Cell to index files (PDF, doc, docx, txt, htm, etc.) and there is a good chance that a new file will have duplicate content but not necessarily the same file name. To avoid this I am using the deduplication feature of Solr.

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">id</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">attr_content</str>
      <str name="signatureClass">org.apache.solr.update.processor.</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

How do I get the id value post Solr processing? Is there some way to modify the curl response so that the id is returned? I need this id because I would like to rename the file to the id value. I could probably do a Solr search after the fact to get the id field based on the attr_stream_name, but I would like to do only one request.

  curl 'http://localhost:8080/solr/update/extract?uprefix=attr_&fmap.content=attr_content&commit=true' -F myfi...@myfile.pdf

Thanks,
Bill