http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/UniqueKey
On Wed, Dec 7, 2011 at 5:04 PM, Lance Norskog <goks...@gmail.com> wrote:

> Yes, the SignatureUpdateProcessor is what you want. The 128-bit hash is
> exactly what you want to use in this situation. In practice you will never
> get the same ID for two URLs: collisions have never been observed "in the
> wild" for this hash algorithm.
>
> Another cool thing about using hash codes as fields is this: you can give
> the first few letters of the code and a wildcard to get a random subset of
> the index with a given size. For example, 0a0* gives 1/(16^3) of the index.
>
> On Wed, Dec 7, 2011 at 2:48 AM, blaise thomson <blaiset...@yahoo.com> wrote:
>
>> Hi Hoss,
>>
>> Thanks for getting back to me on this.
>>
>> >: I've been trying to use the UUIDField in solr to maintain ids of the
>> >: pages I've crawled with nutch (as per
>> >: http://wiki.apache.org/solr/UniqueKey). The use case is that I want to
>> >: have the server able to use these ids in another database for various
>> >: statistics gathering. So I want the link url to act like a primary key
>> >: for determining if a page exists, and if it doesn't exist to generate a
>> >: new uuid.
>> >
>> > i'm confused ... if you want the URL to be the primary key, then use the
>> > URL as the primary key, why use the UUID Field at all?
>>
>> I do use the URL as the primary key. The thing is that I want to have a
>> fixed-length id for the document so that I can reference it in another
>> database. For example, if I want to count clicks of the url, then I was
>> thinking of using a mysql database along with solr, where each document id
>> has a count of the clicks. I didn't want to use the url itself in that db
>> because of its arbitrary length.
>>
>> >: 2. Looking at the code for UUIDField (relevant bit pasted below), it
>> >: seems that the UUID is just generated randomly. There is no check if
>> >: the generated UUID has already been used.
>> >
>> > correct ... if you specify "NEW" then it generates a new UUID for you --
>> > if you want to update an existing doc with an existing UUID then you need
>> > to send the real, existing, value of the UUID for the doc you are
>> > updating.
>> >
>> >: I can sort of solve this problem by generating the UUID myself, as a
>> >: hash of the link url, but that doesn't help me for those random cases
>> >: when the hash might happen to generate the same UUID.
>> >:
>> >: Does anyone know if there is a way for solr to only add a uuid if the
>> >: document doesn't already exist?
>> >
>> > I don't really understand your second sentence, but based on that first
>> > sentence it sounds like what you want may be to use something like the
>> > SignatureUpdateProcessor to generate a hash based on the URL...
>> >
>> > https://wiki.apache.org/solr/Deduplication
>>
>> I didn't actually know about this, so thanks for sharing. I'm not sure it
>> does exactly what I want though. I think it is more for checking if two
>> docs are the same, which for my purposes the url works fine for.
>>
>> I think I've sort of come to realise that generating a uuid from the url
>> might be the way to go. There is a chance of getting the same uuid from
>> different urls, but it's only 1 in 2^128, so it's basically non-existent.
>>
>> Thanks again,
>> Blaise
>
> --
> Lance Norskog
> goks...@gmail.com

--
Lance Norskog
goks...@gmail.com
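
The approach the thread settles on (a fixed-length ID derived from the URL, plus the wildcard-prefix trick for sampling) can be sketched roughly as follows. This is only an illustration in Python, not code from the thread. MD5 is assumed as the 128-bit hash because the thread does not name one, and the function names url_to_id, sample_fraction and collision_probability are made up for the example. The last function uses the usual birthday-style upper bound: the "1 in 2^128" figure applies to one pair of URLs, and the overall chance grows with the number of pairs in the index.

```python
import hashlib


def url_to_id(url: str) -> str:
    """Return a fixed-length ID (32 hex characters, 128 bits) for a URL.

    Assumption: MD5 stands in for "the 128-bit hash" mentioned in the thread.
    The same URL always maps to the same ID, so re-crawling a page does not
    create a new ID.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def sample_fraction(prefix: str) -> float:
    """Expected share of a hex-ID index matched by the wildcard `prefix*`.

    Each hex character narrows the match by a factor of 16, assuming the
    hash output is uniformly distributed.
    """
    return 1 / 16 ** len(prefix)


def collision_probability(n_docs: int, bits: int = 128) -> float:
    """Upper bound on the chance that any two of n_docs IDs collide.

    Counts the distinct pairs of documents and multiplies by the per-pair
    collision chance of 1 / 2**bits (a union bound, close to exact while
    the result is small).
    """
    pairs = n_docs * (n_docs - 1) // 2
    return pairs / 2 ** bits


if __name__ == "__main__":
    print(url_to_id("http://wiki.apache.org/solr/UniqueKey"))
    print(sample_fraction("0a0"))
    print(collision_probability(10 ** 9))
```

Under these assumptions, the value returned by url_to_id could serve as the key in both Solr and the MySQL click-count table, and a wildcard query for 0a0* on the ID field would be expected to match about 1/4096 of the documents. Even for a billion documents the collision bound stays far below anything that matters in practice.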