Yes, the SignatureUpdateProcessor is what you want. A 128-bit hash of the
URL is exactly the right fixed-length ID for this situation. In practice you
will never get the same ID for two URLs: accidental collisions have never
been observed "in the wild" for this hash algorithm.
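
To put a rough number on "never", here is a minimal sketch. It assumes an
MD5-style 128-bit digest rendered as 32 hex characters; the hash that
SignatureUpdateProcessor actually applies depends on how it is configured,
and the names url_id and collision_probability are made up for this
illustration. The probability uses the standard birthday approximation.

```python
import hashlib
import math


def url_id(url):
    """Return a fixed-length ID (32 hex chars) for a URL.

    Assumption: MD5 stands in for "a 128-bit hash"; Solr's configured
    signature class may be a different algorithm.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def collision_probability(n_docs, bits=128):
    """Birthday approximation: chance that any two of n_docs IDs collide.

    p ~= 1 - exp(-n(n-1) / (2 * 2^bits)); expm1 keeps precision for tiny p.
    """
    exponent = n_docs * (n_docs - 1) / (2 * 2 ** bits)
    return -math.expm1(-exponent)


if __name__ == "__main__":
    print(url_id("http://wiki.apache.org/solr/UniqueKey"))
    print(collision_probability(10 ** 9))
```

The ID has the same length no matter how long the URL is, so it fits a
fixed-width column in the second database.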

Another cool thing about using hash codes as field values is this: a query
on the first few hex digits of the code plus a wildcard returns a
pseudo-random subset of the index of a predictable size. For example, 0a0*
matches about 1/(16^3) = 1/4096 of the index.
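
The prefix trick can be checked offline with a small sketch. It assumes the
IDs are MD5 hex digests and uses made-up example.com URLs; hex_id and
expected_fraction are hypothetical helper names, and the startswith filter
only imitates what a 0a0* wildcard query would select.

```python
import hashlib


def hex_id(url):
    """Hex digest used as the document ID (assumption: MD5, 32 hex chars)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def expected_fraction(prefix):
    """Each hex digit narrows the match by a factor of 16."""
    return 1 / 16 ** len(prefix)


# Synthetic corpus, purely illustrative.
urls = [f"http://example.com/page/{i}" for i in range(200_000)]

# Local stand-in for a wildcard query such as 0a0* on the ID field.
prefix = "0a0"
sample = [u for u in urls if hex_id(u).startswith(prefix)]

print("matched:", len(sample))
print("expected about:", len(urls) * expected_fraction(prefix))
```

A different prefix of the same length gives a different, non-overlapping
subset of roughly the same size.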

On Wed, Dec 7, 2011 at 2:48 AM, blaise thomson <blaiset...@yahoo.com> wrote:

> Hi Hoss,
>
> Thanks for getting back to me on this.
>
> >: I've been trying to use the UUIDField in solr to maintain ids of the
> >: pages I've crawled with nutch (as per
> >: http://wiki.apache.org/solr/UniqueKey). The use case is that I want to
> >: have the server able to use these ids in another database for various
> >: statistics gathering. So I want the link url to act like a primary key
> >: for determining if a page exists, and if it doesn't exist to generate a
> >: new uuid.
> >
> >
> >i'm confused ... if you want the URL to be the primary key, then use the
> >URL as the primary key, why use the UUID Field at all?
>
> I do use the URL as the primary key. The thing is that I want to have a
> fixed length id for the document so that I can reference it in another
> database. For example, if I want to count clicks of the url, then I was
> thinking of using a mysql database along with solr, where each document id
> has a count of the clicks. I didn't want to use the url itself in that db
> because of its arbitrary length.
>
>
> >: 2. Looking at the code for UUIDField (relevant bit pasted below), it
> >: seems that the UUID is just generated randomly. There is no check if the
> >: generated UUID has already been used.
> >
> >
> >correct ... if you specify "NEW" then it generates a new UUID for you --
> >if you want to update an existing doc with an existing UUID then you need
> >to send the real, existing, value of the UUID for the doc you are
> >updating.
> >
> >
> >: I can sort of solve this problem by generating the UUID myself, as a
> >: hash of the link url, but that doesn't help me for those random cases
> >: when the hash might happen to generate the same UUID.
> >:
> >: Does anyone know if there is a way for solr to only add a uuid if the
> >: document doesn't already exist?
> >
> >
> >I don't really understand your second sentence, but based on that first
> >sentence it sounds like what you want may be to use something like the
> >SignatureUpdateProcessor to generate a hash based on the URL...
> >
> >
> >https://wiki.apache.org/solr/Deduplication
>
>
> I didn't know actually about this, so thanks for sharing. I'm not sure it
> does exactly what I want though. I think it is more for checking if the two
> docs are the same, which for my purposes, the url works fine for.
>
> I think I've sort of come to realise that generating a uuid from the url
> might be the way to go. There is a chance of getting the same uuid from
> different urls, but it's only 1 in 2^128, so it's basically non-existent.
>
> Thanks again,
> Blaise




-- 
Lance Norskog
goks...@gmail.com
