http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/UniqueKey

On Wed, Dec 7, 2011 at 5:04 PM, Lance Norskog <goks...@gmail.com> wrote:

> Yes, the SignatureUpdateProcessor is what you want. The 128-bit hash is
> a good fit for this situation. In practice you will not get the same ID
> for two URLs: accidental collisions have never been observed "in the wild"
> for this hash algorithm.
>
> Another cool thing about using hash-codes as fields is this: you can give
> the first few hex digits of the code and a wildcard to get a random subset
> of the index with a given size. For example, 0a0* gives about 1/(16^3),
> i.e. 1/4096, of the index.
>
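
To make the prefix trick above concrete, here is a minimal sketch in Python.
It assumes the ID is the hex form of an MD5 digest of the URL; the example.com
URLs and the helper names are hypothetical and only for illustration. It is
not how Solr evaluates a wildcard query, just the same arithmetic done
client-side.

```python
import hashlib


def hex_id(url: str) -> str:
    """Return a 32-character hex MD5 digest of the URL (a 128-bit hash)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def sample_by_prefix(ids, prefix):
    """Keep only the IDs that start with the given hex prefix, like `prefix*`."""
    return [i for i in ids if i.startswith(prefix)]


def expected_fraction(prefix: str) -> float:
    """Each hex digit in the prefix narrows the sample by a factor of 16."""
    return 16.0 ** -len(prefix)


# Hypothetical URLs, for illustration only.
ids = [hex_id(f"http://example.com/page/{n}") for n in range(16000)]
sample = sample_by_prefix(ids, "a")
print(len(sample), "of", len(ids),
      "matched; expected about", len(ids) * expected_fraction("a"))
```

The sample size is only approximately the expected fraction, since it depends
on how the hashes of the indexed URLs happen to fall.
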
> On Wed, Dec 7, 2011 at 2:48 AM, blaise thomson <blaiset...@yahoo.com> wrote:
>
>> Hi Hoss,
>>
>> Thanks for getting back to me on this.
>>
>> >: I've been trying to use the UUIDField in solr to maintain ids of the
>> >: pages I've crawled with nutch (as per
>> >: http://wiki.apache.org/solr/UniqueKey). The use case is that I want to
>> >: have the server able to use these ids in another database for various
>> >: statistics gathering. So I want the link url to act like a primary key
>> >: for determining if a page exists, and if it doesn't exist to generate a
>> >: new uuid.
>> >
>> >
>> >i'm confused ... if you want the URL to be the primary key, then use the
>> >URL as the primary key, why use the UUID Field at all?
>>
>> I do use the URL as the primary key. The thing is that I want to have a
>> fixed length id for the document so that I can reference it in another
>> database. For example, if I want to count clicks of the url, then I was
>> thinking of using a mysql database along with solr, where each document id
>> has a count of the clicks. I didn't want to use the url itself in that db
>> because of its arbitrary length.
>>
>>
>> >:     2. Looking at the code for UUIDField (relevant bit pasted below), it
>> >: seems that the UUID is just generated randomly. There is no check if the
>> >: generated UUID has already been used.
>> >
>> >
>> >correct ... if you specify "NEW" then it generates a new UUID for you --
>> >if you want to update an existing doc with an existing UUID then you need
>> >to send the real, existing, value of the UUID for the doc you are
>> >updating.
>> >
>> >
>> >: I can sort of solve this problem by generating the UUID myself, as a
>> >: hash of the link url, but that doesn't help me for those random cases
>> >: when the hash might happen to generate the same UUID.
>> >:
>> >: Does anyone know if there is a way for solr to only add a uuid if the
>> >: document doesn't already exist?
>> >
>> >
>> >I don't really understand your second sentence, but based on that first
>> >sentence it sounds like what you want may be to use something like the
>> >SignatureUpdateProcessor to generate a hash based on the URL...
>> >
>> >
>> >https://wiki.apache.org/solr/Deduplication
>>
>>
>> I didn't actually know about this, so thanks for sharing. I'm not sure it
>> does exactly what I want though. I think it is more for checking whether
>> two docs are the same, and for my purposes the url already works fine for
>> that.
>>
>> I think I've sort of come to realise that generating a uuid from the url
>> might be the way to go. There is a chance of getting the same uuid from
>> different urls, but it's only about 1 in 2^128 for any given pair of urls,
>> so it's basically non-existent.
>>
>> Thanks again,
>> Blaise
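
As a rough illustration of the fixed-length ID idea above, here is a small
Python sketch. It assumes MD5 as the 128-bit hash and uses hypothetical helper
names; it is one way to derive such an ID outside Solr, not what UUIDField or
the SignatureUpdateProcessor does internally. The second function estimates
the chance of any accidental collision among n documents with the usual
birthday approximation, which is what matters once the index holds more than
two pages. MD5 is not safe against deliberately constructed collisions, so
this only covers accidental ones.

```python
import hashlib


def url_to_id(url: str) -> str:
    """Map a URL of any length to a fixed 32-character hex ID (128 bits)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def collision_probability(n: int, bits: int = 128) -> float:
    """Approximate chance that any two of n random IDs collide.

    Uses the birthday approximation: number of pairs divided by 2^bits.
    Only meaningful while the result is much smaller than 1.
    """
    pairs = n * (n - 1) // 2
    return pairs / 2 ** bits


# Hypothetical URL, for illustration only.
doc_id = url_to_id("http://example.com/some/very/long/path?with=query&and=more")
print(doc_id, len(doc_id))

# Even a billion crawled pages leave the collision chance negligible.
print(collision_probability(10 ** 9))
```

The 32-character ID fits a fixed-width column (for example CHAR(32)) in the
other database, and the same URL always maps to the same ID.
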
>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
>


-- 
Lance Norskog
goks...@gmail.com
