Hi Hoss,

Thanks for getting back to me on this.

>: I've been trying to use the UUIDField in solr to maintain ids of the 
>: pages I've crawled with nutch (as per 
>: http://wiki.apache.org/solr/UniqueKey). The use case is that I want to 
>: have the server able to use these ids in another database for various 
>: statistics gathering. So I want the link url to act like a primary key 
>: for determining if a page exists, and if it doesn't exist to generate a 
>: new uuid.
>
>
>i'm confused ... if you want the URL to be the primary key, then use the 
>URL as the primary key, why use the UUID Field at all?

I do use the URL as the primary key. The thing is that I want to have a fixed 
length id for the document so that I can reference it in another database. For 
example, if I want to count clicks of the url, then I was thinking of using a 
mysql database along with solr, where each document id has a count of the 
clicks. I didn't want to use the url itself in that db because of its arbitrary 
length. 
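
To make the idea concrete, here is a rough sketch of the kind of click counter I have in mind. It is only an illustration: sqlite3 stands in for mysql so the snippet runs on its own, and the table, column and function names (clicks, doc_id, n_clicks, record_click, click_count) are made up for the example.

```python
import sqlite3

# sqlite3 is a stand-in for mysql here; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clicks ("
    " doc_id CHAR(36) PRIMARY KEY,"          # fixed-length uuid string, not the url
    " n_clicks INTEGER NOT NULL DEFAULT 0)"
)

def record_click(conn, doc_id):
    """Increment the click count for doc_id, creating the row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO clicks (doc_id, n_clicks) VALUES (?, 0)", (doc_id,)
    )
    conn.execute(
        "UPDATE clicks SET n_clicks = n_clicks + 1 WHERE doc_id = ?", (doc_id,)
    )
    conn.commit()

def click_count(conn, doc_id):
    """Return the stored click count, or 0 if the id has never been clicked."""
    row = conn.execute(
        "SELECT n_clicks FROM clicks WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else 0
```

The point is just that the key column has a fixed width of 36 characters no matter how long the url is.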


>:     2. Looking at the code for UUIDField (relevant bit pasted below), it 
>: seems that the UUID is just generated randomly. There is no check if the 
>: generated UUID has already been used.
>
>
>correct ... if you specify "NEW" then it generates a new UUID for you -- 
>if you want to update an existing doc with an existing UUID then you need 
>to send the real, existing, value of the UUID for the doc you are 
>updating.
>
>
>: I can sort of solve this problem by generating the UUID myself, as a 
>: hash of the link url, but that doesn't help me for those random cases 
>: when the hash might happen to generate the same UUID.
>: 
>: Does anyone know if there is a way for solr to only add a uuid if the 
>: document doesn't already exist?
>
>
>I don't really understand your second sentence, but based on that first 
>sentence it sounds like what you want may be to use something like the 
>SignatureUpdateProcessor to generate a hash based on the URL...
>
>
>https://wiki.apache.org/solr/Deduplication


I didn't actually know about this, so thanks for sharing. I'm not sure it does 
exactly what I want, though. I think it is more for checking whether two docs 
are the same, and for my purposes the url already works fine for that. 

I think I've sort of come to realise that generating a uuid from the url might 
be the way to go. There is a chance of getting the same uuid from different 
urls, but for any given pair it's only about 1 in 2^128, so it's basically non-existent.
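
A minimal sketch of that approach, assuming a standard name-based (version 5, SHA-1) UUID from Python's uuid module rather than anything built into solr. The helper names url_to_uuid and collision_probability are mine. A version 5 UUID fixes 6 of its 128 bits for the version and variant, so the estimate below uses 122 bits, which is slightly more pessimistic than my 2^128 figure.

```python
import uuid

def url_to_uuid(url):
    """Name-based (version 5) UUID: the same url always maps to the same id."""
    return uuid.uuid5(uuid.NAMESPACE_URL, url)

def collision_probability(n_docs, bits=122):
    """Birthday-bound estimate of any collision among n_docs ids.

    Approximates the probability as (number of pairs) / 2**bits, which is
    accurate while the result is much smaller than 1.
    """
    pairs = n_docs * (n_docs - 1) / 2
    return pairs / 2 ** bits
```

Because the id is derived from the url, re-crawling a page gives back the same uuid without having to look anything up first, and even at a billion documents the estimated chance of any collision stays far below anything worth worrying about.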

Thanks again,
Blaise
