[EMAIL PROTECTED] wrote:
> I am using nutch to crawl & index an intranet consisting of an initial
> fixed set of urls (approx. 3000). For my application I need to reference
> some metadata (stored in a database) for each of the original 3000 urls.
>
> Does nutch assign a unique integer id for each starting url in the
> crawldb? If so, does the API allow me to get it? When a search is
> performed can/is this id returned for each 'hit'?
>
Nutch uses the full URL as a unique identifier. If your collection is
relatively small (on the order of a few million docs or less) you can use
MD5Hash.digest(url).halfDigest(), which returns a long value, and with
pretty good confidence it should be unique.

> I want my 'display search results' page to return the nutch results for
> each 'hit' as well as the metadata for the hit url if it is one of the
> original 3000. I'd rather use an integer ID than have to match on the url
> string itself.
>

Nutch doesn't number the URLs, so you will need to somehow map URLs to
integers. You could do this sequentially, but each time you add/remove URLs
from the crawldb you will get different numbers for the same URLs. You could
also use a perfect hash function which maps String to Integer, but even in
this case there is a small probability that existing URLs will be
re-numbered when the URL set changes. The space of int is too small to use
random hashing and hope there are no collisions.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
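
To make the suggestion above concrete, here is a minimal sketch of a 64-bit URL id in the spirit of MD5Hash.digest(url).halfDigest(). It is built on the JDK's MessageDigest rather than on the Nutch class, so the byte order and string encoding (big-endian, UTF-8) are assumptions and the values may not match what Nutch itself produces. The class name UrlIds and its methods are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
final class UrlIds {

    /** First 8 bytes of the MD5 digest of the URL, packed big-endian into a long. */
    static long halfDigest(String url) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            long value = 0;
            for (int i = 0; i < 8; i++) {
                value = (value << 8) | (md5[i] & 0xffL);
            }
            return value;
        } catch (NoSuchAlgorithmException e) {
            // Not expected on a standard JDK, which ships an MD5 provider.
            throw new IllegalStateException(e);
        }
    }

    /** Maps each URL to its id and fails loudly if two distinct URLs share an id. */
    static Map<String, Long> assignIds(List<String> urls) {
        Map<Long, String> seen = new HashMap<>();
        Map<String, Long> ids = new HashMap<>();
        for (String url : urls) {
            long id = halfDigest(url);
            String previous = seen.putIfAbsent(id, url);
            if (previous != null && !previous.equals(url)) {
                throw new IllegalStateException(
                        "id collision: " + previous + " vs " + url);
            }
            ids.put(url, id);
        }
        return ids;
    }

    /** Birthday-bound estimate of a collision among n random ids in a 2^bits space. */
    static double collisionProbability(long n, int bits) {
        double space = Math.pow(2.0, bits);
        return 1.0 - Math.exp(-(double) n * (n - 1) / (2.0 * space));
    }
}
```

Because the id depends only on the URL string, adding or removing other URLs does not renumber existing ones, unlike sequential numbering. The collisionProbability estimate also illustrates the remark about int: for a few million URLs a 32-bit space is practically guaranteed to collide, while a 64-bit space leaves the chance well below one in a million. For only 3,000 URLs a 32-bit hash would collide in roughly 0.1% of cases, so assignIds checks for collisions rather than assuming there are none.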
