[
https://issues.apache.org/jira/browse/DROIDS-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717201#action_12717201
]
Mingfai Ma commented on DROIDS-53:
----------------------------------
hashCode is a signed int, so it can be negative. It might be better to use a
non-negative long to represent the signed integer hash code.
{code}
public long unsignInteger(int hashCode) {
    // Zero-extend the 32-bit value so the result lies in [0, 2^32 - 1].
    return hashCode & 0xFFFFFFFFL;
}
{code}
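As a rough sketch of the conversion above: on Java 8 or later, the standard library method Integer.toUnsignedLong performs the same zero-extension, so the helper can be cross-checked against it. The class name UnsignedHashSketch and the sample values are illustrative assumptions and are not part of Droids.

```java
// Hypothetical helper class for illustration; not part of Droids.
final class UnsignedHashSketch {

    // Maps a signed 32-bit hash code onto the range [0, 2^32 - 1].
    static long unsignInteger(int hashCode) {
        return hashCode & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int sample : samples) {
            long id = unsignInteger(sample);
            // Integer.toUnsignedLong (Java 8+) should agree for every int.
            boolean matchesJdk = id == Integer.toUnsignedLong(sample);
            System.out.println(sample + " -> " + id + " (matches JDK: " + matchesJdk + ")");
        }
    }
}
```

The mask keeps the lower 32 bits and clears the sign extension, so every int maps to a distinct non-negative long.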
> Implement a unique hash function for Task ID
> --------------------------------------------
>
> Key: DROIDS-53
> URL: https://issues.apache.org/jira/browse/DROIDS-53
> Project: Droids
> Issue Type: New Feature
> Components: core
> Reporter: Mingfai Ma
>
> For SimpleTaskQueueWithHistory.previous and the CrawlingWorker "// TODO -- make
> the hashvalue for Outlink..." comment:
> after some research, it seems the smallest hash we could easily get is the
> hashCode() of the URL String. It takes only about 16 bytes as a boxed Integer
> on a typical JVM (4 bytes as a primitive int). Note that Java's native hash
> code is not guaranteed to be unique: HashMap and Hashtable fall back on
> equals() when hash codes collide, so a hash-only history can occasionally
> mistake a new URL for a visited one.
> If we need to represent it as a String, we can compress the hashCode in
> base62, for example with the following function:
> {code}
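> public static final char[] baseChars =
>     "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();
> public static String base62(int hashCode) {
>     // Treat the int as unsigned so a negative hash code never yields a negative index.
>     long value = hashCode & 0xFFFFFFFFL;
>     StringBuilder out = new StringBuilder(value == 0 ? "0" : "");
>     while (value != 0) {
>         // Arrays expose length, not size(); digits are appended least significant first.
>         out.append(baseChars[(int) (value % baseChars.length)]);
>         value = value / baseChars.length;
>     }
>     return out.reverse().toString();
> }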
> {code}
> e.g.
> {code}
> assertEquals("c55Ow",
> base62("http://incubator.apache.org/droids/".hashCode()) );
> {code}
> For a 5-character String, that is about 48 bytes.
> It is important to have a short, unique hash function for URLs. Any crawler
> that wants to avoid duplicates unavoidably has to store a URL hash in
> memory or in a data grid (unless it is stored on disk or in a database, at the
> cost of higher lookup time). With a hashCode that takes 16 bytes, that is
> around 15 MB of heap for 1M URLs.
> This task is related to DROIDS-52
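One caveat on the uniqueness assumption in the description: String.hashCode() is a 32-bit value, so distinct strings can collide, and "Aa" and "BB" are a well-known colliding pair. The following is a minimal sketch, assuming a hypothetical class HashCollisionSketch and a hypothetical helper markVisited (neither exists in Droids), of how a history that stores only hash codes would treat the second string as already seen.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not part of Droids.
final class HashCollisionSketch {

    // Records the hash code of 'url'; returns true only if that hash was not seen before.
    static boolean markVisited(Set<Integer> seenHashes, String url) {
        return seenHashes.add(url.hashCode());
    }

    public static void main(String[] args) {
        Set<Integer> seen = new HashSet<>();
        // Two different strings with the same hash code.
        System.out.println("Same hash: " + ("Aa".hashCode() == "BB".hashCode()));
        System.out.println("First add of Aa: " + markVisited(seen, "Aa"));
        // Reported as a duplicate although "BB" was never recorded.
        System.out.println("First add of BB: " + markVisited(seen, "BB"));
    }
}
```

If such false duplicates are unacceptable, one option is to store a wider digest or to keep the full URL for colliding entries; which trade-off fits Droids is left open here.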