[ 
https://issues.apache.org/jira/browse/DROIDS-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717201#action_12717201
 ] 

Mingfai Ma commented on DROIDS-53:
----------------------------------

hashCode() returns a signed int, so the value can be negative. It might be better 
to widen it to a long and treat it as an unsigned 32-bit value. 

{code}
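// Reinterprets a signed 32-bit hash code as a non-negative long in [0, 2^32).
public long unsignInteger(int hashCode) {
    long id = hashCode;       // widening sign-extends to 64 bits
    id = (id << 32) >>> 32;   // clear the upper 32 bits (same as hashCode & 0xFFFFFFFFL)
    return id;
}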
{code}
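
As a quick check of the conversion, here is a minimal sketch that compares the shift-based version with a mask-based one and with Integer.toUnsignedLong (available from Java 8). The class name UnsignedHash and the method maskInteger are made up for illustration and are not part of Droids.

```java
public class UnsignedHash {
    // Shift-based conversion, as in the snippet above.
    static long unsignInteger(int hashCode) {
        long id = hashCode;          // sign-extends to 64 bits
        return (id << 32) >>> 32;    // drop the upper 32 bits
    }

    // Hypothetical mask-based variant; it should give the same result.
    static long maskInteger(int hashCode) {
        return hashCode & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        int negative = -1;
        System.out.println(unsignInteger(negative));          // 4294967295
        System.out.println(maskInteger(negative));            // 4294967295
        System.out.println(Integer.toUnsignedLong(negative)); // 4294967295
        System.out.println(unsignInteger(42));                // 42
    }
}
```

Non-negative hash codes pass through unchanged, so only the negative half of the int range is remapped.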

> Implement a unique hash function for Task ID
> --------------------------------------------
>
>                 Key: DROIDS-53
>                 URL: https://issues.apache.org/jira/browse/DROIDS-53
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>            Reporter: Mingfai Ma
>
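> For SimpleTaskQueueWithHistory.previous and the CrawlingWorker "// TODO -- make 
> the hashvalue for Outlink..." comment: 
> after some research, it seems the smallest hash that we could easily get is the 
> hashCode() of the URL String. It takes only 16 bytes (roughly the footprint of a 
> boxed Integer; the int itself is 4 bytes). Note that Java's native hash code is 
> not guaranteed to be unique: HashMap and Hashtable resolve collisions with 
> equals(), so distinct URLs can occasionally share a hash code.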
> If we need to represent it as a String, we can encode the hashCode in 
> base62, for example with the following function:
> {code}
>   public static final char[] baseChars = 
> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();
>   public static String base62(int hashCode) {
>     // Treat the hash as unsigned so a negative value cannot produce a negative index.
>     long value = hashCode & 0xFFFFFFFFL;
>     StringBuilder out = new StringBuilder(value == 0 ? "0" : "");
>     while (value != 0) {
>       // Digits are appended least significant first, then reversed below.
>       out.append(baseChars[(int) (value % baseChars.length)]);
>       value = value / baseChars.length;
>     }
>     return out.reverse().toString();
>   }
> {code}
> e.g.
> {code}
> assertEquals("c55Ow", 
> base62("http://incubator.apache.org/droids/".hashCode()) );
> {code}
> As a 5-character String, it takes about 48 bytes.
> It is important to get a short, unique hash function for URLs. Any crawler 
> that does not want duplicates unavoidably has to store a URL hash in 
> memory or in a data grid (unless it is stored on disk or in a database, at the 
> cost of higher lookup time). With a hashCode that takes 16 bytes, that is around 
> 15 MB of heap for 1M URLs. 
> This task is related to DROIDS-52.
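
On the uniqueness point in the description, here is a small sketch of what can happen when only the int hash code is kept as history. The URLs are made up for illustration; they rely on the fact that "Aa" and "BB" have the same String.hashCode(). The class name HashCollision and the method seenBefore are assumptions, not Droids code.

```java
import java.util.HashSet;
import java.util.Set;

public class HashCollision {
    // Hypothetical history check that stores only the hash code of each URL.
    static boolean seenBefore(Set<Integer> history, String url) {
        // Set.add returns false when the value is already present.
        return !history.add(url.hashCode());
    }

    public static void main(String[] args) {
        // Example URLs (assumed): same prefix, with suffixes "Aa" and "BB" that collide.
        String first = "http://example.org/Aa";
        String second = "http://example.org/BB";

        Set<Integer> history = new HashSet<>();
        System.out.println(seenBefore(history, first));   // false
        System.out.println(seenBefore(history, second));  // true, although the URLs differ
        System.out.println(first.equals(second));         // false
    }
}
```

If a rare false duplicate is not acceptable, one option would be a wider hash or keeping the URL itself for an equals() check, at the cost of more memory per entry.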

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
