Hi, let me start by saying I am sorry it took me so long to answer.

Rajarshi Guha wrote: (>)
> In the past for binary fingerprints the string version of a BitSet was
> sufficient to output fp's to a file
>
> For something like the signature fingerprint - I see it has a method
> to output  a bit version and the raw version. How is the bit version
> generated and what is the bit string length?

I am assuming you mean the signature fingerprinter, as in the class
generating the fingerprint, not the actual representation of it. It
does indeed provide the raw version and a new bit version. I
introduced a few interfaces for different fingerprint
representations, since a dense BitSet is not always what one wants. My
implementation of the bit fingerprint interface does two things:

 1. It provides a sparse representation of which bits are set by
storing the indexes of the set bits. Compared to the approach based on
Java's BitSet, this means that instead of storing an array of bits it
stores an array of integers corresponding to which bits are set. For
the signature fingerprint, each signature String is hashed to an
integer with Java's hashCode method, and that integer is used directly
as the bit index, without folding it down to a shorter length. So
instead of a length of 1024 (a 10-bit index), which is common in CDK,
the index uses the full 32 bits of an int.

 2. It does not store the actual signatures, which the raw fingerprint
does. The raw fingerprint is handy when one wants to go back to the
signatures and use their benefits. However, storing these Strings
takes up memory, and sometimes that is not wanted.

 Note that the sparse representation only makes sense for fingerprints
that are truly sparse, i.e. have very few true bits and many false ones.
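
To make point 1 concrete, here is a minimal sketch of the idea. It is
not the actual CDK code: the class and method names are made up for
illustration, the input Strings are placeholders rather than real
signatures, and fold() only shows what I assume a fixed-length 1024
fingerprint would do with the same hash.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch class, not part of CDK.
final class SparseSignatureBitsSketch {

    // Hash every signature String with String.hashCode() and keep the
    // distinct hash values, sorted, as the indexes of the set bits.
    static int[] setBits(List<String> signatures) {
        TreeSet<Integer> distinct = new TreeSet<>();
        for (String signature : signatures) {
            distinct.add(signature.hashCode());
        }
        int[] bits = new int[distinct.size()];
        int i = 0;
        for (int bit : distinct) {
            bits[i++] = bit;
        }
        return bits;
    }

    // A bit is "true" if its index is stored; the array must be sorted.
    static boolean isSet(int[] sortedBits, int index) {
        return Arrays.binarySearch(sortedBits, index) >= 0;
    }

    // For comparison only (an assumption, not CDK's code): reduce a
    // 32-bit hash to an index in [0, length), e.g. length = 1024.
    static int fold(int hash, int length) {
        return Math.floorMod(hash, length);
    }

    public static void main(String[] args) {
        // Placeholder Strings standing in for signatures.
        List<String> signatures =
                Arrays.asList("[C]([C][O])", "[O]([C])", "[C]([C][O])");
        int[] bits = setBits(signatures);
        System.out.println(bits.length + " set bits: " + Arrays.toString(bits));
        System.out.println("first bit folded to 1024: " + fold(bits[0], 1024));
    }
}
```

Since hashCode returns a signed int, the stored indexes can be
negative; memory use grows with the number of set bits rather than
with the fingerprint length.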

> is it appropriate to dump out a binary version of a signature
> fingerprint? or instead provide the raw values of such an fp?

That depends on what you want. If you are going to use it only for
lookup and don't want to go back to the signature Strings, then yes,
some sort of String representation looking a bit like this probably
makes sense:

  5 45 756 45657 4568721 ...

where the numbers signify which bits are set to true. I did not give
the class / interface such a method, but perhaps it would be good?
(Patches welcome)
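
As a rough sketch of what such a method could look like (hypothetical
names, nothing that exists in CDK today), assuming the set bits are
available as an int array:

```java
import java.util.StringJoiner;

// Hypothetical sketch class, not part of CDK.
final class SparseBitStringSketch {

    // Write the indexes of the set bits separated by single spaces.
    static String format(int[] setBits) {
        StringJoiner joiner = new StringJoiner(" ");
        for (int bit : setBits) {
            joiner.add(Integer.toString(bit));
        }
        return joiner.toString();
    }

    // Read such a line back into an array of bit indexes.
    static int[] parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        String[] parts = trimmed.split("\\s+");
        int[] bits = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bits[i] = Integer.parseInt(parts[i]);
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] bits = {5, 45, 756, 45657, 4568721};
        String line = format(bits);
        System.out.println(line);
        System.out.println(parse(line).length + " bits read back");
    }
}
```

If the indexes come straight from hashCode they can be negative, so a
reader of this format should accept a leading minus sign, which
Integer.parseInt does.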

As for keeping the signatures too, perhaps another String
representation would be good for the RawFingerprint.

Before concluding, let me also say I am sorry I never got around to
writing the blog post Egon suggested I write about the ideas behind
the new Fingerprinter things. Actually, I think this mail has
explained most of it, except maybe that there is now also an
interface for count fingerprints. And now even that is mentioned...
:)

-- 
// Jonathan

_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user
