Re: [Cdk-user] Patch for FPS write support in the CDK

Andrew Dalke Mon, 24 Oct 2011 09:29:15 -0700

On Oct 24, 2011, at 10:11 AM, Egon Willighagen wrote:
> per an itch caused by this Blue Obelisk eXchange question:
> 
> http://blueobelisk.shapado.com/questions/how-can-i-store-cdk-fingerprints-in-a-file
> 
> on FPS support, I have written a writer (no reader yet):


Yay!


> https://github.com/egonw/cdk/commit/baf2a383ba07002f97029ff12a2090feca2c2655
> 
> I most welcome feedback.

== Where you have

+        this.byteLength = this.size/8;
+        // use the last byte for whatever bits we have left
+        if (this.byteLength*8 < size) byteLength++;

I've been using the equivalent of:

  this.byteLength = (this.size+7)/8;


== This bit of code:

+      for (int i=0; i<fingerprint.size(); i++) {
+            if (fingerprint.get(i)) set(i);
+        }

can be improved somewhat by iterating only over the on bits. I think will work:

 i=0;
 while ( (i=fingerprint.nextSetBit(i)) != -1 ) {
   set(i);
 }



== This code to unset()

 +          print[bytePos] = (byte) (print[bytePos] ^ (byte)(0x01 << bitPos));

didn't look correct at first observation. It looks like it flips the rather 
than unsets. You probably want one of:

  print[bytePos] = (byte) (print[bytePos] & (byte) (256 - (0x01 << (bitPos+1)));
-or
  print[bytePos] = (byte) (print[bytePos] & ~((byte) (0x01<<bitPos)));

Only when I looked at the larger scope did I see that you test to see if the 
bit is set first, before unsetting it.

However, you don't have a test case for that; you never unset() a bit which is 
0 to make sure that that test is working. One way to test this case is to 
change testUnSet() 
so it has the two lines marked with ">"

    public void testUnSet() {
        BitSet set = new BitSet(166);
        FPSFingerprint fps = new FPSFingerprint(set, 166);
        Assert.assertFalse(fps.isSet(5));
        fps.set(5);
        Assert.assertTrue(fps.isSet(5));
        fps.unset(5);
        Assert.assertFalse(fps.isSet(5));
>        fps.unset(5);
>        Assert.assertFalse(fps.isSet(5));
    }

One way to improve the current code is to remove the isInRange test for unset() 
(line 80), since the "isSet()" call on line 81 will do that anyway.

However, this is going to be more efficient, because you don't need to 
recompute bytePos and bitPos 

 public void unset(int position) {
        isInRange(position);
        int bytePos = getBytePosition(position);
        byte bit = (byte) (0x01 << getBitPosition(position));
        byte b = print[bytePos];
        if (b & bit) {
          print[bytePos] = (byte) (b ^ bit);
        }
 }



== As a deeper question, why do you need public methods to set and unset?

Since you already have a fingerprint type, even if it's only a BitSet, then you 
just need to convert the BitSet to the hex string. The constructor should 
allocate scratch space, and you have a single method which takes the BitSet and 
returns the hex string.

Instead of having a dedicated FPSFingerprint, you instead have only the 
FPSWriter. Its constructor makes space for the StringBuffer, and the 'write' 
implementation doesn't need to first convert the bitset to an FPSFingerprint 
then convert the FPSFingerprint to a StringBuffer, then convert the 
StringBuffer to a string.


== I didn't realize that hex encoding in Java took much code.

You have (and 
http://stackoverflow.com/questions/332079/in-java-how-do-i-convert-a-byte-array-to-a-string-of-hex-digits-while-keeping-l
 agrees):

@TestMethod("testBitCheckFake0,testBitCheckFake43")
+    public String toString() {
+        StringBuffer buffer = new StringBuffer();
+        for (int i=0; i<byteLength; i++) {
+            String toAppend = Integer.toHexString(0xff & print[i]);
+            if (toAppend.length() == 1) buffer.append('0');
+            buffer.append(toAppend);
+        }
+        return buffer.toString();
+    }

Since toString() depends on the buffer, and the buffer
is a fixed length, you can pre-allocate the StringBuffer()
in the FPSFingerprint constructor, and let it be full of '0's.
Then toString() becomes something like:

   char[] hexArray = 
{'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};


   public String toString() {
      for (int i=0; i<byteLength; i++) {
          b = print[i];
          buffer.setCharAt(i*2, hexArray[b >> 4]);
          buffer.setCharAt(i*2, hexArray[b & 0x0f]);
      }
      return buffer.toString();
   }

This means only one memory allocation during conversion,
rather than several.


== You have the incorrect ISO date format

+        SimpleDateFormat formatter = new 
SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss");

The 'hh' is for 12 hour time, while 'HH' is for 24 hour time. This should be

       SimpleDateFormat formatter = new 
SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");


== Do you have a compound identifier?


I see you write the output as:

+        this.output.print(fpsFingerprint.toString());
+        this.output.print('\t');
+        this.output.println(++counter);


> But, I am not entirely sure at this moment if the header is fully
> conform the standard...

The output looks good,

> One thing it does not do yet, is pass along fingerprint parameters, if
> any, such as the path length in the above example...

thought it would be better with that information.

> Also, it serializes BitSet's, which do not have a source field
> (neither does the original IMolecule),

Not a problem. That's an optional parameter and only meant to
improve metadata tracking.

> and neither do I currently use
> the CDKConstants.TITLE as identifier, but just numbers, like the '1'
> above. This is because that identifier field has a rather strict
> syntax, while the .TITLE can be any arbitrary String...

How is it strict? I suspect you are thinking about the value field for a header 
line, and not the identifier field for a fingerprint data line.

The current restriction on an identifier is that it can't include the "\t", 
"\n", or "\r" characters, and it should used only bytes in the range 32-127.

What do you need to use any of those characters?

What do you do with an SD file or other output formats when the identifier 
contains newlines? Can you do the same here?

I want to loosen the restriction on 32-127 and allow a wider range. I don't 
know if I should let it be any octet, or if I should say it's a UTF-8 encoded 
string.

I think this field should support internationalization, and let people use 
UTF-8 encoded identifiers. This would help the (to me) uncommon use-case where 
the identifier include non-ASCII, such as IUPAC names in Japanese.

It would mean that some byte strings aren't supported:
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
Python solves this with surrogate escapes, but I don't have experience with it.

Do you use characters with code points >=128? How?


                                Andrew
                                [email protected]



------------------------------------------------------------------------------
The demand for IT networking professionals continues to grow, and the
demand for specialized networking skills is growing even more rapidly.
Take a complimentary Learning@Cisco Self-Assessment and learn 
about Cisco certifications, training, and career opportunities. 
http://p.sf.net/sfu/cisco-dev2dev
_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user

Re: [Cdk-user] Patch for FPS write support in the CDK

Reply via email to