On Oct 24, 2011, at 10:11 AM, Egon Willighagen wrote: > per an itch caused by this Blue Obelisk eXchange question: > > http://blueobelisk.shapado.com/questions/how-can-i-store-cdk-fingerprints-in-a-file > > on FPS support, I have written a writer (no reader yet):
Yay! > https://github.com/egonw/cdk/commit/baf2a383ba07002f97029ff12a2090feca2c2655 > > I most welcome feedback. == Where you have + this.byteLength = this.size/8; + // use the last byte for whatever bits we have left + if (this.byteLength*8 < size) byteLength++; I've been using the equivalent of: this.byteLength = (this.size+7)/8; == This bit of code: + for (int i=0; i<fingerprint.size(); i++) { + if (fingerprint.get(i)) set(i); + } can be improved somewhat by iterating only over the on bits. I think will work: i=0; while ( (i=fingerprint.nextSetBit(i)) != -1 ) { set(i); } == This code to unset() + print[bytePos] = (byte) (print[bytePos] ^ (byte)(0x01 << bitPos)); didn't look correct at first observation. It looks like it flips the rather than unsets. You probably want one of: print[bytePos] = (byte) (print[bytePos] & (byte) (256 - (0x01 << (bitPos+1))); -or print[bytePos] = (byte) (print[bytePos] & ~((byte) (0x01<<bitPos))); Only when I looked at the larger scope did I see that you test to see if the bit is set first, before unsetting it. However, you don't have a test case for that; you never unset() a bit which is 0 to make sure that that test is working. One way to test this case is to change testUnSet() so it has the two lines marked with ">" public void testUnSet() { BitSet set = new BitSet(166); FPSFingerprint fps = new FPSFingerprint(set, 166); Assert.assertFalse(fps.isSet(5)); fps.set(5); Assert.assertTrue(fps.isSet(5)); fps.unset(5); Assert.assertFalse(fps.isSet(5)); > fps.unset(5); > Assert.assertFalse(fps.isSet(5)); } One way to improve the current code is to remove the isInRange test for unset() (line 80), since the "isSet()" call on line 81 will do that anyway. However, this is going to be more efficient, because you don't need to recompute bytePos and bitPos public void unset(int position) { isInRange(position); int bytePos = getBytePosition(position); byte bit = (byte) (0x01 << getBitPosition(position)); byte b = print[bytePos]; if (b & bit) { print[bytePos] = (byte) (b ^ bit); } } == As a deeper question, why do you need public methods to set and unset? Since you already have a fingerprint type, even if it's only a BitSet, then you just need to convert the BitSet to the hex string. The constructor should allocate scratch space, and you have a single method which takes the BitSet and returns the hex string. Instead of having a dedicated FPSFingerprint, you instead have only the FPSWriter. Its constructor makes space for the StringBuffer, and the 'write' implementation doesn't need to first convert the bitset to an FPSFingerprint then convert the FPSFingerprint to a StringBuffer, then convert the StringBuffer to a string. == I didn't realize that hex encoding in Java took much code. You have (and http://stackoverflow.com/questions/332079/in-java-how-do-i-convert-a-byte-array-to-a-string-of-hex-digits-while-keeping-l agrees): @TestMethod("testBitCheckFake0,testBitCheckFake43") + public String toString() { + StringBuffer buffer = new StringBuffer(); + for (int i=0; i<byteLength; i++) { + String toAppend = Integer.toHexString(0xff & print[i]); + if (toAppend.length() == 1) buffer.append('0'); + buffer.append(toAppend); + } + return buffer.toString(); + } Since toString() depends on the buffer, and the buffer is a fixed length, you can pre-allocate the StringBuffer() in the FPSFingerprint constructor, and let it be full of '0's. Then toString() becomes something like: char[] hexArray = {'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'}; public String toString() { for (int i=0; i<byteLength; i++) { b = print[i]; buffer.setCharAt(i*2, hexArray[b >> 4]); buffer.setCharAt(i*2, hexArray[b & 0x0f]); } return buffer.toString(); } This means only one memory allocation during conversion, rather than several. == You have the incorrect ISO date format + SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss"); The 'hh' is for 12 hour time, while 'HH' is for 24 hour time. This should be SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss"); == Do you have a compound identifier? I see you write the output as: + this.output.print(fpsFingerprint.toString()); + this.output.print('\t'); + this.output.println(++counter); > But, I am not entirely sure at this moment if the header is fully > conform the standard... The output looks good, > One thing it does not do yet, is pass along fingerprint parameters, if > any, such as the path length in the above example... thought it would be better with that information. > Also, it serializes BitSet's, which do not have a source field > (neither does the original IMolecule), Not a problem. That's an optional parameter and only meant to improve metadata tracking. > and neither do I currently use > the CDKConstants.TITLE as identifier, but just numbers, like the '1' > above. This is because that identifier field has a rather strict > syntax, while the .TITLE can be any arbitrary String... How is it strict? I suspect you are thinking about the value field for a header line, and not the identifier field for a fingerprint data line. The current restriction on an identifier is that it can't include the "\t", "\n", or "\r" characters, and it should used only bytes in the range 32-127. What do you need to use any of those characters? What do you do with an SD file or other output formats when the identifier contains newlines? Can you do the same here? I want to loosen the restriction on 32-127 and allow a wider range. I don't know if I should let it be any octet, or if I should say it's a UTF-8 encoded string. I think this field should support internationalization, and let people use UTF-8 encoded identifiers. This would help the (to me) uncommon use-case where the identifier include non-ASCII, such as IUPAC names in Japanese. It would mean that some byte strings aren't supported: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences Python solves this with surrogate escapes, but I don't have experience with it. Do you use characters with code points >=128? How? Andrew [email protected] ------------------------------------------------------------------------------ The demand for IT networking professionals continues to grow, and the demand for specialized networking skills is growing even more rapidly. Take a complimentary Learning@Cisco Self-Assessment and learn about Cisco certifications, training, and career opportunities. http://p.sf.net/sfu/cisco-dev2dev _______________________________________________ Cdk-user mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/cdk-user

