[OpenBabel-Devel] Fwd: [openbabel] OBConversion improvements (#191)

Geoffrey Hutchison Mon, 15 Jun 2015 12:47:20 -0700

David Koes contributed a major patch to OBConversion which I'd like to discuss.


I think overall the change is good, and useful for 2.4. There is some potential 
for backwards incompatibility since it's possible to create a crash with this 
new code by deallocating a stream before the OBConversion object.

That said, this has been a touchy area of code and my general review looks like 
an overall win.

I'd like to get explicit feedback from at least 2-3 people before migrating the 
patch and beginning more widespread testing.

Thoughts?
-Geoff

> Begin forwarded message:
> 
> From: David Koes <notificati...@github.com>
> Subject: [openbabel] OBConversion improvements (#191)
> 
> Summary:
> *fixes memory leak 
> *more robust support for gzipped files
> *more convenient methods for opening files (format auto-detection)
> *fix bug with xml reader when reading files with multiple molecules
> 
> I have made every effort to maintain backwards compatibility (even when the 
> existing behaviour was dubious) but there are some unavoidable changes. 
> Specifically, streams passed to a OBConversion should not be deallocated 
> before the OBConversion is deallocated or the stream is replaced.
> 
> Details:
> This was all motivated by a memory leak due to the line ending filter. 
> OBConversion would swap out the rdbuf of the passed input stream with its own 
> filtering rdbuf to normalize all line endings. This means a user-provided 
> data structure is having its internals messed with and there is no way to 
> properly free the filtering streambuf (since OBConversion doesn't control the 
> stream itself). This would cause my long running program that opens millions 
> of files to eventually consume all the memory on the machine and crash. Not 
> acceptable.
> 
> The solution is to not mess with the internals of the passed stream, but wrap 
> the stream with a filtering stream. The full stream is wrapped, not just the 
> underlying streambuf, in order to (more or less) keep the state of the 
> user-provided stream consistent. That is, when OBConversion reads to the end 
> of the file, the good bit of the user-provided steram will also be turned off 
> (some test code depended on this behavior).
> 
> GetInStream/GetOutStream may now return a stream that is not identical to the 
> user-provided stream (since it's the wrapped version). In addition to the 
> newline filter, streams may be wrapped if compression is enabled.
> 
> Compression is now provided by virtue of wrapping the stream (and leaving it 
> that way until all writing/reading is done). GetInStream/GetOutStream provide 
> this wrapped stream and so can be read as if the underlying stream were not 
> compressed.
> 
> FormatFromExt can take a second argument, a bool passed by reference, that 
> will be set to true if the file is gzipped. For example, "sdf.gz" will return 
> the SDF format and set the bool to true.
> 
> SetInAndOutFormat, SetInFormat, SetOutFormat have all been modified to take 
> an optional isgzip parameter which, if true, enables compression. The 
> functions GetInGzipped/GetOutGzipped are also provided. Essentially, the 
> compression state of the file is part of its format (but has to be handled 
> separately due to how OBFormats work).
> 
> So to get and set the full format of a file, including compression state:
> bool isgzipped;
> OBFormat *format = FormatFromExt(fname, isgzipped);
> conv.SetInFormat(format, isgzipped);
> 
> Convert(in,out) has different semantics. It no longer sets the input/output 
> streams with the passed arguments. This is because existing code (e.g. babel) 
> was deallocating these provided streams before the OBConversion object they 
> were passed to. Since Convert consumes all its input stream, it doesn't seem 
> unreasonable for it not to save the streams. By not keeping pointers to the 
> provided streams, the possibility of a use after free error goes away. The 
> constructor, SetInStream/SetOutStream, ReadFile/WriteFile, and Read/Write 
> functions can still be used to set persistent streams.
> 
> Internally, SetInStream/SetOutStream are used to setup/teardown streams, as 
> opposed to the previous approach which was to duplicate (slightly different) 
> blocks of code everywhere this happened (for instance, to check for gzip, set 
> NeedToFree, etc). OBConversion keeps track of what streams it owns the memory 
> to. SetInStream(NULL) or SetOutStream(NULL) will take care of cleaning up the 
> current stream.
> 
> The semantics of the copy constructor have been refined and an assignment 
> operator added. Previously the state was just copied resulting in two objects 
> that both thought they owned the same memory and had to delete it. This only 
> worked because of the implicit assumption that OBConversions were only copied 
> to create AuxConv objects for XML parsing (why these need to be subclasses, I 
> know not) and there's special logic to avoid duplicate deletions in this 
> case. Now the semantics are that only the originating OBConversion retains 
> ownership of any memory it allocated. This eliminates the possibility of a 
> free-after-free, but not a use-after-free (the copy constructor just 
> shouldn't be used).
> 
> ReadFile/WriteFile will now extract the format from the filename if the 
> format is not already set. This includes setting the gzip status of the file. 
> I've also added a convenience constructor to OBConversion that takes 
> filenames. This removes a lot of the boilerplate required for doing simple 
> tasks. For example:
> 
> OBMol mol;
> OBConversion conv(infile);
> while(conv.Read(&mol)) {
> //do stuff
> }
> 
> Gzip support has been improved to work properly on files that are 
> concatenations of gzip files (this is suppose to be a valid thing to do).
> 
> The XML reader would return true to ReadMolecule even if it hadn't read a 
> molecule in a multi-molecule file (because it was absorbing the ending </cml> 
> tag). Now it returns false.
> 
> I suspect these changes will cause some people's code to crash. However, the 
> cause of the crash should be immediately identifiable - the lifetime of the 
> OBConversion object is longer than the lifetime of the input stream passed to 
> it (and the OBConversion destructor crashes trying to flush to the no longer 
> available user-provided stream). It should not be surprising to anyone that 
> leaving a stale pointer to a stream in OBConversion could cause bad things to 
> happen and the fix is easy (call SetInStream(NULL) when done reading the 
> input stream).
> 
> On the flip side, leaving the current memory leak in results in a much less 
> obvious and more difficult to debug crash (and it isn't the developer's 
> fault). I think this change is worth the minimal breakage of bad code that it 
> will entail.
> 
> You can view, comment on, or merge this pull request online at:
> 
>   https://github.com/openbabel/openbabel/pull/191 
> <https://github.com/openbabel/openbabel/pull/191>
> Commit Summary
> 
> checkpoint work on obconversion fix
> this is intensely frustrating
> slight reworking of gzip state, babel fix
> bug fixes
> "fix" eof/bad iostream behavior
> switch lineend stream to work on stream, not streambuf
> fix bug with xmlconversion
> fix xml problem
> another fix
> testing gzip support
> fix xml reading at end
> add test cases
> fix for multifile zip support
> modified 1ubq
> forgot to remove boost includes
> File Changes
> 
> M CMakeLists.txt 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-0> (2)
> M include/openbabel/format.h 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-1> (1)
> M include/openbabel/lineend.h 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-2> (62)
> M include/openbabel/obconversion.h 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-3> (99)
> M src/formats/xml/xml.cpp 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-4> (19)
> M src/obconversion.cpp 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-5> (580)
> M src/obmolecformat.cpp 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-6> (4)
> M src/zipstream.h 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-7> (1)
> M src/zipstreamimpl.h 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-8> (20)
> M test/CMakeLists.txt 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-9> (6)
> M test/charge_mmff94.cpp 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-10> (2)
> A test/files/1ubq.pdb.gz 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-11> (0)
> A test/files/c3.cml 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-12> (231)
> A test/files/c3.cml.gz 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-13> (0)
> A test/files/c4.mol2.gz 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-14> (0)
> A test/files/c5.smi.gz 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-15> (0)
> A test/files/five_obabel.sdf.gz 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-16> (0)
> A test/files/gzip.in 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-17> (37)
> A test/files/many.sdf.gz 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-18> (0)
> A test/files/tencmpds.sdf.gz 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-19> (0)
> A test/gziptest.cpp 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-20> (157)
> A test/multicmltest.cpp 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-21> (95)
> M tools/babel.cpp 
> <https://github.com/openbabel/openbabel/pull/191/files#diff-22> (6)
> Patch Links:
> 
> https://github.com/openbabel/openbabel/pull/191.patch 
> <https://github.com/openbabel/openbabel/pull/191.patch>
> https://github.com/openbabel/openbabel/pull/191.diff 
> <https://github.com/openbabel/openbabel/pull/191.diff>
> —
> Reply to this email directly or view it on GitHub 
> <https://github.com/openbabel/openbabel/pull/191>.
>

------------------------------------------------------------------------------

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel

[OpenBabel-Devel] Fwd: [openbabel] OBConversion improvements (#191)

Reply via email to