David Koes contributed a major patch to OBConversion which I'd like to discuss.
I think overall the change is good, and useful for 2.4. There is some potential for backwards incompatibility since it's possible to create a crash with this new code by deallocating a stream before the OBConversion object. That said, this has been a touchy area of code and my general review looks like an overall win. I'd like to get explicit feedback from at least 2-3 people before migrating the patch and beginning more widespread testing. Thoughts? -Geoff > Begin forwarded message: > > From: David Koes <notificati...@github.com> > Subject: [openbabel] OBConversion improvements (#191) > > Summary: > *fixes memory leak > *more robust support for gzipped files > *more convenient methods for opening files (format auto-detection) > *fix bug with xml reader when reading files with multiple molecules > > I have made every effort to maintain backwards compatibility (even when the > existing behaviour was dubious) but there are some unavoidable changes. > Specifically, streams passed to a OBConversion should not be deallocated > before the OBConversion is deallocated or the stream is replaced. > > Details: > This was all motivated by a memory leak due to the line ending filter. > OBConversion would swap out the rdbuf of the passed input stream with its own > filtering rdbuf to normalize all line endings. This means a user-provided > data structure is having its internals messed with and there is no way to > properly free the filtering streambuf (since OBConversion doesn't control the > stream itself). This would cause my long running program that opens millions > of files to eventually consume all the memory on the machine and crash. Not > acceptable. > > The solution is to not mess with the internals of the passed stream, but wrap > the stream with a filtering stream. The full stream is wrapped, not just the > underlying streambuf, in order to (more or less) keep the state of the > user-provided stream consistent. That is, when OBConversion reads to the end > of the file, the good bit of the user-provided steram will also be turned off > (some test code depended on this behavior). > > GetInStream/GetOutStream may now return a stream that is not identical to the > user-provided stream (since it's the wrapped version). In addition to the > newline filter, streams may be wrapped if compression is enabled. > > Compression is now provided by virtue of wrapping the stream (and leaving it > that way until all writing/reading is done). GetInStream/GetOutStream provide > this wrapped stream and so can be read as if the underlying stream were not > compressed. > > FormatFromExt can take a second argument, a bool passed by reference, that > will be set to true if the file is gzipped. For example, "sdf.gz" will return > the SDF format and set the bool to true. > > SetInAndOutFormat, SetInFormat, SetOutFormat have all been modified to take > an optional isgzip parameter which, if true, enables compression. The > functions GetInGzipped/GetOutGzipped are also provided. Essentially, the > compression state of the file is part of its format (but has to be handled > separately due to how OBFormats work). > > So to get and set the full format of a file, including compression state: > bool isgzipped; > OBFormat *format = FormatFromExt(fname, isgzipped); > conv.SetInFormat(format, isgzipped); > > Convert(in,out) has different semantics. It no longer sets the input/output > streams with the passed arguments. This is because existing code (e.g. babel) > was deallocating these provided streams before the OBConversion object they > were passed to. Since Convert consumes all its input stream, it doesn't seem > unreasonable for it not to save the streams. By not keeping pointers to the > provided streams, the possibility of a use after free error goes away. The > constructor, SetInStream/SetOutStream, ReadFile/WriteFile, and Read/Write > functions can still be used to set persistent streams. > > Internally, SetInStream/SetOutStream are used to setup/teardown streams, as > opposed to the previous approach which was to duplicate (slightly different) > blocks of code everywhere this happened (for instance, to check for gzip, set > NeedToFree, etc). OBConversion keeps track of what streams it owns the memory > to. SetInStream(NULL) or SetOutStream(NULL) will take care of cleaning up the > current stream. > > The semantics of the copy constructor have been refined and an assignment > operator added. Previously the state was just copied resulting in two objects > that both thought they owned the same memory and had to delete it. This only > worked because of the implicit assumption that OBConversions were only copied > to create AuxConv objects for XML parsing (why these need to be subclasses, I > know not) and there's special logic to avoid duplicate deletions in this > case. Now the semantics are that only the originating OBConversion retains > ownership of any memory it allocated. This eliminates the possibility of a > free-after-free, but not a use-after-free (the copy constructor just > shouldn't be used). > > ReadFile/WriteFile will now extract the format from the filename if the > format is not already set. This includes setting the gzip status of the file. > I've also added a convenience constructor to OBConversion that takes > filenames. This removes a lot of the boilerplate required for doing simple > tasks. For example: > > OBMol mol; > OBConversion conv(infile); > while(conv.Read(&mol)) { > //do stuff > } > > Gzip support has been improved to work properly on files that are > concatenations of gzip files (this is suppose to be a valid thing to do). > > The XML reader would return true to ReadMolecule even if it hadn't read a > molecule in a multi-molecule file (because it was absorbing the ending </cml> > tag). Now it returns false. > > I suspect these changes will cause some people's code to crash. However, the > cause of the crash should be immediately identifiable - the lifetime of the > OBConversion object is longer than the lifetime of the input stream passed to > it (and the OBConversion destructor crashes trying to flush to the no longer > available user-provided stream). It should not be surprising to anyone that > leaving a stale pointer to a stream in OBConversion could cause bad things to > happen and the fix is easy (call SetInStream(NULL) when done reading the > input stream). > > On the flip side, leaving the current memory leak in results in a much less > obvious and more difficult to debug crash (and it isn't the developer's > fault). I think this change is worth the minimal breakage of bad code that it > will entail. > > You can view, comment on, or merge this pull request online at: > > https://github.com/openbabel/openbabel/pull/191 > <https://github.com/openbabel/openbabel/pull/191> > Commit Summary > > checkpoint work on obconversion fix > this is intensely frustrating > slight reworking of gzip state, babel fix > bug fixes > "fix" eof/bad iostream behavior > switch lineend stream to work on stream, not streambuf > fix bug with xmlconversion > fix xml problem > another fix > testing gzip support > fix xml reading at end > add test cases > fix for multifile zip support > modified 1ubq > forgot to remove boost includes > File Changes > > M CMakeLists.txt > <https://github.com/openbabel/openbabel/pull/191/files#diff-0> (2) > M include/openbabel/format.h > <https://github.com/openbabel/openbabel/pull/191/files#diff-1> (1) > M include/openbabel/lineend.h > <https://github.com/openbabel/openbabel/pull/191/files#diff-2> (62) > M include/openbabel/obconversion.h > <https://github.com/openbabel/openbabel/pull/191/files#diff-3> (99) > M src/formats/xml/xml.cpp > <https://github.com/openbabel/openbabel/pull/191/files#diff-4> (19) > M src/obconversion.cpp > <https://github.com/openbabel/openbabel/pull/191/files#diff-5> (580) > M src/obmolecformat.cpp > <https://github.com/openbabel/openbabel/pull/191/files#diff-6> (4) > M src/zipstream.h > <https://github.com/openbabel/openbabel/pull/191/files#diff-7> (1) > M src/zipstreamimpl.h > <https://github.com/openbabel/openbabel/pull/191/files#diff-8> (20) > M test/CMakeLists.txt > <https://github.com/openbabel/openbabel/pull/191/files#diff-9> (6) > M test/charge_mmff94.cpp > <https://github.com/openbabel/openbabel/pull/191/files#diff-10> (2) > A test/files/1ubq.pdb.gz > <https://github.com/openbabel/openbabel/pull/191/files#diff-11> (0) > A test/files/c3.cml > <https://github.com/openbabel/openbabel/pull/191/files#diff-12> (231) > A test/files/c3.cml.gz > <https://github.com/openbabel/openbabel/pull/191/files#diff-13> (0) > A test/files/c4.mol2.gz > <https://github.com/openbabel/openbabel/pull/191/files#diff-14> (0) > A test/files/c5.smi.gz > <https://github.com/openbabel/openbabel/pull/191/files#diff-15> (0) > A test/files/five_obabel.sdf.gz > <https://github.com/openbabel/openbabel/pull/191/files#diff-16> (0) > A test/files/gzip.in > <https://github.com/openbabel/openbabel/pull/191/files#diff-17> (37) > A test/files/many.sdf.gz > <https://github.com/openbabel/openbabel/pull/191/files#diff-18> (0) > A test/files/tencmpds.sdf.gz > <https://github.com/openbabel/openbabel/pull/191/files#diff-19> (0) > A test/gziptest.cpp > <https://github.com/openbabel/openbabel/pull/191/files#diff-20> (157) > A test/multicmltest.cpp > <https://github.com/openbabel/openbabel/pull/191/files#diff-21> (95) > M tools/babel.cpp > <https://github.com/openbabel/openbabel/pull/191/files#diff-22> (6) > Patch Links: > > https://github.com/openbabel/openbabel/pull/191.patch > <https://github.com/openbabel/openbabel/pull/191.patch> > https://github.com/openbabel/openbabel/pull/191.diff > <https://github.com/openbabel/openbabel/pull/191.diff> > — > Reply to this email directly or view it on GitHub > <https://github.com/openbabel/openbabel/pull/191>. >
------------------------------------------------------------------------------
_______________________________________________ OpenBabel-Devel mailing list OpenBabel-Devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/openbabel-devel