Re: [OpenBabel-Devel] Fwd: [openbabel] OBConversion improvements (#191)

Noel O'Boyle Tue, 16 Jun 2015 03:22:22 -0700

I think we need to embrace a certain amount of backwards incompatability
over a few releases in order to get the toolkit where it needs to be, so
I've no particular problem with that. It would be nice to have the build
failures fixed so I can test on Windows though...


On 15 June 2015 at 20:45, Geoffrey Hutchison <[email protected]> wrote:

> David Koes contributed a major patch to OBConversion which I'd like to
> discuss.
>
> I think overall the change is good, and useful for 2.4. There is some
> potential for backwards incompatibility since it's possible to create a
> crash with this new code by deallocating a stream before the OBConversion
> object.
>
> That said, this has been a touchy area of code and my general review looks
> like an overall win.
>
> I'd like to get explicit feedback from at least 2-3 people before
> migrating the patch and beginning more widespread testing.
>
> Thoughts?
> -Geoff
>
> Begin forwarded message:
>
> *From: *David Koes <[email protected]>
> *Subject: **[openbabel] OBConversion improvements (#191)*
>
> Summary:
> *fixes memory leak
> *more robust support for gzipped files
> *more convenient methods for opening files (format auto-detection)
> *fix bug with xml reader when reading files with multiple molecules
>
> I have made every effort to maintain backwards compatibility (even when
> the existing behaviour was dubious) but there are some unavoidable changes.
> Specifically, streams passed to a OBConversion should not be deallocated
> before the OBConversion is deallocated or the stream is replaced.
>
> Details:
> This was all motivated by a memory leak due to the line ending filter.
> OBConversion would swap out the rdbuf of the passed input stream with its
> own filtering rdbuf to normalize all line endings. This means a
> user-provided data structure is having its internals messed with and there
> is no way to properly free the filtering streambuf (since OBConversion
> doesn't control the stream itself). This would cause my long running
> program that opens millions of files to eventually consume all the memory
> on the machine and crash. Not acceptable.
>
> The solution is to not mess with the internals of the passed stream, but
> wrap the stream with a filtering stream. The full stream is wrapped, not
> just the underlying streambuf, in order to (more or less) keep the state of
> the user-provided stream consistent. That is, when OBConversion reads to
> the end of the file, the good bit of the user-provided steram will also be
> turned off (some test code depended on this behavior).
>
> GetInStream/GetOutStream may now return a stream that is not identical to
> the user-provided stream (since it's the wrapped version). In addition to
> the newline filter, streams may be wrapped if compression is enabled.
>
> Compression is now provided by virtue of wrapping the stream (and leaving
> it that way until all writing/reading is done). GetInStream/GetOutStream
> provide this wrapped stream and so can be read as if the underlying stream
> were not compressed.
>
> FormatFromExt can take a second argument, a bool passed by reference, that
> will be set to true if the file is gzipped. For example, "sdf.gz" will
> return the SDF format and set the bool to true.
>
> SetInAndOutFormat, SetInFormat, SetOutFormat have all been modified to
> take an optional isgzip parameter which, if true, enables compression. The
> functions GetInGzipped/GetOutGzipped are also provided. Essentially, the
> compression state of the file is part of its format (but has to be handled
> separately due to how OBFormats work).
>
> So to get and set the full format of a file, including compression state:
> bool isgzipped;
> OBFormat *format = FormatFromExt(fname, isgzipped);
> conv.SetInFormat(format, isgzipped);
>
> Convert(in,out) has different semantics. It no longer sets the
> input/output streams with the passed arguments. This is because existing
> code (e.g. babel) was deallocating these provided streams before the
> OBConversion object they were passed to. Since Convert consumes all its
> input stream, it doesn't seem unreasonable for it not to save the streams.
> By not keeping pointers to the provided streams, the possibility of a use
> after free error goes away. The constructor, SetInStream/SetOutStream,
> ReadFile/WriteFile, and Read/Write functions can still be used to set
> persistent streams.
>
> Internally, SetInStream/SetOutStream are used to setup/teardown streams,
> as opposed to the previous approach which was to duplicate (slightly
> different) blocks of code everywhere this happened (for instance, to check
> for gzip, set NeedToFree, etc). OBConversion keeps track of what streams it
> owns the memory to. SetInStream(NULL) or SetOutStream(NULL) will take care
> of cleaning up the current stream.
>
> The semantics of the copy constructor have been refined and an assignment
> operator added. Previously the state was just copied resulting in two
> objects that both thought they owned the same memory and had to delete it.
> This only worked because of the implicit assumption that OBConversions were
> only copied to create AuxConv objects for XML parsing (why these need to be
> subclasses, I know not) and there's special logic to avoid duplicate
> deletions in this case. Now the semantics are that only the originating
> OBConversion retains ownership of any memory it allocated. This eliminates
> the possibility of a free-after-free, but not a use-after-free (the copy
> constructor just shouldn't be used).
>
> ReadFile/WriteFile will now extract the format from the filename if the
> format is not already set. This includes setting the gzip status of the
> file. I've also added a convenience constructor to OBConversion that takes
> filenames. This removes a lot of the boilerplate required for doing simple
> tasks. For example:
>
> OBMol mol;
> OBConversion conv(infile);
> while(conv.Read(&mol)) {
> //do stuff
> }
>
> Gzip support has been improved to work properly on files that are
> concatenations of gzip files (this is suppose to be a valid thing to do).
>
> The XML reader would return true to ReadMolecule even if it hadn't read a
> molecule in a multi-molecule file (because it was absorbing the ending
> </cml> tag). Now it returns false.
> ------------------------------
>
> I suspect these changes will cause some people's code to crash. However,
> the cause of the crash should be immediately identifiable - the lifetime of
> the OBConversion object is longer than the lifetime of the input stream
> passed to it (and the OBConversion destructor crashes trying to flush to
> the no longer available user-provided stream). It should not be surprising
> to anyone that leaving a stale pointer to a stream in OBConversion could
> cause bad things to happen and the fix is easy (call SetInStream(NULL) when
> done reading the input stream).
>
> On the flip side, leaving the current memory leak in results in a much
> less obvious and more difficult to debug crash (and it isn't the
> developer's fault). I think this change is worth the minimal breakage of
> bad code that it will entail.
> ------------------------------
> You can view, comment on, or merge this pull request online at:
>
>   https://github.com/openbabel/openbabel/pull/191
> Commit Summary
>
>    - checkpoint work on obconversion fix
>    - this is intensely frustrating
>    - slight reworking of gzip state, babel fix
>    - bug fixes
>    - "fix" eof/bad iostream behavior
>    - switch lineend stream to work on stream, not streambuf
>    - fix bug with xmlconversion
>    - fix xml problem
>    - another fix
>    - testing gzip support
>    - fix xml reading at end
>    - add test cases
>    - fix for multifile zip support
>    - modified 1ubq
>    - forgot to remove boost includes
>
> File Changes
>
>    - *M* CMakeLists.txt
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-0> (2)
>    - *M* include/openbabel/format.h
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-1> (1)
>    - *M* include/openbabel/lineend.h
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-2> (62)
>    - *M* include/openbabel/obconversion.h
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-3> (99)
>    - *M* src/formats/xml/xml.cpp
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-4> (19)
>    - *M* src/obconversion.cpp
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-5> (580)
>    - *M* src/obmolecformat.cpp
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-6> (4)
>    - *M* src/zipstream.h
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-7> (1)
>    - *M* src/zipstreamimpl.h
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-8> (20)
>    - *M* test/CMakeLists.txt
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-9> (6)
>    - *M* test/charge_mmff94.cpp
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-10> (2)
>    - *A* test/files/1ubq.pdb.gz
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-11> (0)
>    - *A* test/files/c3.cml
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-12> (231)
>    - *A* test/files/c3.cml.gz
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-13> (0)
>    - *A* test/files/c4.mol2.gz
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-14> (0)
>    - *A* test/files/c5.smi.gz
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-15> (0)
>    - *A* test/files/five_obabel.sdf.gz
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-16> (0)
>    - *A* test/files/gzip.in
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-17> (37)
>    - *A* test/files/many.sdf.gz
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-18> (0)
>    - *A* test/files/tencmpds.sdf.gz
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-19> (0)
>    - *A* test/gziptest.cpp
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-20> (157)
>    - *A* test/multicmltest.cpp
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-21> (95)
>    - *M* tools/babel.cpp
>    <https://github.com/openbabel/openbabel/pull/191/files#diff-22> (6)
>
> Patch Links:
>
>    - https://github.com/openbabel/openbabel/pull/191.patch
>    - https://github.com/openbabel/openbabel/pull/191.diff
>
> —
> Reply to this email directly or view it on GitHub
> <https://github.com/openbabel/openbabel/pull/191>.
>
>
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> OpenBabel-Devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/openbabel-devel
>
>

------------------------------------------------------------------------------

_______________________________________________
OpenBabel-Devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openbabel-devel

Re: [OpenBabel-Devel] Fwd: [openbabel] OBConversion improvements (#191)

Reply via email to