Hi all,

This discussion started in the following Jira issue:
https://issues.apache.org/jira/browse/OPENNLP-99

Steven proposed using Iterators instead of a stream-like interface.

Currently we have an EventStream inside maxent, which works like an
iterator but does not implement the java.util.Iterator interface.
In the tools project we came up with the ObjectStream, which is inspired
by the InputStream class but deals with objects instead of bytes. It has
the following methods: read, reset and close.

The current plan is to also use an ObjectStream-like interface as a
replacement for EventStream, but we never finished this.
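
For reference, the three methods named above could be sketched roughly as
follows. This is a minimal sketch, not the actual source from the tools
project: the exact signatures, the IOException in the throws clauses, the
convention that read returns null at the end of the stream, and the
ListObjectStream class are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.List;

// Sketch of the interface described above; the real signatures may differ.
interface ObjectStream<T> {
    // Returns the next object, or null once the stream is exhausted (assumed).
    T read() throws IOException;

    // Repositions the stream at its first object.
    void reset() throws IOException;

    // Releases any underlying resources.
    void close() throws IOException;
}

// Hypothetical in-memory implementation, only here to make the sketch usable.
class ListObjectStream<T> implements ObjectStream<T> {
    private final List<T> items;
    private int position = 0;

    ListObjectStream(List<T> items) {
        this.items = items;
    }

    @Override
    public T read() {
        return position < items.size() ? items.get(position++) : null;
    }

    @Override
    public void reset() {
        position = 0;
    }

    @Override
    public void close() {
        // Nothing to release for an in-memory list.
    }
}
```

A caller typically loops with "while ((obj = stream.read()) != null)" and
closes the stream when done.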

In my opinion we cannot implement the java.util.Iterator interface, because
it does not allow us to handle errors cleanly with checked exceptions. I
personally also believe that java.util.Iterator communicates that it can
simply be used without worrying about severe issues like I/O errors.
To use such an Iterator in a for-each statement, our only option is to throw
unchecked exceptions, which I believe is uncommon and unexpected for most
people who read the code. The javadoc would of course document that, but it
would be easy to forget. Checked exceptions, by contrast, cannot be ignored,
because the compiler forces the programmer to handle them.
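
To make the problem concrete, here is a rough sketch of what an Iterator
over lines would have to do. LineIterator is a hypothetical class written
for this mail, not existing code, and it assumes a Java version that has
java.io.UncheckedIOException (any RuntimeException would serve the same
purpose).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical adapter, for illustration only.
class LineIterator implements Iterator<String> {
    private final BufferedReader reader;
    private String nextLine;
    private boolean loaded = false;

    LineIterator(BufferedReader reader) {
        this.reader = reader;
    }

    @Override
    public boolean hasNext() {
        if (!loaded) {
            try {
                nextLine = reader.readLine();
            } catch (IOException e) {
                // Iterator.hasNext() declares no checked exceptions, so the
                // I/O failure can only leave as an unchecked exception.
                throw new UncheckedIOException(e);
            }
            loaded = true;
        }
        return nextLine != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        loaded = false; // force a fresh read on the next hasNext() call
        return nextLine;
    }
}
```

Nothing in the signature of hasNext or next tells the caller that a disk
error can surface here, and the compiler will not complain if it is never
caught.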

The data we read for training must come from somewhere, usually the disk or
some other storage system. Depending on the source (if it is not memory),
the user has to deal with certain errors and also needs to free resources
again.
In Java that means the data is usually retrieved via a Reader or an
InputStream, both of which should usually be used only inside a
try-catch-finally statement to ensure that the underlying resources are
released in case of an error.
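
The usual pattern looks something like the sketch below. The class and
method names (LineCounter, countLines) are made up for this example; the
point is only the finally block.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper, for illustration only.
class LineCounter {
    // Counts the lines in the source and always closes it, even on failure.
    static int countLines(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        int count = 0;
        try {
            while (reader.readLine() != null) {
                count++;
            }
        } finally {
            // Runs on success and on error, so the resource is released.
            reader.close();
        }
        return count;
    }
}
```

The checked IOException in the signature is what reminds the caller that
something can go wrong here.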

Using an Iterator with unchecked exceptions would somehow hide that from the
user, whereas checked exceptions force the user to deal with it, hopefully
correctly.

There are more good reasons why our ObjectStream isn't bad at all. It can
easily be implemented and used in a thread-safe way, which is harder for an
iterator-like interface, because the separate calls to hasNext and next are
not atomic as a pair, whereas a single call to read can be atomic.

A composed stream could look like this:
1. PlainTextByLineStream
2. LineParsingStream (creates a sample object out of the string line)
3. FeatureGenerationStream
4. Multi-threaded data indexer

The data indexer wants to call the read method of the composed stream from
multiple threads to pull in the training Events faster.
To make this thread safe, the PlainTextByLineStream.read method would be
synchronized. LineParsingStream.read is safe as long as it only calls the
underlying read and keeps everything else on its own stack. The same goes
for the feature generation stream.
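
A rough sketch of the first two stages, to show where the lock goes. The
class names are borrowed from the list above, but the bodies, the
constructors, the whitespace tokenization and the cut-down ObjectStream
interface (read only, null at the end) are assumptions made for this
example. The feature generation stage is left out; it would follow the same
pattern as the parsing stage.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Cut-down stand-in for the stream interface: only read is needed here.
interface ObjectStream<T> {
    T read() throws IOException; // null signals the end of the stream (assumed)
}

// Bottom of the chain: the only stage that touches shared mutable state.
class PlainTextByLineStream implements ObjectStream<String> {
    private final BufferedReader reader;

    PlainTextByLineStream(BufferedReader reader) {
        this.reader = reader;
    }

    @Override
    public synchronized String read() throws IOException {
        // One atomic step: either a line is handed out or null is returned.
        return reader.readLine();
    }
}

// Parsing stage: no fields are modified, all working state is on the stack.
class LineParsingStream implements ObjectStream<String[]> {
    private final ObjectStream<String> lines;

    LineParsingStream(ObjectStream<String> lines) {
        this.lines = lines;
    }

    @Override
    public String[] read() throws IOException {
        String line = lines.read();
        // Splitting on whitespace stands in for real sample parsing.
        return line == null ? null : line.split(" ");
    }
}
```

Several indexer threads can then call read on the LineParsingStream at the
same time; each line is handed to exactly one of them.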

Doing something like this with an iterator-style interface is harder to make
thread safe, because the state can change after hasNext was called, which
means that more locking must be used.
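
As an illustration of that extra locking, consider the sketch below.
LockedIteratorReader and readOrNull are hypothetical names invented for this
mail. Every consumer has to go through one shared lock that covers both
calls, which in effect rebuilds a read method on top of the Iterator.

```java
import java.util.Iterator;

// Hypothetical wrapper, for illustration only.
class LockedIteratorReader<T> {
    private final Iterator<T> iterator;

    LockedIteratorReader(Iterator<T> iterator) {
        this.iterator = iterator;
    }

    // hasNext() and next() must run under the same lock. Without it, another
    // thread could take the last element between the two calls and next()
    // would fail with NoSuchElementException.
    synchronized T readOrNull() {
        return iterator.hasNext() ? iterator.next() : null;
    }
}
```

Synchronizing hasNext and next individually inside the Iterator would not
help, because the race sits between the two calls.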

In the end I simply think that Iterators are good if you do not have to deal
with errors and underlying OS resources, and streams are the Java way when
you sadly have to take all this into account. Using an Iterator for all this
just to be able to use a for-each loop sounds to me like a design that is
made to be abused to circumvent important error handling.

Jörn




