Mahesh, thanks for your input. I am going to take a close look tomorrow
morning.
MG


On Tue, Nov 12, 2013 at 7:35 PM, Mahesh Joshi <[email protected]> wrote:

> Hi All,
>
> Does anyone have recommendations about this use case and incorporating it
> into opennlp code base? Or is sticking with the code duplication currently
> the best option for users that require the use of real-valued features for
> sequence tagging?
>
> Thanks,
> Mahesh
>
>
>
> On Wed, Oct 30, 2013 at 1:05 PM, Mahesh Joshi <[email protected]> wrote:
>
> > Hi--
> >
> > I am using the opennlp framework for a sequence tagging task, and have
> > written the necessary code to handle input data in the following format:
> >
> > <token1> [<feature1=value> <feature2=value> ...] <tag1>
> > <token2> [<feature1=value> <feature5=value> ...] <tag2>
> > ....
> > ....
> > <tokenX> [<feature10=value> ...] <tagX>
> > <empty-line-to-separate-sentences>
> > [more sentences follow]
> >
> > The problem I am faced with is as follows. opennlp.tools.util.BeamSearch
> > uses the following API on opennlp.maxent.GISModel in the
> > bestSequences(...) method:
> >
> >   public double[] eval(String[] context, double probs[]);
> >
> > In this version of eval(...), the real values for the input features are
> > not used.
> >
> > Instead, if bestSequences can use the following API,
> >
> >    public double[] eval(String[] context, float[] values);
> >
> > it can correctly use the real-valued features parsed from the above
> > format.
> >
> > Currently, I have subclassed BeamSearch and overridden the bestSequences
> > method, which parses the contexts using the RealValueFileEventStream
> > class. However, for the most part, this derived class and the new
> > bestSequences method are a copy of the original, leading to unnecessary
> > code duplication.
> >
> > I am wondering if there is any possibility (and utility) of incorporating
> > this logic into the original BeamSearch code. In particular, would it be
> > acceptable to do something like this (ignoring caching logic for now):
> >
> >         float[] values = RealValueFileEventStream.parseContexts(contexts);
> >
> >         double[] scores = model.eval(contexts, values);
> >
> > RealValueFileEventStream.parseContexts returns null if it cannot parse
> > even a single valid float value from the input contexts. In that case,
> > GISModel.eval(...) will ignore a null values array (using the default
> > value of 1 for every context).
> >
> > One problem with this would be that there would be an implicit convention
> > that context generators will have to know: any context in the format
> > "feature=value" will be considered for parsing a real-valued feature. So
> > '=' becomes a special character in some sense. Note that it won't be a
> > problem if the string following '=' cannot be parsed as a float (the
> > feature just gets the default value of 1). If it is parsed as a valid
> > negative value though, there will be a RuntimeException thrown from
> > RealValueFileEventStream.parseContexts.
> >
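To make the "feature=value" convention concrete, here is a minimal sketch of the behaviour described above. It is an illustrative stand-in, not the OpenNLP source: the class name RealValueContextSketch and the method parseValues are hypothetical, and splitting on the last '=' is an assumption. It only mirrors the three cases from the description (an unparseable value keeps the default of 1, a negative value raises a RuntimeException, and null is returned when no context carries a real value).

```java
import java.util.Arrays;

/** Illustrative stand-in only; hypothetical names, not the OpenNLP source. */
final class RealValueContextSketch {

    /**
     * Returns one weight per context, or null when no context carried a
     * parseable value (so the caller can fall back to binary features).
     */
    static float[] parseValues(String[] contexts) {
        float[] values = new float[contexts.length];
        boolean sawRealValue = false;
        for (int i = 0; i < contexts.length; i++) {
            values[i] = 1f; // default weight for plain or unparseable contexts
            // Assumption: the last '=' separates the feature name from its value.
            int eq = contexts[i].lastIndexOf('=');
            if (eq <= 0 || eq == contexts[i].length() - 1) {
                continue; // no '=', or nothing on one side of it
            }
            float parsed;
            try {
                parsed = Float.parseFloat(contexts[i].substring(eq + 1));
            } catch (NumberFormatException e) {
                continue; // e.g. "suffix=ing": keep the default of 1
            }
            if (parsed < 0) {
                throw new RuntimeException(
                        "Negative feature value in context: " + contexts[i]);
            }
            values[i] = parsed;
            sawRealValue = true;
        }
        return sawRealValue ? values : null;
    }

    public static void main(String[] args) {
        String[] contexts = {"length=3.5", "isCapitalized", "suffix=ing"};
        System.out.println(Arrays.toString(parseValues(contexts)));
    }
}
```

The sketch shows why '=' becomes special: a context generator that emits something like "prev=3" as a purely symbolic feature would silently have it read as a weight of 3.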
> > Another issue is the possible performance hit due to the parseContexts
> > call.
> >
> > Any thoughts on this issue are most welcome. Have other users of opennlp
> > encountered a similar situation, and how was it handled?
> >
> > Thanks!
> >
> > Mahesh
> >
> >
>
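On the code duplication question, one possible shape, sketched under assumptions: if the search loop obtained its scores through a single overridable hook, a subclass would only need to replace that hook instead of copying all of bestSequences. Everything below is hypothetical (ScoringModel, SequenceSearchSketch, RealValuedSearchSketch, score, bestOutcome); it is not the current BeamSearch API, and the search loop is reduced to picking the best outcome for one step.

```java
import java.util.function.Function;

/** Hypothetical stand-in for a model with binary and real-valued eval. */
interface ScoringModel {
    double[] eval(String[] context);
    double[] eval(String[] context, float[] values);
}

/** Hypothetical search skeleton: all scoring goes through one hook. */
class SequenceSearchSketch {
    protected final ScoringModel model;

    SequenceSearchSketch(ScoringModel model) {
        this.model = model;
    }

    /** Default hook: treat every context as a binary feature. */
    protected double[] score(String[] contexts) {
        return model.eval(contexts);
    }

    /** Placeholder for the shared search loop: index of the best outcome. */
    int bestOutcome(String[] contexts) {
        double[] scores = score(contexts);
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }
}

/** Subclass that overrides only the scoring hook, not the search loop. */
class RealValuedSearchSketch extends SequenceSearchSketch {
    private final Function<String[], float[]> valueParser;

    RealValuedSearchSketch(ScoringModel model,
                           Function<String[], float[]> valueParser) {
        super(model);
        this.valueParser = valueParser;
    }

    @Override
    protected double[] score(String[] contexts) {
        // The parser may return null; the model is then expected to fall
        // back to a weight of 1 for every context, as described above.
        float[] values = valueParser.apply(contexts);
        return model.eval(contexts, values);
    }
}
```

With this shape the '=' convention and the parsing cost are opt-in: users who do not construct the real-valued variant never pay for the parse call, and existing context generators are unaffected.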
