Hi All,

Does anyone have recommendations about this use case and about incorporating
it into the opennlp code base? Or is sticking with the code duplication currently
the best option for users that require the use of real-valued features for
sequence tagging?
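
For concreteness, here is a minimal standalone sketch of the "feature=value"
convention described in the quoted message below. It is not the actual
RealValueFileEventStream code: the class and method names
(RealValueContextSketch, parseValues) are made up for illustration, and
splitting on the last '=' is an assumption on my part.

```java
/** Hypothetical illustration only; not part of opennlp. */
final class RealValueContextSketch {

    private RealValueContextSketch() {}

    /**
     * Returns one value per context, or null if no context carried a
     * parseable real value. Contexts without a usable "=value" suffix
     * keep the default value of 1.
     */
    static float[] parseValues(String[] contexts) {
        float[] values = new float[contexts.length];
        boolean sawRealValue = false;

        for (int i = 0; i < contexts.length; i++) {
            values[i] = 1f; // default weight for every context

            String context = contexts[i];
            int eq = context.lastIndexOf('='); // assumption: last '=' separates the value
            if (eq <= 0 || eq + 1 >= context.length()) {
                continue; // no feature name or no value part
            }

            float parsed;
            try {
                parsed = Float.parseFloat(context.substring(eq + 1));
            } catch (NumberFormatException e) {
                continue; // not a float: keep the default of 1
            }

            if (parsed < 0) {
                // mirrors the behaviour described in the quoted message
                throw new RuntimeException("Negative value in context: " + context);
            }
            values[i] = parsed;
            sawRealValue = true;
        }
        return sawRealValue ? values : null;
    }
}
```

With this sketch, the contexts {"len=2.5", "isCap", "w=foo"} give the values
{2.5, 1, 1}, a set of contexts with no parseable value gives null, and
"len=-1.0" raises a RuntimeException.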

Thanks,
Mahesh



On Wed, Oct 30, 2013 at 1:05 PM, Mahesh Joshi <[email protected]> wrote:

> Hi--
>
> I am using the opennlp framework for a sequence tagging task, and have
> written the necessary code to handle input data in the following format:
>
> <token1> [<feature1=value> <feature2=value> ...] <tag1>
> <token2> [<feature1=value> <feature5=value> ...] <tag2>
> ....
> ....
> <tokenX> [<feature10=value> ...] <tagX>
> <empty-line-to-separate-sentences>
> [more sentences follow]
>
> The problem I am faced with is as follows. opennlp.tools.util.BeamSearch
> uses the following API on opennlp.maxent.GISModel in the bestSequences(...)
> method:
>
>   public double[] eval(String[] context, double probs[]);
>
> In this version of eval(...), the real values for the input features are
> not used.
>
> Instead, if bestSequences can use the following API,
>
>    public double[] eval(String[] context, float[] values);
>
> it can correctly use the real-valued features parsed from the above format.
>
> Currently, I have subclassed BeamSearch, and overridden the bestSequences
> method, which parses the contexts using the RealValueFileEventStream class.
> However, for the most part, this derived class and its new bestSequences
> method are a copy of the original, leading to unnecessary code duplication.
>
> I am wondering if there is any possibility (and utility) of incorporating
> this logic into the original BeamSearch code. In particular would it be
> acceptable to do something like this (ignoring caching logic for now):
>
>         float[] values = RealValueFileEventStream.parseContexts(contexts);
>
>         double[] scores = model.eval(contexts, values);
>
> RealValueFileEventStream.parseContexts returns null if it cannot parse
> even a single valid float value from the input contexts. In that case,
> GISModel.eval(...) will ignore a null values array (using the default value
> of 1 for every context).
>
> One problem with this would be that there would be an implicit convention
> that context generators will have to know: any context in the format
> "feature=value" will be considered for parsing a real-valued feature. So
> '=' becomes a special character in some sense. Note that it won't be a
> problem if the string following '=' cannot be parsed as a float (the
> feature just gets the default value of 1). If it is parsed as a valid
> negative value though, there will be a RuntimeException thrown from
> RealValueFileEventStream.parseContexts.
>
> Another issue is the possible performance hit due to the parseContexts
> call.
>
> Any thoughts on this issue are most welcome. Have other users of opennlp
> encountered a similar situation, and how was it handled?
>
> Thanks!
>
> Mahesh
>
>
