Mahesh,

Thanks for your input. I am going to take a close look tomorrow morning.

MG

On Tue, Nov 12, 2013 at 7:35 PM, Mahesh Joshi <[email protected]> wrote:

> Hi All,
>
> Does anyone have recommendations about this use case and incorporating it
> into the opennlp code base? Or is sticking with the code duplication
> currently the best option for users that require the use of real-valued
> features for sequence tagging?
>
> Thanks,
> Mahesh
>
> On Wed, Oct 30, 2013 at 1:05 PM, Mahesh Joshi <[email protected]> wrote:
>
> > Hi--
> >
> > I am using the opennlp framework for a sequence tagging task, and have
> > written the necessary code to handle input data in the following format:
> >
> > <token1> [<feature1=value> <feature2=value> ...] <tag1>
> > <token2> [<feature1=value> <feature5=value> ...] <tag2>
> > ....
> > ....
> > <tokenX> [<feature10=value> ...] <tagX>
> > <empty-line-to-separate-sentences>
> > [more sentences follow]
> >
> > The problem I am faced with is as follows. opennlp.tools.util.BeamSearch
> > uses the following API on opennlp.maxent.GISModel in the
> > bestSequences(...) method:
> >
> > public double[] eval(String[] context, double probs[]);
> >
> > In this version of eval(...), the real values for the input features are
> > not used.
> >
> > Instead, if bestSequences can use the following API,
> >
> > public double[] eval(String[] context, float[] values);
> >
> > it can correctly use the real-valued features parsed from the above
> > format.
> >
> > Currently, I have subclassed BeamSearch and overridden the bestSequences
> > method, which parses the contexts using the RealValueFileEventStream
> > class. However, for the most part, this derived class and the new
> > bestSequences method are a copy of the original, leading to unnecessary
> > code duplication.
> >
> > I am wondering if there is any possibility (and utility) of incorporating
> > this logic into the original BeamSearch code. In particular, would it be
> > acceptable to do something like this (ignoring caching logic for now):
> >
> > float[] values = RealValueFileEventStream.parseContexts(contexts);
> >
> > double[] scores = model.eval(contexts, values);
> >
> > RealValueFileEventStream.parseContexts returns null if it cannot parse
> > even a single valid float value from the input contexts. In that case,
> > GISModel.eval(...) will ignore a null values array (using the default
> > value of 1 for every context).
> >
> > One problem with this would be that there would be an implicit convention
> > that context generators will have to know: any context in the format
> > "feature=value" will be considered for parsing a real-valued feature. So
> > '=' becomes a special character in some sense. Note that it won't be a
> > problem if the string following '=' cannot be parsed as a float (the
> > feature just gets the default value of 1). If it is parsed as a valid
> > negative value though, there will be a RuntimeException thrown from
> > RealValueFileEventStream.parseContexts.
> >
> > Another issue is the possible performance hit due to the parseContexts
> > call.
> >
> > Any thoughts on this issue are most welcome. Have other users of opennlp
> > encountered a similar situation, and how was it handled?
> >
> > Thanks!
> >
> > Mahesh
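
For reference, the "feature=value" convention described in the quoted message can be sketched roughly as follows. This is a minimal, self-contained illustration and not OpenNLP's code: the class name RealValueContextSketch and the method parseValues are hypothetical stand-ins, and the behavior simply follows the description in the message (default value of 1, null when no context carries a parsable float, a runtime exception on a negative value). Splitting on the last '=' and leaving the input array unmodified are assumptions of this sketch.

```java
import java.util.Arrays;

// Hypothetical sketch of the parsing convention described in the message.
// It does not depend on OpenNLP and is not the library's implementation.
final class RealValueContextSketch {

    /**
     * Returns one value per context, or null if no context carried a
     * parsable float after '='.
     */
    static float[] parseValues(String[] contexts) {
        float[] values = new float[contexts.length];
        boolean sawRealValue = false;

        for (int i = 0; i < contexts.length; i++) {
            values[i] = 1.0f; // default value for every context

            // Assumption: the value follows the last '=' in the context.
            int eq = contexts[i].lastIndexOf('=');
            if (eq <= 0 || eq + 1 >= contexts[i].length()) {
                continue; // no "name=value" shape, keep the default
            }

            float parsed;
            try {
                parsed = Float.parseFloat(contexts[i].substring(eq + 1));
            } catch (NumberFormatException e) {
                continue; // e.g. "w=dog": not a float, keep the default
            }

            if (parsed < 0) {
                // IllegalArgumentException is a RuntimeException.
                throw new IllegalArgumentException(
                        "Negative feature value in context: " + contexts[i]);
            }

            values[i] = parsed;
            sawRealValue = true;
        }

        return sawRealValue ? values : null;
    }

    public static void main(String[] args) {
        String[] contexts = {"w=dog", "len=3.0", "cap", "freq=0.25"};
        // prints [1.0, 3.0, 1.0, 0.25]
        System.out.println(Arrays.toString(parseValues(contexts)));
    }
}
```

In the proposal above, an array like this would be passed as the second argument to eval(String[] context, float[] values), and a null result would leave every context at the default value of 1. The sketch also makes the cost concern concrete: every context string is scanned and possibly parsed on each call.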
