A really hairy token graph case

2014-10-24 Thread Benson Margulies
Consider a case where we have a token which can be subdivided in
several ways. This can happen in German. We'd like to represent this
with positionIncrement/positionLength, but it does not seem possible.

Once the position has moved out from one set of 'subtokens', we see no
way to move it back for the second set of alternatives.

Is this something that was considered?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: A really hairy token graph case

2014-10-24 Thread Will Martin
Hi Benson:

This is the case with n-gramming (though you have a more complicated start
chooser than most, I imagine). Does that help get your ideas unblocked?

Will




Re: A really hairy token graph case

2014-10-24 Thread Benson Margulies
I don't think so ... Let me be specific:

First, consider the case of one 'analysis': an input token maps to a lemma
and a sequence of components.

So, we produce

 surface form
 lemma    PI 0
 comp1    PI 0
 comp2    PI 1
 ...

with PL set appropriately to cover the pieces. All the information is there.
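To make the single-analysis case concrete, here is a minimal sketch in plain Python (not Lucene code) that turns a stream of (term, PI, PL) triples into start/end positions. The term names come from the listing above; the PI of 1 on the surface form, the PL values, and the starting position of -1 are assumptions for illustration. The negative-increment check mirrors Lucene's behavior of rejecting negative position increments.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int  # positionIncrement: how far to advance before this token
    pos_len: int  # positionLength: how many positions this token spans


def to_arcs(tokens):
    """Convert a token stream into (term, start, end) arcs of a position graph."""
    pos = -1  # assumed convention: the first token with PI 1 lands on position 0
    arcs = []
    for t in tokens:
        if t.pos_inc < 0:
            # The stream position can only stay put or move forward.
            raise ValueError("position increment must be >= 0")
        pos += t.pos_inc
        arcs.append((t.term, pos, pos + t.pos_len))
    return arcs


# One analysis: surface form and lemma span both components (assumed PL values).
single_analysis = [
    Token("surface", 1, 2),
    Token("lemma", 0, 2),
    Token("comp1", 0, 1),
    Token("comp2", 1, 1),
]

for arc in to_arcs(single_analysis):
    print(arc)
```

With these assumed values, the surface form and lemma each cover positions 0 to 2, while comp1 and comp2 cover 0 to 1 and 1 to 2, so the components line up underneath the whole word.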

Now, if we have another analysis, we want to 'rewind' position, and deliver
another lemma and another set of components, but, of course, we can't do
that.

The best we could do is something like:

surface form
lemma1   PI 0
lemma2   PI 0
...
lemmaN   PI 0

comp0-1  PI 0
comp1-1  PI 0
...
comp0-N
compM-N

That is, group all the first-components, and all the second-components.

But now the bits and pieces of the compounds are interspersed. Maybe that's
OK.
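One way to see what the interspersing costs is to enumerate the paths through the resulting position graph. The sketch below is plain Python, not Lucene code, and it assumes two analyses of two components each, with names of the form compK-A read as component K of analysis A. That reading of the names, and all PI/PL values, are assumptions for illustration.

```python
from collections import defaultdict


def to_arcs(tokens):
    """Convert (term, pos_inc, pos_len) triples into (term, start, end) arcs."""
    pos = -1  # assumed convention: the first token with PI 1 lands on position 0
    arcs = []
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        arcs.append((term, pos, pos + pos_len))
    return arcs


def paths(arcs, start, end):
    """List every term sequence that leads from position start to position end."""
    by_start = defaultdict(list)
    for term, s, e in arcs:
        by_start[s].append((term, e))

    def walk(node):
        if node == end:
            yield []
            return
        for term, nxt in by_start[node]:
            for rest in walk(nxt):
                yield [term] + rest

    return [" ".join(p) for p in walk(start)]


# Grouped emission: all lemmas, then all first components, then all second ones.
grouped = [
    ("surface", 1, 2),
    ("lemma1", 0, 2),
    ("lemma2", 0, 2),
    ("comp0-1", 0, 1),
    ("comp0-2", 0, 1),
    ("comp1-1", 1, 1),
    ("comp1-2", 0, 1),
]

all_paths = paths(to_arcs(grouped), 0, 2)
for p in all_paths:
    print(p)
```

Under these assumptions the graph contains not only "comp0-1 comp1-1" and "comp0-2 comp1-2" but also the mixed paths "comp0-1 comp1-2" and "comp0-2 comp1-1". The token stream no longer records which components belong to the same analysis, which may or may not matter for a given query.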






RE: A really hairy token graph case

2014-10-24 Thread Will Martin
Benson: I'm in danger of trying to remember CPL's German decompounder and how
we used it. That would be a very unreliable memory.

However, at the link below David and Rupert have a resoundingly informative
discussion about making something similar work for synonyms. It might be worth
reading through the kb info captured there.

https://github.com/OpenSextant/SolrTextTagger/issues/10



