Re: Problem with porter stemming

2016-03-14 Thread Benson Margulies
Stemming is an inherently limited process. It doesn't know about the
word 'news', it just has a rule about 's'.

Some of us sell commercial products that do more complex linguistic
processing that knows about which words are which.

There may be open source implementations of similar technology.


On Mon, Mar 14, 2016 at 12:13 PM, Ahmet Arslan
 wrote:
> Hi Dwaipayan,
>
> Another way is to use KeywordMarkerFilter. Stemmer implementations respect 
> this attribute.
> If you want to supply your own mappings, StemmerOverrideTokenFilter could be 
> used as well.
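
A minimal sketch of the keyword-marking approach described above, assuming
Lucene 5.x-style APIs (package locations shift between versions); the class
name and the word set here are only illustrative:

    import java.util.Arrays;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.util.CharArraySet;

    public final class NoNewsStemAnalyzer extends Analyzer {
      // Words in this set get the KeywordAttribute, so the stemmer leaves them alone.
      private static final CharArraySet NO_STEM =
          new CharArraySet(Arrays.asList("news"), true);

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream sink = new LowerCaseFilter(source);
        sink = new SetKeywordMarkerFilter(sink, NO_STEM); // marks 'news' as a keyword
        sink = new PorterStemFilter(sink);                // skips keyword-marked tokens
        return new TokenStreamComponents(source, sink);
      }
    }

The stop-word filtering the original poster mentions would slot into the same
chain before the stemmer.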
>
> ahmet
>
>
> On Monday, March 14, 2016 4:31 PM, Dwaipayan Roy  
> wrote:
>
>
>
> I am using EnglishAnalyzer with my own stopword list. EnglishAnalyzer uses
> the Porter stemmer (Snowball) to stem the words. But using the
> EnglishAnalyzer, I am getting an erroneous result for 'news': 'news' is
> getting stemmed into 'new'.
>
> Any help would be appreciated.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Text dependent analyzer

2015-04-17 Thread Benson Margulies
If you want tokenization to depend on sentences, and you insist on
being inside Lucene, you have to be a Tokenizer. Your tokenizer can
set an attribute on the token that ends a sentence. Then, downstream,
filters can read ahead to get the full sentence and buffer tokens as
needed.
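
A minimal sketch of that buffering arrangement (not code from this thread): the
tokenizer is assumed to mark sentence-final tokens on the standard FlagsAttribute,
with SENTENCE_END_FLAG being a made-up convention, and the filter reads ahead to
the flag, buffers token states, and then replays them one at a time.

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
    import org.apache.lucene.util.AttributeSource;

    public final class SentenceBufferingFilter extends TokenFilter {
      public static final int SENTENCE_END_FLAG = 1; // assumed convention with the tokenizer

      private final FlagsAttribute flagsAtt = addAttribute(FlagsAttribute.class);
      private final Deque<AttributeSource.State> sentence = new ArrayDeque<>();

      public SentenceBufferingFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (sentence.isEmpty()) {
          // Read ahead to the sentence boundary (or end of stream), buffering token states.
          while (input.incrementToken()) {
            sentence.addLast(captureState());
            if ((flagsAtt.getFlags() & SENTENCE_END_FLAG) != 0) {
              break;
            }
          }
          if (sentence.isEmpty()) {
            return false;
          }
          // Whole-sentence work (e.g. lemmatization) would happen here, on the buffered states.
        }
        restoreState(sentence.removeFirst());
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        sentence.clear();
      }
    }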



On Fri, Apr 17, 2015 at 1:00 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote:
 Hi Hummel,

 There was an effort to bring open-nlp capabilities to Lucene:
 https://issues.apache.org/jira/browse/LUCENE-2899

 Lance was working on it to keep it up-to-date. But, it looks like it is not 
 always best to accomplish all things inside Lucene.
 I personally would do the sentence detection outside of Lucene.

 By the way, I remember there was a way to consume all upstream token stream.

 I think it was consuming all input and injecting one concatenated huge 
 term/token.

 KeywordTokenizer has similar behaviour. It injects a single token.
 http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html

 Ahmet


 On Wednesday, April 15, 2015 3:12 PM, Shay Hummel shay.hum...@gmail.com 
 wrote:
 Hi Ahmet,
 Thank you for the reply.
 That's exactly what I am doing. At the moment, to index a document, I break
 it into sentences, and each sentence is analyzed (lemmatization, stopword
 removal, etc.).
 Now, what I am looking for is a way to create an analyzer (a class which
 extends Lucene's Analyzer). This analyzer will be used for index and query
 processing. It (like the English analyzer) will receive the text and
 produce tokens.
 The API of Analyzer requires implementing createComponents, which does not
 depend on the text being analyzed. This is problematic since, as you know,
 OpenNLP sentence breaking depends on the text it gets (OpenNLP uses the
 model files to provide spans of each sentence and then breaks them).
 Is there a way around it?

 Shay


 On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan iori...@yahoo.com.invalid
 wrote:

 Hi Hummel,

 You can perform sentence detection outside of Solr, using OpenNLP for
 instance, and then feed the sentences to Solr.

 https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect

 Ahmet




 On Tuesday, April 14, 2015 8:12 PM, Shay Hummel shay.hum...@gmail.com
 wrote:
 Hi
 I would like to create a text dependent analyzer.
 That is, *given a string*, the analyzer will:
 1. Read the entire text and break it into sentences.
 2. Each sentence will then be tokenized, have possessives removed, be
 lowercased, have terms marked, and be stemmed.

 The second part is essentially what happens in the English analyzer
 (createComponents). However, this does not depend on the text it receives -
 which is the first part of what I am trying to do.

 So ... How can it be achieved?

 Thank you,

 Shay Hummel

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
Based on reading the same comments you read, I'm pretty doubtful that
Codec.getDefault() is going to work. It seems to me that this
situation renders the FilterCodec a bit hard to use, at least given
the 'every release deprecates a codec' sort of pattern.



On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 How about Codec.getDefault()? It does not necessarily return the
 newest one (if somebody changes the default using Codec.setDefault()), but
 for your use case of wrapping the current default one, it should be fine?

 I have not tried this yet, but there might be a chicken-egg problem:
 - Your codec will have a separate name and be listed in META-INF as service 
 (I assume this). So it gets discovered by the Codec discovery process and is 
 instantiated by that.
 - On loading the Codec framework the call to codec.getDefault() might get in 
 at a time where the codecs are not yet fully initialized (because it will 
 instantiate your codec while loading the META-INF). This happens before the 
 Codec class is itself fully statically initialized, so the default codec 
 might be null...
 So relying on Codec.getDefault() in constructors of filter codecs may not 
 work as expected!

 Maybe try it out, was just an idea :-)

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Thursday, February 12, 2015 2:11 AM
 To: java-user@lucene.apache.org
 Subject: A codec moment or pickle

 I have a class that extends FilterCodec. Written against Lucene 4.9, it uses 
 the
 Lucene49Codec.

 Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is
 read-only in 4.10. Is there some way to code one of these to get 'the default
 codec' and not have to chase versions?

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
Robert,

Let me lay out the scenario.

Hardware has .5T of Index is relatively small. Application profiling
shows a significant amount of time spent codec-ing.

Options as I see them:

1. Use DPF complete with the irritation of having to have this
spurious codec name in the on-disk format that has nothing to do with
the on-disk format.
2. 'Officially' use the standard codec, and then use something like
AOP to intercept and encapsulate it with the DPF or something else
like it -- essentially, a do-it-myself alternative to convincing the
community here that this is a use case worthy of support.
3. Find some way to move a significant amount of the data in question
out of Lucene altogether into something else which fits nicely
together with filling memory with a cache so that the amount of
codeccing drops below the threshold of interest.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
WHOOPS.

First sentence was, until just before I clicked 'send',

Hardware has .5T of RAM. Index is relatively small  (20g) ...


On Thu, Feb 12, 2015 at 4:51 PM, Benson Margulies ben...@basistech.com wrote:
 Robert,

 Let me lay out the scenario.

 Hardware has .5T of Index is relatively small. Application profiling
 shows a significant amount of time spent codec-ing.

 Options as I see them:

 1. Use DPF complete with the irritation of having to have this
 spurious codec name in the on-disk format that has nothing to do with
 the on-disk format.
 2. 'Officially' use the standard codec, and then use something like
 AOP to intercept and encapsulate it with the DPF or something else
 like it -- essentially, a do-it-myself alternative to convincing the
 community here that this is a use case worthy of support.
 3. Find some way to move a significant amount of the data in question
 out of Lucene altogether into something else which fits nicely
 together with filling memory with a cache so that the amount of
 codeccing drops below the threshold of interest.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
On Thu, Feb 12, 2015 at 8:43 AM, Robert Muir rcm...@gmail.com wrote:

 Honestly I don't agree. I don't know what you are trying to do, but if
 you want file format backwards compat working, then you need a
 different FilterCodec to match each lucene codec.

 Otherwise your codec is broken from a back compat standpoint. Wrapping
 the latest is an antipattern here.


I understand this logic. It leaves me wandering between:

1: My old desire to convince you that there should be a way to do
DirectPostingsFormat's caching without being a codec at all. Unfortunately,
I got dragged away from the benchmarking that might have been persuasive.

2: The problem of deprecation. I give someone a jar-of-code that works fine
with Lucene 4.9. It does not work with 4.10. Now, maybe the answer here is
that the codec deprecation is fundamental to the definition of moving from
4.9 to 4.10, so having a codec means that I'm really married to a process
of making releases that mirror Lucene releases.
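
For reference, a minimal sketch of the per-release wrapping Robert describes
above: one FilterCodec subclass pinned to one concrete Lucene codec, here 4.10,
with DirectPostingsFormat (the 'DPF' above) swapped in. The class name is
illustrative, and the codec name would still have to be registered in
META-INF/services/org.apache.lucene.codecs.Codec.

    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene410.Lucene410Codec;
    import org.apache.lucene.codecs.memory.DirectPostingsFormat;

    public final class Direct410Codec extends FilterCodec {
      // Everything except the postings format is delegated to the wrapped 4.10 codec.
      private final PostingsFormat direct = new DirectPostingsFormat();

      public Direct410Codec() {
        super("Direct410Codec", new Lucene410Codec());
      }

      @Override
      public PostingsFormat postingsFormat() {
        return direct;
      }
    }

A new subclass (and a new on-disk codec name) would be needed for each Lucene
release, which is exactly the irritation described in option 1.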






 On Thu, Feb 12, 2015 at 5:33 AM, Benson Margulies ben...@basistech.com
 wrote:
  Based on reading the same comments you read, I'm pretty doubtful that
  Codec.getDefault() is going to work. It seems to me that this
  situation renders the FilterCodec a bit hard to use, at least given
  the 'every release deprecates a codec' sort of pattern.
 
 
 
  On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:
  Hi,
 
  How about Codec.getDefault()? It does indeed not necessarily return the
 newest one (if somebody changes the default using Codec.setDefault()), but
 for your use case wrapping the current default one, it should be fine?
 
  I have not tried this yet, but there might be a chicken-egg problem:
  - Your codec will have a separate name and be listed in META-INF as
 service (I assume this). So it gets discovered by the Codec discovery
 process and is instantiated by that.
  - On loading the Codec framework the call to codec.getDefault() might
 get in at a time where the codecs are not yet fully initialized (because it
 will instantiate your codec while loading the META-INF). This happens
 before the Codec class is itself fully statically initialized, so the
 default codec might be null...
  So relying on Codec.getDefault() in constructors of filter codecs may
 not work as expected!
 
  Maybe try it out, was just an idea :-)
 
  Uwe
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Benson Margulies [mailto:bimargul...@gmail.com]
  Sent: Thursday, February 12, 2015 2:11 AM
  To: java-user@lucene.apache.org
  Subject: A codec moment or pickle
 
  I have a class that extends FilterCodec. Written against Lucene 4.9,
 it uses the
  Lucene49Codec.
 
  Dropped into a copy of Solr with Lucene 4.10, it discovers that this
 codec is
  read-only in 4.10. Is there some way to code one of these to get 'the
 default
  codec' and not have to chase versions?
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




A codec moment or pickle

2015-02-11 Thread Benson Margulies
I have a class that extends FilterCodec. Written against Lucene 4.9,
it uses the Lucene49Codec.

Dropped into a copy of Solr with Lucene 4.10, it discovers that this
codec is read-only in 4.10. Is there some way to code one of these to
get 'the default codec' and not have to chase versions?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



A really hairy token graph case

2014-10-24 Thread Benson Margulies
Consider a case where we have a token which can be subdivided in
several ways. This can happen in German. We'd like to represent this
with positionIncrement/positionLength, but it does not seem possible.

Once the position has moved out from one set of 'subtokens', we see no
way to move it back for the second set of alternatives.

Is this something that was considered?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A really hairy token graph case

2014-10-24 Thread Benson Margulies
I don't think so ... Let me be specific:

First, consider the case of one 'analysis': an input token maps to a lemma
and a sequence of components.

So, we produce

 surface form
 lemma    PI 0
 comp1    PI 0
 comp2    PI 1
 .

with PL set appropriately to cover the pieces. All the information is there.

Now, if we have another analysis, we want to 'rewind' position, and deliver
another lemma and another set of components, but, of course, we can't do
that.

The best we could do is something like:

surface form
lemma1   PI 0
lemma2   PI 0
...
lemmaN   PI 0

comp0-1  PI 0
comp1-1  PI 0
...
comp0-N
compM-N

That is, group all the first-components, and all the second-components.

But now the bits and pieces of the compounds are interspersed. Maybe that's
OK.


On Fri, Oct 24, 2014 at 5:44 PM, Will Martin wmartin...@gmail.com wrote:

 HI Benson:

 This is the case with n-gramming (though you have a more complicated start
 chooser than most I imagine).  Does that help get your ideas unblocked?

 Will

 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Friday, October 24, 2014 4:43 PM
 To: java-user@lucene.apache.org
 Subject: A really hairy token graph case

 Consider a case where we have a token which can be subdivided in several
 ways. This can happen in German. We'd like to represent this with
 positionIncrement/positionLength, but it does not seem possible.

 Once the position has moved out from one set of 'subtokens', we see no way
 to move it back for the second set of alternatives.

 Is this something that was considered?

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Why does this search fail?

2014-08-27 Thread Benson Margulies
Does google actually support *?



On Wed, Aug 27, 2014 at 9:54 AM, Milind mili...@gmail.com wrote:

 I see.  This is going to be extremely difficult to explain to end users.
 It doesn't work as they would expect.  Some of the tokenizing rules are
 already somewhat confusing.  Their expectation is that it should work the
 way their searches work in Google.

 It's difficult enough to recognize that because the period is surrounded by
 a digit and a letter (as opposed to 2 digits or 2 letters), it gets
 tokenized.  So I'd have expected that C0001.DevNm00* would effectively
 become a search for C0001 OR DevNm00*.  But now, because of the presence of
 the wildcard, it's considered as 1 term and the period is not a separator.
 That's actually good, but the fact that it's still indexed as 2 terms
 makes wildcard searches very unintuitive.  I don't suppose
 that I can do anything about making wildcard search use multiple terms
 joined together by a separator.  But is there any way that I can force it
 to go through an analyzer prior to doing the search?




 On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  Sorry, but you can only use a wildcard on a single term. C0001.DevNm001
  gets indexed as two terms, c0001 and devnm001, so your wildcard won't
  match any term (at least in this case.)
 
  Also, if your query term includes a wildcard, it will not be fully
  analyzed. Some filters such as lower case are defined as multi-term, so
  they will be performed, but the standard tokenizer is not being called,
 so
  the dot remains and this whole term is treated as one term, unlike the
  index analysis.
 
  -- Jack Krupansky
 
  -Original Message- From: Milind
  Sent: Tuesday, August 26, 2014 12:24 PM
  To: java-user@lucene.apache.org
  Subject: Why does this search fail?
 
 
  I have a field with the value C0001.DevNm001.  If I search for
 
 C0001.DevNm001 -- Get Hit
 DevNm00*   -- Get Hit
 C0001.DevNm00*  -- Get No Hit
 
  The field gets tokenized on the period since it's surrounded by a letter
  and a number.  The query gets evaluated as a prefix query.  I'd have
  thought that this should have found the document.  Any clues on why this
  doesn't work?
 
  The full code is below.
 
  Directory theDirectory = new RAMDirectory();
  Version theVersion = Version.LUCENE_47;
  Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
  IndexWriterConfig theConfig =
      new IndexWriterConfig(theVersion, theAnalyzer);
  IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);
 
  String theFieldName = "Name";
  String theFieldValue = "C0001.DevNm001";
  Document theDocument = new Document();
  theDocument.add(new TextField(theFieldName, theFieldValue, Field.Store.YES));
  theWriter.addDocument(theDocument);
  theWriter.close();
 
  String theQueryStr = theFieldName + ":C0001.DevNm00*";
  Query theQuery =
      new QueryParser(theVersion, theFieldName, theAnalyzer).parse(theQueryStr);
  System.out.println(theQuery.getClass() + ", " + theQuery);
  IndexReader theIndexReader = DirectoryReader.open(theDirectory);
  IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
  TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
  theSearcher.search(theQuery, collector);
  ScoreDoc[] theHits = collector.topDocs().scoreDocs;
  System.out.println("Hits found: " + theHits.length);
 
  Output:
 
  class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
  Hits found: 0
 
 
  --
  Regards
  Milind
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 


 --
 Regards
 Milind



Re: searching with stemming

2014-06-09 Thread Benson Margulies
You should construct an analysis chain that does what you need. Read the
source of the relevant analyzer and pick the tokenizer and filter(s) that
you need, and don't include stemming.


On Mon, Jun 9, 2014 at 5:57 AM, Jamie ja...@mailarchiva.com wrote:

 Greetings

 Our app currently uses language specific analysers (e.g. EnglishAnalyzer,
 GermanAnalyzer, etc.). We need an option to disable stemming. What's the
 recommended way to do this? These analyzers do not include an option to
 disable stemming, only a parameter to specify a list words for which
 stemming should not apply. Furthermore, my understanding is that the
 StandardAnalyzer is tied to English specifically. I am trying to avoid
 having to override each of these analyzers with an option to disable
 stemming. Is there a better alternative?

 Much appreciate your consideration.

 Jamie



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: searching with stemming

2014-06-09 Thread Benson Margulies
Are you using Solr? If so you are on the wrong mailing list. If not, why do
you need a non-anonymous analyzer at all?
On Jun 9, 2014 6:55 AM, Jamie ja...@mailarchiva.com wrote:

 To me, it seems strange that these default analyzers don't provide
 constructors that enable one to override stemming, etc.?

 On 2014/06/09, 12:39 PM, Trejkaz wrote:

 On Mon, Jun 9, 2014 at 7:57 PM, Jamie ja...@mailarchiva.com wrote:

 Greetings

 Our app currently uses language specific analysers (e.g. EnglishAnalyzer,
 GermanAnalyzer, etc.). We need an option to disable stemming. What's the
 recommended way to do this? These analyzers do not include an option to
 disable stemming, only a parameter to specify a list words for which
 stemming should not apply.
 Furthermore, my understanding is that the StandardAnalyzer is tied to
 English specifically.




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: searching with stemming

2014-06-09 Thread Benson Margulies
Analyzer classes are optional; an analyzer is just a factory for a set of
token stream components. You can usually do just fine with an anonymous
class. Or, in your case, the only thing different for each language will be
the stop words, so you can have one analyzer class with a language
parameter.
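
A minimal sketch of that idea, assuming Lucene 4.x-style APIs: the chain roughly
mirrors what the language analyzers build, minus the stemmer, and the stop set is
the per-language parameter. The class and method names are only illustrative.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.util.Version;

    public final class NoStemAnalyzers {
      // The stop set is the only per-language piece; there is deliberately no stemmer.
      public static Analyzer noStem(final Version version, final CharArraySet stopWords) {
        return new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new StandardTokenizer(version, reader);
            TokenStream sink = new StandardFilter(version, source);
            sink = new LowerCaseFilter(version, sink);
            sink = new StopFilter(version, sink, stopWords);
            return new TokenStreamComponents(source, sink);
          }
        };
      }
    }

    // e.g. NoStemAnalyzers.noStem(Version.LUCENE_48, EnglishAnalyzer.getDefaultStopSet())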
On Jun 9, 2014 7:02 AM, Jamie ja...@mailarchiva.com wrote:

 I am not using Solr. I am using the default analyzers...

 On 2014/06/09, 12:59 PM, Benson Margulies wrote:

 Are you using Solr? If so you are on the wrong mailing list. If not, why
 do
 you need a non-
 -anonymous analyzer at all.
 On Jun 9, 2014 6:55 AM, Jamie ja...@mailarchiva.com wrote:

  To me, it seems strange that these default analyzers, don't provide
 constructors that enable one to override stemming, etc?

 On 2014/06/09, 12:39 PM, Trejkaz wrote:

  On Mon, Jun 9, 2014 at 7:57 PM, Jamie ja...@mailarchiva.com wrote:

  Greetings

 Our app currently uses language specific analysers (e.g.
 EnglishAnalyzer,
 GermanAnalyzer, etc.). We need an option to disable stemming. What's
 the
 recommended way to do this? These analyzers do not include an option to
 disable stemming, only a parameter to specify a list words for which
 stemming should not apply.
 Furthermore, my understanding is that the StandardAnalyzer is tied to
 English specifically.


  -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Confuse with Kuromoji

2014-04-06 Thread Benson Margulies
You must know what language each text is in, and use an appropriate
analyzer. Some people do this by using a separate field (text_eng,
text_spa, text_jpn). Other people put some extra information at the
beginning of the field, and then make an analyzer that peeks in order to
dispatch to the correct tokenizer.
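
A minimal sketch of the separate-field approach, assuming Lucene 4.x-style APIs;
the field names are just the illustrative ones above.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.es.SpanishAnalyzer;
    import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public final class PerLanguageFields {
      public static Analyzer build() {
        // One field per language; each field gets that language's analyzer.
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("text_eng", new EnglishAnalyzer(Version.LUCENE_47));
        perField.put("text_spa", new SpanishAnalyzer(Version.LUCENE_47));
        perField.put("text_jpn", new JapaneseAnalyzer(Version.LUCENE_47));
        // StandardAnalyzer is the fallback for any field not listed above.
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_47), perField);
      }
    }

At index and query time the application still has to know (or detect) the language
in order to pick the right field.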


On Sat, Apr 5, 2014 at 9:59 PM, j7a42e4fd7...@softbank.ne.jp wrote:

 I am pretty new to Lucene; however, I have no problem understanding what
 it is about.
 My big problem is trying to understand how Kuromoji works. I need to
 implement a search functionality that supports initially English, Spanish
 and Japanese. It doesn't seem to be a problem with the first two, as I can
 just use analyzers-common to index both languages' contents, but when it
 comes to Japanese it has its own analyzer. I couldn't find any clues about
 combining analyzers, so I still don't know if I can combine all languages under
 the same index (which would be ideal, as I expect mixed searches in the
 context of my project) or if I have to detect the language first and then
 index Japanese texts separately (which would be a big disadvantage when it
 comes to mixed searches and future localization expansion).
 I found out about Lucene through Kuromoji; it would be great to find a
 solution that lets me use all the greatness that Lucene offers.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Confuse with Kuromoji

2014-04-06 Thread Benson Margulies
On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat herb.roitb...@orcatec.com wrote:

 Just curious, what are some of the things that people do to properly
 tokenize the queries with mixed language collections?  What do you do with
 mixed language queries?


You can either force the user to tell you the language, or ...

   you can run a language detector. They are less accurate for short
strings, or ...

 you can process it in _all_ of the languages and OR up the results.




 On 4/6/2014 4:51 AM, Benson Margulies wrote:

 You must know what language each text is in, and use an appropriate
 analyzer. Some people do this by using a separate field (text_eng,
 text_spa, text_jpn). Other people put some extra information at the
 beginning of the field, and then make an analyzer that peeks in order to
 dispatch to the correct tokenizer.


 On Sat, Apr 5, 2014 at 9:59 PM, j7a42e4fd7...@softbank.ne.jp wrote:

  I am pretty new with Lucene, however I have not problem understanding
 what
 is about.
 My big problem is trying to understand how Kuromoji works. I need to
 implement a search functinality thats supports initially English, Spanish
 and Japanese. I doesn't seem to be a deal with the two firsts, as I can
 just use the analyzersーcommon to index both languages contents, but when
 it
 comes to Japanese it has it's own analyzer. I could't find any clues
 about
 combining analyzers, so I still don't if I can combine all languages
 under
 the same index (which would be ideal, as I expect mix searches in the
 context of my project) or I have to detect the language first and then
 index Japanese texts separately (what it will be a big disadvantage when
 it
 comes to mixed searches and future localization expansion).
 I found out about Lucene throgh Kuromoji, it will be great to find out a
 solution to be able to use all the greatness that Lucene offers.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Benson Margulies
It sounds like you've been asked to implement Named Entity Recognition.
OpenNLP has some capability here. There are also, um, commercial
alternatives.


On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio ye.pe...@gmail.com wrote:

 On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar geetgang...@gmail.com
 wrote:

 Hi,

  My requirement is that it should have the capability to match multiple words as
  one token. For example, when the user passes a string such as International Business
  Machine logo or IBM logo, it should return International Business Machine as
  one token and logo as one token.

 This is an interesting problem. I suppose that if the user enters
 International Business Machines, possibly with some misspelling, you
 want to find all documents containing IBM - and that if he enters
 the string IBM, you want to find documents which contain the string
 International Business Machines, or even only parts of it. So this
 means you need some kind of map relating some acronyms with their
 content parts. There really are two directions here: acronym to
 content and content to acronym.

 One cannot find what an acronym means without some kind of acronym
 dictionary. This means that whatever approach you intend to use, there
 should be an external dictionary involved, which, for each acronym,
 would map a list of possible phrases. Retrieving all phrases matching
 the inputted acronym, you'd inject each part of each phrase as a token
 (removing possible duplicates between phrase parts). That's basically
 it for the direction acronym to content.

 The direction content to acronym is trickier, I believe. One way is
 to generate a second (reversed) map, matching each acronym content
 part to a list of acronyms containing that part. You'd simply inject
 acronyms (and possibly other things) if one part of their content is
 matched (or more than one part, if you want to increase relevance).
 This could however possibly require the definition of a specific
 hashing mechanism, if you want to find approximate (distanced) keys
 (e.g. intenational, with the lacking r, would still find IBM). A
 second way (more coupled to the concept of acronym, so less generic)
 could be to consider that every word starting with a capital letter is
 part of an acronym, buffering sequences of words starting with a
 capital letter, and eventually injecting the resulting acronym, if
 found in the acronym dictionary. This might not be safe, though - the
 user may not have the discipline to capitalize the words being part of
 an acronym (or may even misspell the first letter), or concatenated
 first letters could match an irrelevant acronym (many word sequences
 can give the acronym IBM).

 I do not know whether there already exists some Lucene module which
 processes acronyms, or if someone is working on one. It's definitely
 worth a search though, because writing a good one from scratch could
 mean a few days of work, or more.

 HTH.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: LUCENE-5388 AbstractMethodError

2014-01-30 Thread Benson Margulies
If you are sensitive to things being committed to trunk, that suggests that
you are building your own jars and using the trunk. Are you perfectly sure
that you have built, and are using, a consistent set of jars? It looks as
if you've got some trunk-y stuff and some 4.6.1 stuff.



On Thu, Jan 30, 2014 at 6:51 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Hi Uwe,

 The bug occurred only after LUCENE-5388 was committed to trunk; it looks like
 it's the changes to Analyzer and friends. The full stack trace is not much
 more helpful:

 java.lang.AbstractMethodError
 at
 org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:140)
 at
 io.openindex.lucene.analysis.util.QueryDigest.unigrams(QueryDigest.java:196)
 at
 io.openindex.lucene.analysis.util.QueryDigest.calculate(QueryDigest.java:135)
 at
 io.openindex.solr.handler.QueryDigestRequestHandler.handleRequestBody(QueryDigestRequestHandler.java:56)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
 at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:785)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
 at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:368)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
 at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
 at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:724)

 Here's what happens at the consumer code and where the exception begins:
 TokenStream stream = analyzer.tokenStream(null, new StringReader(input));

 We test trunk with our custom stuff as well, but all our custom stuff is
 nicely built with Maven against the most recent release of Solr and/or
 Lucene. If that stays a problem we may have to build stuff against
 branch_4x instead.

 Thanks,
 Markus

 -Original message-
  From:Uwe Schindler u...@thetaphi.de
  Sent: Thursday 30th January 2014 11:18
  To: java-user@lucene.apache.org
  Subject: RE: LUCENE-5388 AbstractMethodError
 
  Hi,
 
  Can you please post your complete stack trace? I have no idea what
 LUCENE-5388 has to do with that error?
 
  Please make sure that all your Analyzers and all of your Solr
 installation only uses *one set* of Lucene/Solr JAR files from *one*
 version. Mixing Lucene/Solr JARs and mixing with Factories compiled against
 older versions does not work. You have to keep all in sync, and then all
 should be fine.
 
  Uwe
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
   -Original Message-
   From: Markus Jelsma [mailto:markus.jel...@openindex.io]
   Sent: 

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Benson Margulies
If you'd like to join in on the doc, see
https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to grant
you access to push to my fork.


On Wed, Jan 8, 2014 at 5:37 AM, Mindaugas Žakšauskas min...@gmail.com wrote:

 Just for interest, I had a similar problem too, as did other
 people [1]. In my project, I am extending the Tokenizer class and have
 another tokenizer (e.g. ClassicTokenizer) as a delegate.
 Unfortunately, properly overriding all public/protected methods is
 *not* enough, e.g.:

 public void reset() throws IOException {
   super.reset();
   delegate.reset();
 }

 I was still getting the exception of broken read()/close() contract.
 Half a day and *lots* of debugging later, I realized that the exception is
 only thrown when indexing the second document, as the delegate reader
 internally gets replaced with ILLEGAL_STATE_READER after .close() is
 called. My solution to this problem was to make the reset() method
 like this:

 public void reset() throws IOException {
   super.reset();
   delegate.setReader(input);
   delegate.reset();
 }

 Another thing worth mentioning is that it's crucial to have
 super.method() before delegate.method() in all overridden methods.
 Would be nice if all of this was somewhere in the Tokenizer Javadoc,
 or even nicer if the base class was designed with delegation in mind
 (Effective Java (2nd edition), Item 16).

 Hope this helps somebody.

 [1]
 http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673

 Regards,
 Mindaugas

 On Tue, Jan 7, 2014 at 9:45 PM, Benson Margulies ben...@basistech.com
 wrote:
  Yes I Do.
 
 
  On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir rcm...@gmail.com wrote:
 
  Benson, do you want to open an issue to fix this constructor to not
  take Reader? (there might be one already, but lets make a new one).
 
  These things are supposed to be reused, and have setReader for that
  purpose. i think its confusing and contributes to bugs that you have
  to have logic in e.g. the ctor THEN ALSO in reset().
 
  if someone does it correctly in the ctor, but they only test one
  time, they might think everything is working..
 
  On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies ben...@basistech.com
  wrote:
   For the record of other people who implement tokenizers:
  
   Say that your tokenizer has a constructor, like:
  
public MyTokenizer(Reader reader, ) {
  super(reader);
  myWrappedInputDevice = new MyWrappedInputDevice(reader);
   }
  
   Not a good idea. Tokenizer carefully manages the data flow from the
   constructor arg to the 'input' field. The correct form is:
  
public MyTokenizer(Reader reader, ) {
  super(reader);
  myWrappedInputDevice = new MyWrappedInputDevice(this.input);
   }
  
  
  
   On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir rcm...@gmail.com wrote:
  
   See Tokenizer.java for the state machine logic. In general you should
   not have to do anything if the tokenizer is well-behaved (e.g. close
   calls super.close() and so on).
  
  
  
   On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies 
 bimargul...@gmail.com
  
   wrote:
In 4.6.0,
  
 org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException
   
fails if incrementToken fails to throw if there's a missing reset.
   
How am I supposed to organize this in a Tokenizer? A quick look at
CharTokenizer did not reveal any code for the purpose.
   
   
 -
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
   
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Benson Margulies
I'm not in the delegate business, just a straight subclass. So I think they
are complementary. Gimme your github identity, and you are, as far as I am
concerned, more than welcome to add a section on delegates.



On Wed, Jan 8, 2014 at 7:38 AM, Mindaugas Žakšauskas min...@gmail.com wrote:

 Hi,

 Sure, why not - I'm just not sure if my approach (of setting reader in
 reset()) is preferred over yours (using this.input instead of input in
 ctor)? Or are they both equally good?

 m.

 On Wed, Jan 8, 2014 at 12:18 PM, Benson Margulies ben...@basistech.com
 wrote:
  If you'd like to join in on the doc, see
  https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to
 grant
  you access to push to my fork.
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Benson Margulies
In 4.6.0, org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException

fails if incrementToken fails to throw if there's a missing reset.

How am I supposed to organize this in a Tokenizer? A quick look at
CharTokenizer did not reveal any code for the purpose.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Benson Margulies
For the record of other people who implement tokenizers:

Say that your tokenizer has a constructor, like:

 public MyTokenizer(Reader reader, ) {
   super(reader);
   myWrappedInputDevice = new MyWrappedInputDevice(reader);
}

Not a good idea. Tokenizer carefully manages the data flow from the
constructor arg to the 'input' field. The correct form is:

 public MyTokenizer(Reader reader, ) {
   super(reader);
   myWrappedInputDevice = new MyWrappedInputDevice(this.input);
}



On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir rcm...@gmail.com wrote:

 See Tokenizer.java for the state machine logic. In general you should
 not have to do anything if the tokenizer is well-behaved (e.g. close
 calls super.close() and so on).



 On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies bimargul...@gmail.com
 wrote:
  In 4.6.0,
 org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException
 
  fails if incrementToken fails to throw if there's a missing reset.
 
  How am I supposed to organize this in a Tokenizer? A quick look at
  CharTokenizer did not reveal any code for the purpose.
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Benson Margulies
Yes I Do.


On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir rcm...@gmail.com wrote:

 Benson, do you want to open an issue to fix this constructor to not
 take Reader? (there might be one already, but lets make a new one).

 These things are supposed to be reused, and have setReader for that
 purpose. i think its confusing and contributes to bugs that you have
 to have logic in e.g. the ctor THEN ALSO in reset().

 if someone does it correctly in the ctor, but they only test one
 time, they might think everything is working..

 On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies ben...@basistech.com
 wrote:
  For the record of other people who implement tokenizers:
 
  Say that your tokenizer has a constructor, like:
 
   public MyTokenizer(Reader reader, ) {
 super(reader);
 myWrappedInputDevice = new MyWrappedInputDevice(reader);
  }
 
  Not a good idea. Tokenizer carefully manages the data flow from the
  constructor arg to the 'input' field. The correct form is:
 
   public MyTokenizer(Reader reader, ) {
 super(reader);
 myWrappedInputDevice = new MyWrappedInputDevice(this.input);
  }
 
 
 
  On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir rcm...@gmail.com wrote:
 
  See Tokenizer.java for the state machine logic. In general you should
  not have to do anything if the tokenizer is well-behaved (e.g. close
  calls super.close() and so on).
 
 
 
  On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies bimargul...@gmail.com
 
  wrote:
   In 4.6.0,
  org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException
  
   fails if incrementToken fails to throw if there's a missing reset.
  
   How am I supposed to organize this in a Tokenizer? A quick look at
   CharTokenizer did not reveal any code for the purpose.
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Where is the source for the .dat files in Kuromoji?

2013-12-02 Thread Benson Margulies
There are a handful of binary files
in ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames
ending in .dat.

Trailing around in the source, it seems as if at least one of these derives
from a source file named unk.def.  In turn, this file comes from a
dependency. Should the build generate the file rather than having it in the
tree and shipping it as part of the source release?


Re: Where is the source for the .dat files in Kuromoji?

2013-12-02 Thread Benson Margulies
Thanks.


On Mon, Dec 2, 2013 at 12:21 PM, Uwe Schindler u...@thetaphi.de wrote:

 Hi Benson,

 If you run ant regenerate, it downloads the source files (which is ant
 download-dict) and then rebuilds (ant build-dict) the FSTs and other
 binary stuff stored in the dat file. See also the ivy.xml.

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: Benson Margulies [mailto:ben...@basistech.com]
  Sent: Monday, December 02, 2013 6:12 PM
  To: java-user@lucene.apache.org; Christian Moen
  Subject: Where is the source for the .dat files in Kuromoji?
 
  There are a handful of binary files
  in ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames
  ending in .dat.
 
  Trailing around in the source, it seems as if at least one of these
 derives from
  a source file named unk.def.  In turn, this file comes from a
 dependency.
  should the build generate the file rather than having it in the tree and
  shipped as part of the source release?


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Where is the source for the .dat files in Kuromoji?

2013-12-02 Thread Benson Margulies
On Mon, Dec 2, 2013 at 6:27 PM, Christian Moen c...@atilika.com wrote:

 Hello Benson,

 The sources for the .dat files are available from


 https://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz

 http://atilika.com/releases/mecab-ipadic/mecab-ipadic-2.7.0-20070801.tar.gz





 and a range of other places.

 I’m not sure I follow what you’re saying regarding unk.def -- it’s to my
 knowledge used as-is from the above sources when the binary .dat files are
 made.  (See lucene/analysis/kuromoji/src/tools in the Lucene code tree.)

 Perhaps I’m missing something.  Could you clarify how you think things
 should be done?


I'm not clear that there's anything that anyone would complain of. The
question is, are the .dat files part of the source bundle that is the
'official release'? I just fetched from git, not from the official release,
so I don't know.








 Many thanks,

 Christian Moen
 アティリカ株式会社
 http://www.atilika.com

 On Dec 3, 2013, at 2:11 AM, Benson Margulies ben...@basistech.com wrote:

  There are a handful of binary files in
 ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames ending
 in .dat.
 
  Trailing around in the source, it seems as if at least one of these
 derives from a source file named unk.def.  In turn, this file comes from
 a dependency. should the build generate the file rather than having it in
 the tree and shipped as part of the source release?
 
 




Re: Modify the StandardTokenizerFactory to concatenate all words

2013-11-05 Thread Benson Margulies
How would you expect to recognize that 'Toy Story' is a thing?


On Tue, Nov 5, 2013 at 6:32 PM, Kevin glidekensing...@gmail.com wrote:

 Currently I'm using StandardTokenizerFactory, which tokenizes the words
 based on spaces. For Toy Story it will create the tokens toy and story.
 Ideally, I would want to extend the functionality of
 StandardTokenizerFactory to
 create the tokens toy, story, and toy story. How do I do that?



Threads and LuceneTestCase in 3.6.0

2013-10-31 Thread Benson Margulies
I just backported some code to 3.6.0, and it includes tests that use

org.apache.lucene.analysis.BaseTokenStreamTestCase#checkRandomData(java.util.Random,
org.apache.lucene.analysis.Analyzer, int, int)

The tests that use this method fail in 3.6.0 in ways that suggest that
multiple threads are hitting my token filter in ways that it's not intended
to support.

I've never had a failure like that with 4.1 - 4.5.

Does anyone recall if anything changed here?


Re: new consistency check for token filters in 4.5.1

2013-10-30 Thread Benson Margulies
OK, thanks, for some reason the test of my tokenizer didn't fail but the
test of my token filter with my tokenizer hit the problem. All fixed.



On Wed, Oct 30, 2013 at 2:23 AM, Uwe Schindler u...@thetaphi.de wrote:

 I think this is more a result of the Tokenizer on top not correctly
 implementing end().
 In Lucene 4.6 you will get much better error messages
 (IllegalStateException) because we improved this detection, also during
 runtime.
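
For reference, a minimal sketch of a tokenizer whose end() and reset() keep that
check happy, assuming Lucene 4.x-style construction; the actual tokenization is
elided and the field names are made up.

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public final class WellBehavedTokenizer extends Tokenizer {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
      private int charsRead; // how far incrementToken() has consumed 'input'

      public WellBehavedTokenizer(Reader input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        clearAttributes();
        // ... read from 'input', fill termAtt/offsetAtt, advance charsRead ...
        return false; // elided: return true while tokens remain
      }

      @Override
      public void end() throws IOException {
        super.end(); // must be called first; the consistency check looks for this
        final int finalOffset = correctOffset(charsRead);
        offsetAtt.setOffset(finalOffset, finalOffset);
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        charsRead = 0;
      }
    }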

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de

  -Original Message-
  From: Benson Margulies [mailto:ben...@basistech.com]
  Sent: Wednesday, October 30, 2013 12:30 AM
  To: java-user@lucene.apache.org
  Subject: new consistency check for token filters in 4.5.1
 
My token filter has no end() method at all. Am I required to have an
 end
  method()?
 
  BaseLinguisticsTokenFilterTest.testSegmentationReadings:175-
  Assert.assertTrue:41-Assert.fail:88
  super.end()/clearAttributes() was not called correctly in end()
 
  BaseLinguisticsTokenFilterTest.testSpacesInLemma:189-
  Assert.assertTrue:41-Assert.fail:88
  super.end()/clearAttributes() was not called correctly in end()


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




new consistency check for token filters in 4.5.1

2013-10-29 Thread Benson Margulies
  My token filter has no end() method at all. Am I required to have an end()
method?

BaseLinguisticsTokenFilterTest.testSegmentationReadings:175-Assert.assertTrue:41-Assert.fail:88
super.end()/clearAttributes() was not called correctly in end()

BaseLinguisticsTokenFilterTest.testSpacesInLemma:189-Assert.assertTrue:41-Assert.fail:88
super.end()/clearAttributes() was not called correctly in end()


Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
I'm working on a tool that wants to construct analyzers 'at arm's length' -- a
bit like from a Solr schema -- so that multiple dueling analyzers could be
in their own class loaders at one time. I want to just define a simple
configuration for char filters, tokenizer, and token filter. So it would
be, well, convenient if there were a tokenizer factory at the lucene level
as there is a token filter factory. I can use Solr easily enough for now,
but I'd consider it cleaner if I could define this entirely at the Lucene
level.


Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
OK, so, here I go again making a public idiot of myself. Could it be that
the tokenizer factory is 'relatively recent' as in since 4.1?




On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies ben...@basistech.com wrote:

 I'm working on tool that wants to construct analyzers 'at arms length' --
 a bit like from a solr schema -- so that multiple dueling analyzers could
 be in their own class loaders at one time. I want to just define a simple
 configuration for char filters, tokenizer, and token filter. So it would
 be, well, convenient if there were a tokenizer factory at the lucene level
 as there is a token filter factory. I can use Solr easily enough for now,
 but I'd consider it cleaner if I could define this entirely at the Lucene
 level.




Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
Just how 'experimental' is the SPI system at this point, if that's a
reasonable question?


On Mon, Oct 28, 2013 at 8:41 AM, Uwe Schindler u...@thetaphi.de wrote:

 Hi Benson,

 the base factory class and the abstract Tokenizer, TokenFilter and
 CharFilter factory classes are all in Lucene's analyzers-common module
 (since 4.0). They are no longer part of Solr.

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: Benson Margulies [mailto:ben...@basistech.com]
  Sent: Monday, October 28, 2013 12:41 PM
  To: java-user@lucene.apache.org
  Subject: Re: Why is there a token filter factory abstraction but not a
 tokenizer
  factory abstraction in Lucene?
 
  OK, so, here I go again making a public idiot of myself. Could it be
 that the
  tokenizer factory is 'relatively recent' as in since 4.1?
 
 
 
 
  On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies
  ben...@basistech.comwrote:
 
   I'm working on tool that wants to construct analyzers 'at arms length'
   -- a bit like from a solr schema -- so that multiple dueling analyzers
   could be in their own class loaders at one time. I want to just define
   a simple configuration for char filters, tokenizer, and token filter.
   So it would be, well, convenient if there were a tokenizer factory at
   the lucene level as there is a token filter factory. I can use Solr
   easily enough for now, but I'd consider it cleaner if I could define
   this entirely at the Lucene level.
  
  


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
We have been in the habit of naming classes on the theory that Java
packages are doing work in the namespace.

So, we'd name a class:
com.basistech.something.BaseLinguisticsTokenFilterFactory

So that means that our name in the SPI system is just 'BaseLinguistics'.
That seems a bit problematic. I don't suppose there are some guidelines?


On Mon, Oct 28, 2013 at 9:43 AM, Benson Margulies ben...@basistech.com wrote:

 Just how 'experimental' is the SPI system at this point, if that's a
 reasonable question?


 On Mon, Oct 28, 2013 at 8:41 AM, Uwe Schindler u...@thetaphi.de wrote:

 Hi Benson,

 the base factory class and the abstract Tokenizer, TpokenFilter and
 CharFilter factory classes are all in Lucene's analyzers-commons module
 (since 4.0). They are no longer part of Solr.

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: Benson Margulies [mailto:ben...@basistech.com]
  Sent: Monday, October 28, 2013 12:41 PM
  To: java-user@lucene.apache.org
  Subject: Re: Why is there a token filter factory abstraction but not a
 tokenizer
  factory abstraction in Lucene?
 
  OK, so, here I go again making a public idiot of myself. Could it be
 that the
  tokenizer factory is 'relatively recent' as in since 4.1?
 
 
 
 
  On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies
  ben...@basistech.comwrote:
 
   I'm working on tool that wants to construct analyzers 'at arms length'
   -- a bit like from a solr schema -- so that multiple dueling analyzers
   could be in their own class loaders at one time. I want to just define
   a simple configuration for char filters, tokenizer, and token filter.
   So it would be, well, convenient if there were a tokenizer factory at
   the lucene level as there is a token filter factory. I can use Solr
   easily enough for now, but I'd consider it cleaner if I could define
   this entirely at the Lucene level.
  
  


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





Anyone interested in a worked-out example of the SPIs for analyzer components?

2013-10-28 Thread Benson Margulies
I just built myself a sort of Solr-schema-in-a-test-tube. It's a class that
builds a classloader on some JAR files and then uses the SPI mechanism to
manufacture Analyzer objects made out of tokenizers and filters.

I can make this visible in github, or even attach it to a JIRA, if anyone
is interested.

For my own nefarious reasons, this acquires the JAR files from Maven
repositories via Aether, but it wouldn't be hard to adjust for use with
plain old pathnames or something.
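
A minimal sketch of the factory-plus-SPI assembly described above, assuming Lucene
4.x analyzers-common: the classloader isolation and the Maven/Aether fetching are
left out, and the luceneMatchVersion value is just an example.

    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.util.TokenFilterFactory;
    import org.apache.lucene.analysis.util.TokenizerFactory;

    public final class SpiAssembledAnalyzer {
      public static Analyzer build() {
        // Look the factories up by their SPI names; each factory consumes its own args map.
        Map<String, String> tokArgs = new HashMap<String, String>();
        tokArgs.put("luceneMatchVersion", "4.5");
        final TokenizerFactory tokenizer = TokenizerFactory.forName("standard", tokArgs);

        Map<String, String> filterArgs = new HashMap<String, String>();
        filterArgs.put("luceneMatchVersion", "4.5");
        final TokenFilterFactory lowercase = TokenFilterFactory.forName("lowercase", filterArgs);

        return new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = tokenizer.create(reader);
            TokenStream sink = lowercase.create(source);
            return new TokenStreamComponents(source, sink);
          }
        };
      }
    }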


Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Benson Margulies
It might be helpful if you would explain, at a higher level, what you
are trying to accomplish. Where do these things come from? What
higher-level problem are you trying to solve?

On Sun, Oct 20, 2013 at 7:12 PM, saisantoshi saisantosh...@gmail.com wrote:
 Thanks.

 So, if I understand correctly, StandardAnalyzer won't work for the following
 below as it strips out the special characters and does search only on
 searchText ( in this case).

 queryText = *searchText*

 If we want to do a search like *** then we need to use
 WhiteSpaceAnalyzer. Please let me know if my understanding is correct.

 Also, I am not sure as the following is mentioned in the lucene docs? Is the
 below not for StandardAnalyzer then? It is not mentioned that it won't work
 for StandardAnalyzer.

 /*
 Escaping Special Characters

 Lucene supports escaping special characters that are part of the query
 syntax. The current list special characters are

 + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

 To escape these character use the \ before the character. For example to
 search for (1+1):2 use the query:

 \(1\+1\)\:2

 */
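As an aside, the same escaping can be applied programmatically rather than by hand; a small
sketch against the 4.x classic QueryParser (the field name and analyzer choice are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class EscapeSketch {
    public static Query parseLiterally(String userInput) throws Exception {
        // Backslash-escapes every query-syntax character, e.g. (1+1):2 -> \(1\+1\)\:2
        String escaped = QueryParser.escape(userInput);
        QueryParser parser = new QueryParser(Version.LUCENE_40, "contents",
                new StandardAnalyzer(Version.LUCENE_40));
        // Note: escaping only protects the query syntax; the analyzer may still strip the
        // punctuation at analysis time, which is the separate StandardAnalyzer issue discussed here.
        return parser.parse(escaped);
    }
}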

 Thanks,
 Sai.




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Handling-special-characters-in-Lucene-4-0-tp4096674p4096727.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Exploiting a whole lot of memory

2013-10-10 Thread Benson Margulies
On Wed, Oct 9, 2013 at 7:18 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Wed, Oct 9, 2013 at 7:13 PM, Benson Margulies ben...@basistech.com
 wrote:
  On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless 
  luc...@mikemccandless.com wrote:
 
  DirectPostingsFormat?
 
  It stores all terms + postings as simple java arrays, uncompressed.
 
 
  This definitely speeded things up in my benchmark, but I'm greedy for
 more.
   I just made a codec that returns it as the postings guy, is that the
 whole
  recipe?. Does it make sense to extend it any further to any of the other
  codec pieces?

 Yes, that's all you should need to do (you should have seen RAM usage
 go up too, to confirm :) ).

 Really this just addressed one hotspot (decoding terms/postings from
 the index); the query matching + scoring is also costly, and if you do
 other stuff (highlighting, spell correction) that can be costly too
 ... what kind of queries are you running / where are the hotspots in
 profiling?




Profile shows a lot of time in org.apache.lucene.search.BooleanScorer$BooleanScorerCollector.collect(int).

We know that a typical query inspects about 1/2 of the documents in the
index.




 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Exploiting a whole lot of memory

2013-10-09 Thread Benson Margulies
On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 DirectPostingsFormat?

 It stores all terms + postings as simple java arrays, uncompressed.


This definitely speeded things up in my benchmark, but I'm greedy for more.
 I just made a codec that returns it as the postings guy, is that the whole
recipe? Does it make sense to extend it any further to any of the other
codec pieces?
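For anyone reading along later, "a codec that returns it as the postings guy" boils down to
something like the sketch below (assuming Lucene 4.3, where Lucene42Codec is the default and
exposes getPostingsFormatForField for exactly this kind of override):

import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;
import org.apache.lucene.codecs.memory.DirectPostingsFormat;

// Serves every field's terms and postings from uncompressed in-memory arrays.
public class DirectEverywhereCodec extends Lucene42Codec {
    private final PostingsFormat direct = new DirectPostingsFormat();

    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
        return direct;
    }
}

The format is chosen at write time, so the index has to be written (or rewritten) with this codec
installed via IndexWriterConfig.setCodec; reading then resolves the per-field format by name.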


 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies ben...@basistech.com
 wrote:
  Consider a Lucene index consisting of 10m documents with a total disk
  footprint of 3G. Consider an application that treats this index as
  read-only, and runs very complex queries over it. Queries with many
 terms,
  some of them 'fuzzy' and 'should' terms and a dismax. And, finally,
  consider doing all this on a box with over 100G of physical memory, some
  cores, and nothing else to do with its time.
 
  I should probably just stop here and see what thoughts come back, but
 I'll
  go out on a limb and type the word 'codec'. The MMapDirectory, of course,
  cheerfully gets to keep every single bit in memory. And then each query
  runs, exercising the codec, building up a flurry of Java objects,
 all
  of which turn into garbage and we start all over. So, I find myself
  wondering, is there some sort of an opportunity for a codec-that-caches
 in
  here? In other words, I'd like to sell some of my space to buy some time.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Exploiting a whole lot of memory

2013-10-09 Thread Benson Margulies
On Wed, Oct 9, 2013 at 7:18 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Wed, Oct 9, 2013 at 7:13 PM, Benson Margulies ben...@basistech.com
 wrote:
  On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless 
  luc...@mikemccandless.com wrote:
 
  DirectPostingsFormat?
 
  It stores all terms + postings as simple java arrays, uncompressed.
 
 
  This definitely speeded things up in my benchmark, but I'm greedy for
 more.
   I just made a codec that returns it as the postings guy, is that the
 whole
  recipe?. Does it make sense to extend it any further to any of the other
  codec pieces?

 Yes, that's all you should need to do (you should have seen RAM usage
 go up too, to confirm :) ).


Yes I did that and saw that.




 Really this just addressed one hotspot (decoding terms/postings from
 the index); the query matching + scoring is also costly, and if you do
 other stuff (highlighting, spell correction) that can be costly too
 ... what kind of queries are you running / where are the hotspots in
 profiling?


no 'other stuff' just matching and scoring -- of an embarrassingly complex
query. I will post some results of profiling tomorrow. I had profiled
extensively with lucene 3, we just got the code moved to lucene 4.3, and
the very first thing I did was run this. In lucene 3 there was a very busy
PriorityQueue in there somewhere; but I don't want to waste time and
bandwidth on details until they are 4.x details.



 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Analyzer classes versus the constituent components

2013-10-08 Thread Benson Margulies
Is there some advice around about when it's appropriate to create an
Analyzer class, as opposed to just Tokenizer and TokenFilter classes?

The advantage of the constituent elements is that they allow the
consuming application to add more filters. The only disadvantage I see
is that the following is a bit on the verbose side. Is there some
advantage or use of an Analyzer class that I'm missing?

private Analyzer newAnalyzer() {
    return new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = tokenizerFactory.create(reader, LanguageCode.JAPANESE);
            com.basistech.rosette.bl.Analyzer rblAnalyzer;
            try {
                rblAnalyzer = analyzerFactory.create(LanguageCode.JAPANESE);
            } catch (IOException e) {
                throw new RuntimeException("Error creating RBL analyzer", e);
            }
            BaseLinguisticsTokenFilter filter = new BaseLinguisticsTokenFilter(source, rblAnalyzer);
            return new TokenStreamComponents(source, filter);
        }
    };
}

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Consider a Lucene index consisting of 10m documents with a total disk
footprint of 3G. Consider an application that treats this index as
read-only, and runs very complex queries over it. Queries with many terms,
some of them 'fuzzy' and 'should' terms and a dismax. And, finally,
consider doing all this on a box with over 100G of physical memory, some
cores, and nothing else to do with its time.

I should probably just stop here and see what thoughts come back, but I'll
go out on a limb and type the word 'codec'. The MMapDirectory, of course,
cheerfully gets to keep every single bit in memory. And then each query
runs, exercising the codec, building up a flurry of Java objects, all
of which turn into garbage and we start all over. So, I find myself
wondering, is there some sort of an opportunity for a codec-that-caches in
here? In other words, I'd like to sell some of my space to buy some time.


Re: Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Mike, where do I find DirectPostingFormat?


On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 DirectPostingsFormat?

 It stores all terms + postings as simple java arrays, uncompressed.

 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies ben...@basistech.com
 wrote:
  Consider a Lucene index consisting of 10m documents with a total disk
  footprint of 3G. Consider an application that treats this index as
  read-only, and runs very complex queries over it. Queries with many
 terms,
  some of them 'fuzzy' and 'should' terms and a dismax. And, finally,
  consider doing all this on a box with over 100G of physical memory, some
  cores, and nothing else to do with its time.
 
  I should probably just stop here and see what thoughts come back, but
 I'll
  go out on a limb and type the word 'codec'. The MMapDirectory, of course,
  cheerfully gets to keep every single bit in memory. And then each query
  runs, exercising the codec, building up a flurry of Java objects,
 all
  of which turn into garbage and we start all over. So, I find myself
  wondering, is there some sort of an opportunity for a codec-that-caches
 in
  here? In other words, I'd like to sell some of my space to buy some time.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Oh, drat, I left out an 's'. I got it now.


On Tue, Oct 8, 2013 at 7:40 PM, Benson Margulies ben...@basistech.com wrote:

 Mike, where do I find DirectPostingFormat?


 On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 DirectPostingsFormat?

 It stores all terms + postings as simple java arrays, uncompressed.

 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies ben...@basistech.com
 wrote:
  Consider a Lucene index consisting of 10m documents with a total disk
  footprint of 3G. Consider an application that treats this index as
  read-only, and runs very complex queries over it. Queries with many
 terms,
  some of them 'fuzzy' and 'should' terms and a dismax. And, finally,
  consider doing all this on a box with over 100G of physical memory, some
  cores, and nothing else to do with its time.
 
  I should probably just stop here and see what thoughts come back, but
 I'll
  go out on a limb and type the word 'codec'. The MMapDirectory, of
 course,
  cheerfully gets to keep every single bit in memory. And then each query
  runs, exercising the codec, building up a flurry of Java objects,
 all
  of which turn into garbage and we start all over. So, I find myself
  wondering, is there some sort of an opportunity for a codec-that-caches
 in
  here? In other words, I'd like to sell some of my space to buy some
 time.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Benson Margulies
On Tue, Oct 1, 2013 at 3:58 PM, Desidero desid...@gmail.com wrote:
 Benson,

 Rather than forcing a random number of small segments into the index using
 maxMergedSegmentMB, it might be better to split your index into multiple
 shards. You can create a specific number of balanced shards to control the
 parallelism and then forceMerge each shard down to 1 segment to avoid
 spawning extra threads per shard. Once that's done, you just open all of
 the shards with a MultiReader and use that with the IndexSearcher and an
 ExecutorService.

 The downside to this is that it doesn't play nicely with near real-time
 search, but if you have a relatively static index that gets pushed to
 slaves periodically it gets the job done.
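A bare-bones version of that arrangement, for concreteness (a sketch only; shard locations,
thread count, and lifecycle management are invented or left out):

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class ShardedSearcherSketch {
    public static IndexSearcher open(String[] shardPaths, int threads) throws Exception {
        IndexReader[] shards = new IndexReader[shardPaths.length];
        for (int i = 0; i < shardPaths.length; i++) {
            shards[i] = DirectoryReader.open(FSDirectory.open(new File(shardPaths[i])));
        }
        // One force-merged segment per shard means one parallel work unit per shard.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // The caller owns the pool; IndexSearcher does not shut it down.
        return new IndexSearcher(new MultiReader(shards), pool);
    }
}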

 As Mike said, it'd be nicer if there was a way to split the docID space
 into virtual shards, but it's not currently available. I'm not sure if
 anyone is even looking into it.

Thanks, folks, for all the help. I'm musing about the top-level issue
here, which is whether the important case is many independent queries
or latency of just one.  In the case where it's just one, we'll follow
the shard-related advice.





 Regards,
 Matt


 On Tue, Oct 1, 2013 at 7:09 AM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 You might want to set a smallish maxMergedSegmentMB in
 TieredMergePolicy to force enough segments in the index ... sort of
 the opposite of optimizing.

 Really, IndexSearcher's approach to using one thread per segment is
 rather silly, and, it's annoying/bad to expose change in behavior due
 to segment structure.

 I think it'd be better to carve up the overall docID space into N
 virtual shards.  Ie, if you have 100M docs, then one thread searches
 docs 0-10M, another 10M-20M, etc.  Nobody has created such a searcher
 impl but it should not be hard and it would be agnostic to the segment
 structure.

 But then again, this need (using concurrent hardware to reduce latency
 of a single query) is somewhat rare; most apps are fine using the
 concurrency across queries rather than within one query.

 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Oct 1, 2013 at 7:09 AM, Adrien Grand jpou...@gmail.com wrote:
  Hi Benson,
 
  On Mon, Sep 30, 2013 at 5:21 PM, Benson Margulies ben...@basistech.com
 wrote:
  The multithreaded index searcher fans out across segments. How
 aggressively
  does 'optimize' reduce the number of segments? If the segment count goes
  way down, is there some other way to exploit multiple cores?
 
  forceMerge[1], formerly known as optimize, takes a parameter to
  configure how many segments should remain in the index.
 
  Regarding multi-core usage, if your query load is high enough to use
  all your CPUs (there are always #cores queries running in parallel),
  there is generally no need to use the multi-threaded IndexSearcher.
  The multi-threaded index searcher can however help in case all CPU
  power is not in use or if you care more about latency than throughput.
  It indeed leverages the fact that the index is split into segments
  to parallelize query execution, so a fully merged index will actually
  run the query in a single thread in any case.
 
  There is no way to make query execution efficiently use several cores
  on a single-segment index so if you really want to parallelize query
  execution, you will have to shard the index to do at the index level
  what the multi-threaded IndexSearcher does at the segment level.
 
  Side notes:
   - A single segment index only runs more efficiently queries which are
  terms-dictionary-intensive, it is generally discouraged to run
  forceMerge on an index unless this index is read-only.
   - The multi-threaded index searcher only parallelizes query execution
  in certain cases. In particular, it never parallelizes execution when
  the method takes a collector. This means that if you want to use
  TotalHitCountCollector to count matches, you will have to do the
  parallelization by yourself.
 
  [1]
 http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html#forceMerge%28int%29
 
  --
  Adrien
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



How to make good use of the multithreaded IndexSearcher?

2013-09-30 Thread Benson Margulies
The multithreaded index searcher fans out across segments. How aggressively
does 'optimize' reduce the number of segments? If the segment count goes
way down, is there some other way to exploit multiple cores?


Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Benson Margulies
Thanks, I might pitch in.


On Mon, Sep 16, 2013 at 12:58 PM, Robert Muir rcm...@gmail.com wrote:

 Mostly because our tokenizers like StandardTokenizer will tokenize the
 same way regardless of normalization form or whether its normalized at
 all?

 But for other tokenizers, such a charfilter should be useful: there is
 a JIRA for it, but it has some unresolved issues

 https://issues.apache.org/jira/browse/LUCENE-4072

 On Sun, Sep 15, 2013 at 7:05 PM, Benson Margulies bimargul...@gmail.com
 wrote:
  Can anyone shed light as to why this is a token filter and not a char
  filter? I'm wishing for one of these _upstream_ of a tokenizer, so that
 the
  tokenizer's lookups in its dictionaries are seeing normalized contents.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Benson Margulies
Can anyone shed light as to why this is a token filter and not a char
filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the
tokenizer's lookups in its dictionaries are seeing normalized contents.


Re: PositionLengthAttribute

2013-09-07 Thread Benson Margulies
In Japanese, compounds are just decompositions of the input string. In
other languages, compounds can manufacture entire tokens from thin
air. In those cases, it's something of a question how to decide on the
offsets. I think that you're right, eventually, insofar as there's
some offset in the original that might as well be blamed for any given
component.


On Fri, Sep 6, 2013 at 9:37 PM, Robert Muir rcm...@gmail.com wrote:
 On Fri, Sep 6, 2013 at 9:32 PM, Benson Margulies ben...@basistech.com wrote:
 On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir rcm...@gmail.com wrote:
 its the latter. the way its designed to work i think is illustrated
 best in kuromoji analyzer where it heuristically decompounds nouns:

 if it decompounds ABCD into AB + CD, then the tokens are AB and CD.
 these both have posinc=1.
 however (to compensate for precision issue you mentioned on the other
 thread), it keeps the full compound as a synonym too (there are some
 papers benchmarking this approach for decompounding, just think of IDF
 etc sorting things out).
 so that ABCD synonym has position increment 0, and it sits at the
 same position as the first token (AB). but it has positionLength=2,
 which basically keeps the information in the chain that this synonym
 spans across both AB and CD.

 so the output is like this: AB(posinc=1,posLength=1),
 ABCD(posinc=0,posLength=2), CD(posinc=1, posLength=1)

 I suppose this works best if you actually know the offsets of the
 pieces. In disassembling German, this is not always straightforward.


 i dont really see how it has anything to do with natural languages?
 its just the way you represent the compound components in the
 tokenstream.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: PositionLengthAttribute

2013-09-07 Thread Benson Margulies
On Sat, Sep 7, 2013 at 8:39 AM, Robert Muir rcm...@gmail.com wrote:
 On Sat, Sep 7, 2013 at 7:44 AM, Benson Margulies ben...@basistech.com wrote:
 In Japanese, compounds are just decompositions of the input string. In
 other languages, compounds can manufacture entire tokens from thin
 air. In those cases, it's something of a question how to decide on the
 offsets. I think that you're right, eventually, insofar as there's
 some offset in the original that might as well be blamed for any given
 component.


 Why change the offsets then? Offsets are for highlighting. Let the
 whole compound be highlighted when its a match in search results. Its
 transparent and totally accurate as to what is happening: this is why
 we do highlighting, to aid the user can make a relevance assessment
 about the document, not to try to assist the end user to debug the
 analysis chain or anything like that.

Thanks, that's very helpful. I spend all my time crawling around the
underside of this stuff and I lack perspective.



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: LookaheadTokenFilter

2013-09-07 Thread Benson Margulies
nextToken() calls peekToken(). That seems to prevent my lookahead
processing from seeing that item later. Am I missing something?


On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies ben...@basistech.com wrote:
 I think that the penny just dropped, and I should not be using this class.

 If I call peekToken 10 times while sitting at token 0, this class will
 stack up all 10 of these _at token position 0_. That's not really very
 helpful for what I'm doing. I need to borrow code from this class and
 not use it.

 On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies ben...@basistech.com wrote:
 Michael,

 I'm apparently not fully deconfused yet.

 I've got a very simple incrementToken function. It calls peekToken to
 stack up the tokens.

 afterPosition is never called; I expected it to be called as each of
 the peeked tokens gets next-ed back out.

 I assume that I'm missing something simple.


 public boolean incrementToken() throws IOException {
 if (positions.getMaxPos() < 0) {
 peekSentence();
 }
 return nextToken();
 }



 On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com 
 wrote:
 On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com 
 wrote:
  I'm trying to work through the logic of reading ahead until I've seen
  marker for the end of a sentence, then applying some analysis to all of 
  the
  tokens of the sentence, and then changing some attributes of each token 
  to
  reflect the results.
 
  The queue of tokens for a position is just a State, so there isn't an API
  there to set any values.
 
  So do I need to subclass Position for myself, store the additional
  information in there, and set the attributes as each token comes by on 
  the
  output side?

 Yes, that sounds right.  Either that or, on emitting the eventual
 Tokens, apply your logic there (because at that point, after
 restoreState, you have access to all the attr values for that token).

  I would be grateful for a bit more explanation of afterPosition versus
  incrementToken; some of the mock classes call peek from afterPosition, 
  and
  I expected to see peek called in incrementToken based on the javadoc.

 afterPosition is where your subclass can insert new tokens.

 I think (it's been a while here...) you are allowed to call peekToken
 in afterPosition; this is necessary if your logic about inserting
 additional tokens leaving a given position depends on future tokens.

 But: are you doing any new token insertion?  Or are you just tweaking
 the attributes of the tokens that pass through the filter?  If it's
 the latter then this class may be overkill ... you could make a simple
 TokenFilter.incrementToken that just enumerates & saves all input
 tokens, does its processing, then returns those tokens one by one,
 instead.

 I'm not adding tokens yet, but I will be soon, so all of this isn't
 entirely crazy. The underlying capability here includes decompounding.
 (I have mixed feelings about just adding all the fragments to the
 token stream, as it can reduce precision, but there isn't an obvious
 alternative (except perhaps to suppress the super-common ones)).

 So, to summarize, logic might be:

 in incrementToken:

 If positions.getMaxPos() > -1, just return nextToken(). If not, loop
 calling peekToken to acquire a sentence, process the sentence, and
 attach the lemmas and compound-pieces to the Position subclass
 objects.

 in afterPosition, as each token comes 'into focus', splat the lemma
 from the Position into the char term attribute, and insert new tokens
 as needed for the compound components.

 Thanks,
 benson







 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: LookaheadTokenFilter

2013-09-07 Thread Benson Margulies
I think I had better build you a test case for this situation, and
attach it to a JIRA.

On Sat, Sep 7, 2013 at 3:33 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 Something is wrong; I'm not sure what offhand, but calling peekToken
 10 times should not stack all tokens @ position 0; it should stack the
 tokens at the positions where they occurred.  Are you sure the posIncr
 att is sometimes 1 (i.e., the position is in fact moving forward for
 some tokens)?

 nextToken() only calls peekToken() once the lookahead buffer is exhausted.

 afterPosition() should be called within nextToken(), for each
 position, once all tokens leaving that position are done.

 You use case *should* be working: inside your incrementToken() you
 call peekToken() over and over until you've seen the full sentence
 (saving away any state in your subclass of Position), then nextToken()
 to emit the buffered tokens, and to insert your own tokens when
 afterPosition() is called ...

 Mike McCandless

 http://blog.mikemccandless.com


 On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies ben...@basistech.com wrote:
 nextToken() calls peekToken(). That seems to prevent my lookahead
 processing from seeing that item later. Am I missing something?


 On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies ben...@basistech.com 
 wrote:
 I think that the penny just dropped, and I should not be using this class.

 If I call peekToken 10 times while sitting at token 0, this class will
 stack up all 10 of these _at token position 0_. That's not really very
 helpful for what I'm doing. I need to borrow code from this class and
 not use it.

 On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies ben...@basistech.com 
 wrote:
 Michael,

 I'm apparently not fully deconfused yet.

 I've got a very simple incrementToken function. It calls peekToken to
 stack up the tokens.

 afterPosition is never called; I expected it to be called as each of
 the peeked tokens gets next-ed back out.

 I assume that I'm missing something simple.


 public boolean incrementToken() throws IOException {
 if (positions.getMaxPos() < 0) {
 peekSentence();
 }
 return nextToken();
 }



 On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com 
 wrote:
 On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com 
 wrote:
  I'm trying to work through the logic of reading ahead until I've seen
  marker for the end of a sentence, then applying some analysis to all 
  of the
  tokens of the sentence, and then changing some attributes of each 
  token to
  reflect the results.
 
  The queue of tokens for a position is just a State, so there isn't an 
  API
  there to set any values.
 
  So do I need to subclass Position for myself, store the additional
  information in there, and set the attributes as each token comes by on 
  the
  output side?

 Yes, that sounds right.  Either that or, on emitting the eventual
 Tokens, apply your logic there (because at that point, after
 restoreState, you have access to all the attr values for that token).

  I would be grateful for a bit more explanation of afterPosition versus
  incrementToken; some of the mock classes call peek from afterPosition, 
  and
  I expected to see peek called in incrementToken based on the javadoc.

 afterPosition is where your subclass can insert new tokens.

 I think (it's been a while here...) you are allowed to call peekToken
 in afterPosition; this is necessary if your logic about inserting
 additional tokens leaving a given position depends on future tokens.

 But: are you doing any new token insertion?  Or are you just tweaking
 the attributes of the tokens that pass through the filter?  If it's
 the latter then this class may be overkill ... you could make a simple
 TokenFilter.incrementToken that just enumerates & saves all input
 tokens, does its processing, then returns those tokens one by one,
 instead.

 I'm not adding tokens yet, but I will be soon, so all of this isn't
 entirely crazy. The underlying capability here includes decompounding.
 (I have mixed feelings about just adding all the fragments to the
 token stream, as it can reduce precision, but there isn't an obvious
 alternative (except perhaps to suppress the super-common ones)).

 So, to summarize, logic might be:

 in incrementToken:

 If positions.getMaxPos() > -1, just return nextToken(). If not, loop
 calling peekToken to acquire a sentence, process the sentence, and
 attach the lemmas and compound-pieces to the Position subclass
 objects.

 in afterPosition, as each token comes 'into focus', splat the lemma
 from the Position into the char term attribute, and insert new tokens
 as needed for the compound components.

 Thanks,
 benson







 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java

Re: LookaheadTokenFilter

2013-09-07 Thread Benson Margulies
LUCENE-5202. It seems to show the problem of the extra peek. I'm still
struggling to make sense of the 'problem' of not always calling
afterPosition(); that may be entirely my own confusion.

On Sat, Sep 7, 2013 at 4:21 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 That would be awesome, thanks!

 Mike McCandless

 http://blog.mikemccandless.com


 On Sat, Sep 7, 2013 at 3:40 PM, Benson Margulies ben...@basistech.com wrote:
 I think I had better build you a test case for this situation, and
 attach it to a JIRA.

 On Sat, Sep 7, 2013 at 3:33 PM, Michael McCandless
 luc...@mikemccandless.com wrote:
 Something is wrong; I'm not sure what offhand, but calling peekToken
 10 times should not stack all tokens @ position 0; it should stack the
 tokens at the positions where they occurred.  Are you sure the posIncr
 att is sometimes 1 (i.e., the position is in fact moving forward for
 some tokens)?

 nextToken() only calls peekToken() once the lookahead buffer is exhausted.

 afterPosition() should be called within nextToken(), for each
 position, once all tokens leaving that position are done.

 You use case *should* be working: inside your incrementToken() you
 call peekToken() over and over until you've seen the full sentence
 (saving away any state in your subclass of Position), then nextToken()
 to emit the buffered tokens, and to insert your own tokens when
 afterPosition() is called ...

 Mike McCandless

 http://blog.mikemccandless.com


 On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies ben...@basistech.com 
 wrote:
 nextToken() calls peekToken(). That seems to prevent my lookahead
 processing from seeing that item later. Am I missing something?


 On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies ben...@basistech.com 
 wrote:
 I think that the penny just dropped, and I should not be using this class.

 If I call peekToken 10 times while sitting at token 0, this class will
 stack up all 10 of these _at token position 0_. That's not really very
 helpful for what I'm doing. I need to borrow code from this class and
 not use it.

 On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies ben...@basistech.com 
 wrote:
 Michael,

 I'm apparently not fully deconfused yet.

 I've got a very simple incrementToken function. It calls peekToken to
 stack up the tokens.

 afterPosition is never called; I expected it to be called as each of
 the peeked tokens gets next-ed back out.

 I assume that I'm missing something simple.


 public boolean incrementToken() throws IOException {
 if (positions.getMaxPos() < 0) {
 peekSentence();
 }
 return nextToken();
 }



 On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com 
 wrote:
 On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies 
 ben...@basistech.com wrote:
  I'm trying to work through the logic of reading ahead until I've seen
  marker for the end of a sentence, then applying some analysis to all 
  of the
  tokens of the sentence, and then changing some attributes of each 
  token to
  reflect the results.
 
  The queue of tokens for a position is just a State, so there isn't 
  an API
  there to set any values.
 
  So do I need to subclass Position for myself, store the additional
  information in there, and set the attributes as each token comes by 
  on the
  output side?

 Yes, that sounds right.  Either that or, on emitting the eventual
 Tokens, apply your logic there (because at that point, after
 restoreState, you have access to all the attr values for that token).

  I would be grateful for a bit more explanation of afterPosition 
  versus
  incrementToken; some of the mock classes call peek from 
  afterPosition, and
  I expected to see peek called in incrementToken based on the javadoc.

 afterPosition is where your subclass can insert new tokens.

 I think (it's been a while here...) you are allowed to call peekToken
 in afterPosition; this is necessary if your logic about inserting
 additional tokens leaving a given position depends on future tokens.

 But: are you doing any new token insertion?  Or are you just tweaking
 the attributes of the tokens that pass through the filter?  If it's
 the latter then this class may be overkill ... you could make a simple
 TokenFilter.incrementToken that just enumerates & saves all input
 tokens, does its processing, then returns those tokens one by one,
 instead.

 I'm not adding tokens yet, but I will be soon, so all of this isn't
 entirely crazy. The underlying capability here includes decompounding.
 (I have mixed feelings about just adding all the fragments to the
 token stream, as it can reduce precision, but there isn't an obvious
 alternative (except perhaps to suppress the super-common ones)).

 So, to summarize, logic might be:

 in incrementToken:

 If positions.getMaxPos() > -1, just return nextToken(). If not, loop
 calling peekToken to acquire a sentence, process

Re: LookaheadTokenFilter

2013-09-06 Thread Benson Margulies
On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
luc...@mikemccandless.com wrote:

 On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com wrote:
  I'm trying to work through the logic of reading ahead until I've seen
  marker for the end of a sentence, then applying some analysis to all of the
  tokens of the sentence, and then changing some attributes of each token to
  reflect the results.
 
  The queue of tokens for a position is just a State, so there isn't an API
  there to set any values.
 
  So do I need to subclass Position for myself, store the additional
  information in there, and set the attributes as each token comes by on the
  output side?

 Yes, that sounds right.  Either that or, on emitting the eventual
 Tokens, apply your logic there (because at that point, after
 restoreState, you have access to all the attr values for that token).

  I would be grateful for a bit more explanation of afterPosition versus
  incrementToken; some of the mock classes call peek from afterPosition, and
  I expected to see peek called in incrementToken based on the javadoc.

 afterPosition is where your subclass can insert new tokens.

 I think (it's been a while here...) you are allowed to call peekToken
 in afterPosition; this is necessary if your logic about inserting
 additional tokens leaving a given position depends on future tokens.

 But: are you doing any new token insertion?  Or are you just tweaking
 the attributes of the tokens that pass through the filter?  If it's
 the latter then this class may be overkill ... you could make a simple
 TokenFilter.incrementToken that just enumerates & saves all input
 tokens, does its processing, then returns those tokens one by one,
 instead.

I'm not adding tokens yet, but I will be soon, so all of this isn't
entirely crazy. The underlying capability here includes decompounding.
(I have mixed feelings about just adding all the fragments to the
token stream, as it can reduce precision, but there isn't an obvious
alternative (except perhaps to suppress the super-common ones)).

So, to summarize, logic might be:

in incrementToken:

If positions.getMaxPos() > -1, just return nextToken(). If not, loop
calling peekToken to acquire a sentence, process the sentence, and
attach the lemmas and compound-pieces to the Position subclass
objects.

in afterPosition, as each token comes 'into focus', splat the lemma
from the Position into the char term attribute, and insert new tokens
as needed for the compound components.

Thanks,
benson







 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



PositionLengthAttribute

2013-09-06 Thread Benson Margulies
I'm confused by the comment about compound components here.

If a single token fissions into multiple tokens, then what belongs in
the PositionLengthAttribute? I'm wanting to store a fraction in here!
Or is the idea to store N in the 'mother' token and then '1' in each
of the babies?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: LookaheadTokenFilter

2013-09-06 Thread Benson Margulies
Michael,

I'm apparently not fully deconfused yet.

I've got a very simple incrementToken function. It calls peekToken to
stack up the tokens.

afterPosition is never called; I expected it to be called as each of
the peeked tokens gets next-ed back out.

I assume that I'm missing something simple.


public boolean incrementToken() throws IOException {
    if (positions.getMaxPos() < 0) {
        peekSentence();
    }
    return nextToken();
}



On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com wrote:
 On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com 
 wrote:
  I'm trying to work through the logic of reading ahead until I've seen
  marker for the end of a sentence, then applying some analysis to all of the
  tokens of the sentence, and then changing some attributes of each token to
  reflect the results.
 
  The queue of tokens for a position is just a State, so there isn't an API
  there to set any values.
 
  So do I need to subclass Position for myself, store the additional
  information in there, and set the attributes as each token comes by on the
  output side?

 Yes, that sounds right.  Either that or, on emitting the eventual
 Tokens, apply your logic there (because at that point, after
 restoreState, you have access to all the attr values for that token).

  I would be grateful for a bit more explanation of afterPosition versus
  incrementToken; some of the mock classes call peek from afterPosition, and
  I expected to see peek called in incrementToken based on the javadoc.

 afterPosition is where your subclass can insert new tokens.

 I think (it's been a while here...) you are allowed to call peekToken
 in afterPosition; this is necessary if your logic about inserting
 additional tokens leaving a given position depends on future tokens.

 But: are you doing any new token insertion?  Or are you just tweaking
 the attributes of the tokens that pass through the filter?  If it's
 the latter then this class may be overkill ... you could make a simple
 TokenFilter.incrementToken that just enumerates & saves all input
 tokens, does its processing, then returns those tokens one by one,
 instead.

 I'm not adding tokens yet, but I will be soon, so all of this isn't
 entirely crazy. The underlying capability here includes decompounding.
 (I have mixed feelings about just adding all the fragments to the
 token stream, as it can reduce precision, but there isn't an obvious
 alternative (except perhaps to suppress the super-common ones)).

 So, to summarize, logic might be:

 in incrementToken:

 If positions.getMaxPos() > -1, just return nextToken(). If not, loop
 calling peekToken to acquire a sentence, process the sentence, and
 attach the lemmas and compound-pieces to the Position subclass
 objects.

 in afterPosition, as each token comes 'into focus', splat the lemma
 from the Position into the char term attribute, and insert new tokens
 as needed for the compound components.

 Thanks,
 benson







 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: LookaheadTokenFilter

2013-09-06 Thread Benson Margulies
I think that the penny just dropped, and I should not be using this class.

If I call peekToken 10 times while sitting at token 0, this class will
stack up all 10 of these _at token position 0_. That's not really very
helpful for what I'm doing. I need to borrow code from this class and
not use it.

On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies ben...@basistech.com wrote:
 Michael,

 I'm apparently not fully deconfused yet.

 I've got a very simple incrementToken function. It calls peekToken to
 stack up the tokens.

 afterPosition is never called; I expected it to be called as each of
 the peeked tokens gets next-ed back out.

 I assume that I'm missing something simple.


 public boolean incrementToken() throws IOException {
 if (positions.getMaxPos() < 0) {
 peekSentence();
 }
 return nextToken();
 }



 On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com wrote:
 On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com 
 wrote:
  I'm trying to work through the logic of reading ahead until I've seen
  marker for the end of a sentence, then applying some analysis to all of 
  the
  tokens of the sentence, and then changing some attributes of each token to
  reflect the results.
 
  The queue of tokens for a position is just a State, so there isn't an API
  there to set any values.
 
  So do I need to subclass Position for myself, store the additional
  information in there, and set the attributes as each token comes by on the
  output side?

 Yes, that sounds right.  Either that or, on emitting the eventual
 Tokens, apply your logic there (because at that point, after
 restoreState, you have access to all the attr values for that token).

  I would be grateful for a bit more explanation of afterPosition versus
  incrementToken; some of the mock classes call peek from afterPosition, and
  I expected to see peek called in incrementToken based on the javadoc.

 afterPosition is where your subclass can insert new tokens.

 I think (it's been a while here...) you are allowed to call peekToken
 in afterPosition; this is necessary if your logic about inserting
 additional tokens leaving a given position depends on future tokens.

 But: are you doing any new token insertion?  Or are you just tweaking
 the attributes of the tokens that pass through the filter?  If it's
 the latter then this class may be overkill ... you could make a simple
 TokenFilter.incrementToken that just enumerates & saves all input
 tokens, does its processing, then returns those tokens one by one,
 instead.

 I'm not adding tokens yet, but I will be soon, so all of this isn't
 entirely crazy. The underlying capability here includes decompounding.
 (I have mixed feelings about just adding all the fragments to the
 token stream, as it can reduce precision, but there isn't an obvious
 alternative (except perhaps to suppress the super-common ones)).

 So, to summarize, logic might be:

 in incrementToken:

 If positions.getMaxPos() > -1, just return nextToken(). If not, loop
 calling peekToken to acquire a sentence, process the sentence, and
 attach the lemmas and compound-pieces to the Position subclass
 objects.

 in afterPosition, as each token comes 'into focus', splat the lemma
 from the Position into the char term attribute, and insert new tokens
 as needed for the compound components.

 Thanks,
 benson







 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: PositionLengthAttribute

2013-09-06 Thread Benson Margulies
On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir rcm...@gmail.com wrote:
 its the latter. the way its designed to work i think is illustrated
 best in kuromoji analyzer where it heuristically decompounds nouns:

 if it decompounds ABCD into AB + CD, then the tokens are AB and CD.
 these both have posinc=1.
 however (to compensate for precision issue you mentioned on the other
 thread), it keeps the full compound as a synonym too (there are some
 papers benchmarking this approach for decompounding, just think of IDF
 etc sorting things out).
 so that ABCD synonym has position increment 0, and it sits at the
 same position as the first token (AB). but it has positionLength=2,
 which basically keeps the information in the chain that this synonym
 spans across both AB and CD.

 so the output is like this: AB(posinc=1,posLength=1),
 ABCD(posinc=0,posLength=2), CD(posinc=1, posLength=1)

I suppose this works best if you actually know the offsets of the
pieces. In disassembling German, this is not always straightforward.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



LookaheadTokenFilter

2013-09-05 Thread Benson Margulies
This useful-looking item is in the test-framework jar. Is there some subtle
reason that it isn't in the common analyzer jar? Some reason why I'd regret
using it?


LookaheadTokenFilter

2013-09-05 Thread Benson Margulies
I'm trying to work through the logic of reading ahead until I've seen a
marker for the end of a sentence, then applying some analysis to all of the
tokens of the sentence, and then changing some attributes of each token to
reflect the results.

The queue of tokens for a position is just a State, so there isn't an API
there to set any values.

So do I need to subclass Position for myself, store the additional
information in there, and set the attributes as each token comes by on the
output side?

I would be grateful for a bit more explanation of afterPosition versus
incrementToken; some of the mock classes call peek from afterPosition, and
I expected to see peek called in incrementToken based on the javadoc.
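One way to get this behavior without LookaheadTokenFilter is a plain buffering TokenFilter built
on captureState/restoreState. A rough sketch, where isSentenceEnd() and processSentence() are
placeholders for the boundary test and whatever per-sentence analysis is wanted:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

public abstract class SentenceBufferingFilter extends TokenFilter {

    private final Deque<AttributeSource.State> pending = new ArrayDeque<AttributeSource.State>();

    protected SentenceBufferingFilter(TokenStream input) {
        super(input);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (pending.isEmpty()) {
            // Read ahead to the end of the sentence, capturing each token's state.
            List<AttributeSource.State> sentence = new ArrayList<AttributeSource.State>();
            while (input.incrementToken()) {
                sentence.add(captureState());
                if (isSentenceEnd()) {
                    break;
                }
            }
            if (sentence.isEmpty()) {
                return false;              // upstream is exhausted
            }
            // Whole-sentence analysis: restore a state, edit attributes, capture it again.
            processSentence(sentence);
            pending.addAll(sentence);
        }
        restoreState(pending.removeFirst());   // replay, possibly with rewritten attributes
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }

    /** True when the token just read closes a sentence (tokenizer-specific). */
    protected abstract boolean isSentenceEnd();

    /** Inspect or rewrite the captured states for one sentence. */
    protected abstract void processSentence(List<AttributeSource.State> sentence) throws IOException;
}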


Re: Issue with documentation for org.apache.lucene.analysis.synonym.SynonymMap.Builder.add() method

2012-09-06 Thread Benson Margulies
On Thu, Sep 6, 2012 at 1:59 PM, Robert Muir rcm...@gmail.com wrote:

 Thanks for reporting this Mark.

 I think it was not intended to have actual null characters here (or
 probably anywhere in javadocs).

 Our javadocs checkers should be failing on stuff like this...

 On Thu, Sep 6, 2012 at 1:52 PM, Mark Parker godef...@gmail.com wrote:
  I'm building documentation from the Lucene 4.0.0-BETA source (though
  this was also an issue with the ALPHA source), and the output has null
  characters in it. I believe that this is because the source looks like
  this:
 
  /**
   * Add a phrase-phrase synonym mapping.
   * Phrases are character sequences where words are
   * separated with character zero (\u0000).  Empty words
   * (two \u0000s in a row) are not allowed in the input nor
   * the output!
   *
   * @param input input phrase
   * @param output output phrase
   * @param includeOrig true if the original should be included
   */
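A usage sketch of the method that comment documents, for reference (4.x API; the phrase and the
dedup flag are illustrative):

import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class SynonymSketch {
    public static SynonymMap buildExample() throws Exception {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);   // dedup = true
        // A "phrase" is simply words joined with the \u0000 separator described above.
        builder.add(new CharsRef("dns"),
                new CharsRef("domain\u0000name\u0000system"),
                true);                                               // includeOrig
        return builder.build();
    }
}

The resulting map is what gets handed to a SynonymFilter in the analysis chain.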
 
  These \u0000 characters are converted to null (\0) characters in the
  output, which are invalid in XML (I'm outputting XML). Indeed, this is
  a problem in the built documentation at the Apache Lucene site
  (
 http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/synonym/SynonymMap.Builder.html
 )
  where the documentation looks like this (in my browser):
 


Converted to U+0000 by what, I wonder? Javadoc shouldn't be doing that. If
it does, I wonder if we need \\u0000 instead?


  Add a phrase-phrase synonym mapping. Phrases are character sequences
  where words are separated with character zero (). Empty words (two s
  in a row) are not allowed in the input nor the output!
 
  The actual HTML file does have null characters at the two locations,
  which may be technically correct, but not very helpful. I believe the
  \u0000 in the source ought to be escaped in some way, so that
  something more meaningful than \0 ends up in the output. I'd submit a
  patch, just for the prestige of it, but I don't have the slightest
  idea what the change should be, not being a Java guy at all.
 
  For those interested in why I'm messing with this, then, I'm using
  IKVM to convert the Java Lucene libraries to .NET assemblies (well,
  one assembly) and converting the javadoc comments to XML documentation
  for good IntelliSense in Visual Studio. It works wonderfully, and we
  use it in very successful commercial software!
 
  Note that I'm not subscribed to the list, so please CC me if there are
  questions.
 
  Mark
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 



 --
 lucidworks.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Payload class

2012-08-29 Thread Benson Margulies
I'm failing to find advice in MIGRATE.txt on how to replace 'new
Payload(...)' in migrating to 4.0.  What am I missing?


ResourceLoader?

2012-08-29 Thread Benson Margulies
Our Solr 3.x code used init(ResourceLoader) and then called the loader to
read a file.

What's the new approach to reading content from files in the 'usual place'?


Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
That's what I meant, thanks.

On Wed, Aug 29, 2012 at 10:20 AM, Robert Muir rcm...@gmail.com wrote:

 On Wed, Aug 29, 2012 at 10:10 AM, Benson Margulies ben...@basistech.com
 wrote:
  Our Solr 3.x code used init(ResourceLoader) and then called the loader to
  read a file.
 
  What's the new approach to reading content from files in the 'usual
 place'?

 I'm not aware of init(ResourceLoader), only inform(ResourceLoader). is
 that what you meant?

 I added some javadocs on the lifecycle of these factories the other
 day (please review, possible doc bugs!):

 https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html

 Here are some examples:

 Parses a tab-separated file (using getLines: UTF-8):

 http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilterFactory.java

 Parses a file of its own format (using specified encoding):

 http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizerFactory.java

 --
 lucidworks.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
I'm confused. Isn't inform/ResourceLoader deprecated? But your example use
it?


On Wed, Aug 29, 2012 at 10:20 AM, Robert Muir rcm...@gmail.com wrote:

 On Wed, Aug 29, 2012 at 10:10 AM, Benson Margulies ben...@basistech.com
 wrote:
  Our Solr 3.x code used init(ResourceLoader) and then called the loader to
  read a file.
 
  What's the new approach to reading content from files in the 'usual
 place'?

 I'm not aware of init(ResourceLoader), only inform(ResourceLoader). is
 that what you meant?

 I added some javadocs on the lifecycle of these factories the other
 day (please review, possible doc bugs!):

 https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html

 Here are some examples:

 Parses a tab-separated file (using getLines: UTF-8):

 http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilterFactory.java

 Parses a file of its own format (using specified encoding):

 http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizerFactory.java

 --
 lucidworks.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Using a char filter in solr createComponents

2012-08-29 Thread Benson Margulies
I'm close to the bottom of my list here.

I've got an Analyzer that, in 3.1, set up a CharFilter in the tokenStream
method. So now I have to migrate that to createComponents. Can someone give
me a shove in the right direction?
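In 4.x the usual shape is to put the CharFilter in Analyzer.initReader and keep createComponents
for the tokenizer and token filters. A sketch (HTMLStripCharFilter and the standard/lowercase
chain are stand-ins for the real components):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class CharFilteringAnalyzer extends Analyzer {
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // CharFilters wrap the Reader before the tokenizer ever sees it.
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
        return new TokenStreamComponents(source, new LowerCaseFilter(Version.LUCENE_40, source));
    }
}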


Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir rcm...@gmail.com wrote:

 On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies ben...@basistech.com
 wrote:
  I'm confused. Isn't inform/ResourceLoader deprecated? But your example
 use
  it?
 

 Where is it deprecated? What does the deprecation message say?


I see. It moved from one package to another. Sorry for the noise.


 --
 lucidworks.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
Hang on:

[deprecation] org.apache.solr.util.plugin.ResourceLoaderAware in
org.apache.solr.util.plugin has been deprecated



On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir rcm...@gmail.com wrote:

 On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies ben...@basistech.com
 wrote:
  I'm confused. Isn't inform/ResourceLoader deprecated? But your example
 use
  it?
 

 Where is it deprecated? What does the deprecation message say?

 --
 lucidworks.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 10:42 AM, Robert Muir rcm...@gmail.com wrote:

 Right and what does the @deprecated message say :)


Yes, indeed, sorry. I got caught in a maze of twisty passages and my brain
turned off. I'm better now.



 On Wed, Aug 29, 2012 at 10:40 AM, Benson Margulies ben...@basistech.com
 wrote:
  Hang on:
 
  [deprecation] org.apache.solr.util.plugin.ResourceLoaderAware in
  org.apache.solr.util.plugin has been deprecated
 
 
 
  On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir rcm...@gmail.com wrote:
 
  On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies 
 ben...@basistech.com
  wrote:
   I'm confused. Isn't inform/ResourceLoader deprecated? But your example
  use
   it?
  
 
  Where is it deprecated? What does the deprecation message say?
 
  --
  lucidworks.com
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 



 --
 lucidworks.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
I've read the javadoc through a few times, but I confess that I'm still
feeling dense.

Are all tokenizers responsible for implementing some way of retaining the
contents of their reader, so that a call to reset without a call to
setReader rewinds? I note that CharTokenizer doesn't implement #reset,
which leads me to suspect that I'm not responsible for the rewind behavior.


Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 3:37 PM, Robert Muir rcm...@gmail.com wrote:

 OK, let's help improve it: I think these have likely always been confusing.

 Before, they were both reset: reset() and reset(Reader), even though
 they are unrelated. I thought the rename would help this :)

 Does the TokenStream workflow here help?

 http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
 Basically reset() is a mandatory thing the consumer must call. It just
 means 'reset any mutable state so you can be reused for processing
 again'.


I really did read this. setReader I get; I don't understand what reset
accomplishes. What does it mean to reuse a TokenStream without calling
setReader to supply a new input? If it means reusing the old input, who does
the rewinding?
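
For reference, the consumer side of that workflow looks roughly like the
sketch below (an illustration assuming the standard attribute API, not code
copied from the javadoc): the consumer calls reset() once per stream, loops
on incrementToken(), then calls end() and close().

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ConsumerWorkflowExample {
      // Prints every term an analyzer produces for one piece of text.
      static void dumpTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                      // mandatory before the first incrementToken()
        while (ts.incrementToken()) {
          System.out.println(term.toString());
        }
        ts.end();                        // records end-of-stream state (final offset etc.)
        ts.close();
      }
    }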





 This is something on any TokenStream: Tokenizers, TokenFilters, or
 even some direct descendent you make that parses byte arrays, or
 whatever.

 This means if you are keeping some state across tokens (like
 stopfilter's #skippedTokens), here is where you would set that = 0
 again.

 setReader(Reader) is only on Tokenizer, it means replace the Reader
 with a different one to be processed.
 The fact that CharTokenizer is doing 'reset()-like-stuff' in here is
 bogus IMO, but I don't think it will cause any bugs. Don't emulate it
 :)

 On Wed, Aug 29, 2012 at 3:29 PM, Benson Margulies ben...@basistech.com
 wrote:
  I've read the javadoc through a few times, but I confess that I'm still
  feeling dense.
 
  Are all tokenizers responsible for implementing some way of retaining the
  contents of their reader, so that a call to reset without a call to
  setReader rewinds? I note that CharTokenizer doesn't implement #reset,
  which leads me to suspect that I'm not responsible for the rewind
 behavior.



 --
 lucidworks.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
 Some interlinear commentary on the doc.

* Resets this stream to the beginning.

To me this implies a rewind.  As previously noted, I don't see how this
works for the existing implementations.

   * As all TokenStreams must be reusable,
   * any implementations which have state that needs to be reset between
usages
   * of the TokenStream, must implement this method. Note that if your
TokenStream
   * caches tokens and feeds them back again after a reset,

What's the alternative? What happens with all the existing Tokenizers that
have no special implementation of #reset()?

   * it is imperative
   * that you clone the tokens when you store them away (on the first pass)
as
   * well as when you return them (on future passes after {@link #reset()}).


Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
I think I'm beginning to get the idea. Is the following plausible?

At the bottom of the stack, there's an actual source of data -- like a
tokenizer. For one of those, reset() is a bit silly, and something like
setReader is the brains of the operation.

Some number of other components may be stacked up on top of the source of
data, and these may have local state. Calling #reset prepares them for new
data to emerge from the actual source of data.
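
A toy filter illustrating that division of labor (a sketch, not from the
thread): the per-stream state lives in the filter, and reset() puts it back
to its initial value so the same instance can be reused on the next stream.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Keeps only the first ten tokens of each stream; the counter is the
    // kind of mutable state that must be cleared in reset().
    public final class FirstTenTokensFilter extends TokenFilter {
      private int seen;

      public FirstTenTokensFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (seen >= 10) {
          return false;                 // drop everything after the first ten tokens
        }
        if (input.incrementToken()) {
          seen++;
          return true;
        }
        return false;
      }

      @Override
      public void reset() throws IOException {
        super.reset();                  // let the rest of the chain reset too
        seen = 0;                       // per-stream state back to its initial value
      }
    }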


Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
If I'm following, you've created a division of labor between setReader and
reset.

We have a tokenizer that has a good deal of state, since it has to split
the input into chunks. If I'm following here, you'd recommend that we do
nothing special in setReader, but have #reset fix up all the state on the
assumption that we are are starting from the beginning of something, and
we'd reinitialize our chunker over what was sitting in the protected
'input'. If someone called #setReader and neglected to call #reset, awful
things would happen, but you've warned them.

To me, it seemed natural to overload #setReader so that our tokenizer was
in a consistent state once it was called. It occurs to me to wonder about
order: if #reset is called before #setReader, I'm up the creek unless I copy my
reset implementation into a local override of #setReader.
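
A toy version of that arrangement (a sketch assuming the 4.x-era Tokenizer
contract, not the real chunking tokenizer): setReader is left alone, and
reset() rebuilds all per-stream state from the protected 'input' field, so
the reader-swapping order above stops mattering. The fixed 5-character chunks
stand in for whatever chunking the real tokenizer does.

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class FixedChunkTokenizer extends Tokenizer {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private String buffered;   // whole input, (re)read in reset()
      private int pos;           // start of the next chunk

      public FixedChunkTokenizer(Reader reader) {
        super(reader);
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        // Rebuild the chunking state from whatever 'input' currently is.
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = input.read(buf)) != -1) {
          sb.append(buf, 0, n);
        }
        buffered = sb.toString();
        pos = 0;
      }

      @Override
      public boolean incrementToken() {
        if (buffered == null || pos >= buffered.length()) {
          return false;
        }
        clearAttributes();
        int end = Math.min(pos + 5, buffered.length());
        termAtt.setEmpty().append(buffered, pos, end);   // emit a 5-char chunk as one token
        pos = end;
        return true;
      }
    }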


Re: DisjunctionMaxQuery and scoring

2012-04-20 Thread Benson Margulies
Uwe and Robert,

Thanks. David and I are two peas in one pod here at Basis.

--benson

On Fri, Apr 20, 2012 at 2:33 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 Ah sorry, I misunderstood, you wanted to score the duplicate match lower! To
 achieve this, you have to change the coord function in your
 similarity/BooleanWeight used for this query.

 Either way: If you want a group of terms that get only one score if at least
 one of the terms match (SQL IN), but not add them at all,
 DisjunctionMaxQuery is fine. I think this is what Benson asked for.

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: Friday, April 20, 2012 8:16 AM
 To: java-user@lucene.apache.org; david_murgatr...@hotmail.com
 Subject: RE: DisjunctionMaxQuery and scoring

 Hi,
  I think
   BooleanQuery bq = new BooleanQuery(false); doesn't quite accomplish
  the desired name IN (dick, rich) scoring behavior. This is because
 (name:dick |
  name:rich) with coord=false would score the 'document' Dick Rich
  higher than Rich because the former has two term matches and the
  latter only
 one.
  In contrast, I think the desire is that one and only one of the terms
  in
 the
  document match those in the BooleanQuery so that Rich would score
  higher than Dick Rich, given document length normalization. It's
  almost like a
 desire
  for BooleanQuery bq = new BooleanQuery(false);
    bq.set*Maximum*NumberShouldMatch(1);

 In that case DisjunctionMaxQuery is the way to go (it will only count the
 hit with the
 highest score and not add scores; coord or not coord doesn't matter here).


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
I am trying to solve a problem using DisjunctionMaxQuery.


Consider a query like:

a:b OR c:d OR e:f OR ...
name:richard OR name:dick OR name:dickie OR name:rich ...

At most, one of the richard names matches. So the match score gets
dragged down by the long list of things that don't match, as the list
can get quite long.

It seemed to me, upon reading the documentation, that I could cure
this problem by creating a query tree that used DisjunctionMaxQuery
around all those nicknames. However, when I built a boolean query that
had, as a clause, a DisjunctionMaxQuery in the place of a pile of
these individual Term queries, the score and the explanation did not
change at all -- in particular, the coord term shows the same number
of total terms. So it looks as if the children of the disjunction
still count.

Is there a way to control that term? Or a better way to express this?
Thinking SQL for a moment, what I'm trying to express is

   name IN (richard, dick, dickie, rich)

as a single term query. Reading the javadoc, I am seeing
MultiTermQuery, and I'm not sure that it is what we want.
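
In code, the structure being described is roughly the sketch below (an
illustration against the 3.x/4.x query API; the field names come from the
example above, and the tie-breaker value is an arbitrary choice):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.DisjunctionMaxQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class NicknameQueryExample {
      // All nickname variants go into one DisjunctionMaxQuery, so only the
      // best-matching variant contributes its score to the enclosing clause.
      static Query build() {
        DisjunctionMaxQuery names = new DisjunctionMaxQuery(0.0f);  // tie-breaker 0 = pure max
        names.add(new TermQuery(new Term("name", "richard")));
        names.add(new TermQuery(new Term("name", "dick")));
        names.add(new TermQuery(new Term("name", "dickie")));
        names.add(new TermQuery(new Term("name", "rich")));

        BooleanQuery top = new BooleanQuery();
        top.add(new TermQuery(new Term("a", "b")), Occur.SHOULD);
        top.add(new TermQuery(new Term("c", "d")), Occur.SHOULD);
        top.add(names, Occur.SHOULD);
        return top;
      }
    }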

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 I am trying to solve a problem using DisjunctionMaxQuery.


 Consider a query like:

 a:b OR c:d OR e:f OR ...
 name:richard OR name:dick OR name:dickie OR name:rich ...

 At most, one of the richard names matches. So the match score gets
 dragged down by the long list of things that don't match, as the list
 can get quite long.

 It seemed to me, upon reading the documentation, that I could cure
 this problem by creating a query tree that used DisjunctionMaxQuery
 around all those nicknames. However, when I built a boolean query that
 had, as a clause, a DisjunctionMaxQuery in the place of a pile of
 these individual Term queries, the score and the explanation did not
 change at all -- in particular, the coord term shows the same number
 of total terms. So it looks as if the children of the disjunction
 still count.

 Is there a way to control that term? Or a better way to express this?
 Thinking SQL for a moment, what I'm trying to express is

   name IN (richard, dick, dickie, rich)


 I think you just want to disable coord() here? You can do this for
 that particular boolean query by passing true to the ctor:

  public BooleanQuery(boolean disableCoord)

Rob,

How do nested queries work with respect to this? If I build a boolean
query one of whose clauses is a BooleanQuery with coord turned off,
does just the nested query insides get left out of 'coord'?

If so, then your answer certainly seems to be what the doctor ordered.

--benson



 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
Turning on disableCoord for a nested boolean query does not seem to
change the overall maxCoord term as displayed in explain.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 I am trying to solve a problem using DisjunctionMaxQuery.


 Consider a query like:

 a:b OR c:d OR e:f OR ...
 name:richard OR name:dick OR name:dickie OR name:rich ...

 At most, one of the richard names matches. So the match score gets
 dragged down by the long list of things that don't match, as the list
 can get quite long.

 It seemed to me, upon reading the documentation, that I could cure
 this problem by creating a query tree that used DisjunctionMaxQuery
 around all those nicknames. However, when I built a boolean query that
 had, as a clause, a DisjunctionMaxQuery in the place of a pile of
 these individual Term queries, the score and the explanation did not
 change at all -- in particular, the coord term shows the same number
 of total terms. So it looks as if the children of the disjunction
 still count.

 Is there a way to control that term? Or a better way to express this?
 Thinking SQL for a moment, what I'm trying to express is

   name IN (richard, dick, dickie, rich)


 I think you just want to disable coord() here? You can do this for
 that particular boolean query by passing true to the ctor:

  public BooleanQuery(boolean disableCoord)

 Rob,

 How do nested queries work with respect to this? If I build a boolean
 query one of whose clauses is a BooleanQuery with coord turned off,
 does just the nested query insides get left out of 'coord'?

 If so, then your answer certainly seems to be what the doctor ordered.


 it applies only to that query itself. So if this BQ is a clause to
 another BQ that has coord enabled,
 that would not change the top-level BQ's coord.

 Note: if you don't want coord at all, then you can also plug in a
 Similarity that returns 1,
 or pick another Similarity like BM25: in trunk only the vector space
 impl even does anything for coord()

Robert, I'm sorry that my density is approaching lead. My problem is
that I want coord, but I want to control which terms are counted and
which are not. I suppose I can accomplish this with my own scorer. My
hope was that there was a way to express This group of terms counts
as one for coord.

In other words, for a subset of fields in the query, I want to scale
the entire score by the fraction of them that match.

Another way to think about this, which might be no use at all, is to
wonder: is there a way to charge a score penalty for failure to match
a particular query term? That would, from another direction, address
the underlying effect I'm trying to get.





 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 5:10 PM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 19, 2012 at 5:05 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 I am trying to solve a problem using DisjunctionMaxQuery.


 Consider a query like:

 a:b OR c:d OR e:f OR ...
 name:richard OR name:dick OR name:dickie OR name:rich ...

 At most, one of the richard names matches. So the match score gets
 dragged down by the long list of things that don't match, as the list
 can get quite long.

 It seemed to me, upon reading the documentation, that I could cure
 this problem by creating a query tree that used DisjunctionMaxQuery
 around all those nicknames. However, when I built a boolean query that
 had, as a clause, a DisjunctionMaxQuery in the place of a pile of
 these individual Term queries, the score and the explanation did not
 change at all -- in particular, the coord term shows the same number
 of total terms. So it looks as if the children of the disjunction
 still count.

 Is there a way to control that term? Or a better way to express this?
 Thinking SQL for a moment, what I'm trying to express is

   name IN (richard, dick, dickie, rich)


 I think you just want to disable coord() here? You can do this for
 that particular boolean query by passing true to the ctor:

  public BooleanQuery(boolean disableCoord)

 Rob,

 How do nested queries work with respect to this? If I build a boolean
 query one of whose clauses is a BooleanQuery with coord turned off,
 does just the nested query insides get left out of 'coord'?

 If so, then your answer certainly seems to be what the doctor ordered.


 it applies only to that query itself. So if this BQ is a clause to
 another BQ that has coord enabled,
 that would not change the top-level BQ's coord.

 Note: if you don't want coord at all, then you can also plug in a
 Similarity that returns 1,
 or pick another Similarity like BM25: in trunk only the vector space
 impl even does anything for coord()

 Robert, I'm sorry that my density is approaching lead. My problem is
 that I want coord, but I want to control which terms are counted and
 which are not. I suppose I can accomplish this with my own scorer. My
 hope was that there was a way to express This group of terms counts
 as one for coord.

 So just structure your boolean query appropriately?

 BQ1(coord=true)
  BQ2(coord=false): 25 terms
  BQ3(coord=false): 87 terms

 BQ1's coord is based on how many subscorers match (out of 2, BQ2 and
 BQ3). If both match its 2/2 otherwise 1/2.

 But in this example BQ2 and BQ3 disable coord themselves, hiding the
 fact they accept 25 and 87 terms respectively and appearing as a
 single sub for coord().

 Does this make sense? you can extend this idea to control this however
 you want by structuring the BQ appropriately so your BQ's with
 synonyms have coord=0
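
In code, the structure Robert describes looks roughly like the sketch below
(an illustration using the BooleanQuery(boolean disableCoord) constructor
mentioned earlier; the field names and terms are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class CoordGroupingExample {
      // Each synonym group is a coord-disabled sub-query, so the outer
      // (coord-enabled) query sees it as a single clause: matching any of
      // the variants counts as "1 of N" for the top-level coord.
      static Query nameGroup(String... variants) {
        BooleanQuery group = new BooleanQuery(true);   // true = disable coord inside the group
        for (String v : variants) {
          group.add(new TermQuery(new Term("name", v)), Occur.SHOULD);
        }
        return group;
      }

      static Query build() {
        BooleanQuery top = new BooleanQuery();         // coord stays enabled at the top level
        top.add(new TermQuery(new Term("title", "engineer")), Occur.SHOULD);
        top.add(nameGroup("richard", "dick", "dickie", "rich"), Occur.SHOULD);
        return top;
      }
    }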

Robert,

This makes perfect sense; it is what I thought you meant to begin
with. I tried it and thought that it did not work. Or, perhaps, I am
misreading the 'explain' output. Or, more likely, I goofed altogether.
I'll go back and recheck my results and post some explain output if I
can't find my mistake.

--benson





 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
I see why I'm so confused, but I think I need to construct a simpler test case.

My top-level BooleanQuery, which has disableCoord=false, has 22
clauses. All but three are ordinary SHOULD TermQueries. The remainder
are a spanNear and a nested BooleanQuery, and an empty PhraseQuery
(that's a bug).

However, at the end of the explain trace, I see:

0.45 = coord(9/20)

I think that my nested Boolean, for which I've been
flipping coord on and off to see what happens, is somehow not
participating at all. So switching its coord on and off has no
effect.

Why 20? Why not 22? Is this just an explain quirk? Should I shove all
this code up to 3.6 from 2.9.3 before bugging you further?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
FWIW, there seems to be an explain bug in 2.9.1 that is fixed in
3.6.0, so I'm no longer confused about the actual behavior.


On Thu, Apr 19, 2012 at 8:32 PM, David Murgatroyd dmu...@gmail.com wrote:
 [apologies for the earlier errant send]

 I think
  BooleanQuery bq = new BooleanQuery(false);
 doesn't quite accomplish the desired name IN (dick, rich) scoring
 behavior. This is because (name:dick | name:rich) with coord=false would
 score the 'document' Dick Rich higher than Rich because the former has
 two term matches and the latter only one. In contrast, I think the desire
 is that one and only one of the terms in the document match those in the
 BooleanQuery so that Rich would score higher than Dick Rich, given
 document length normalization. It's almost like a desire for
 BooleanQuery bq = new BooleanQuery(false);
  bq.set*Maximum*NumberShouldMatch(1);

 Is there a good way to accomplish this?

 On Thu, Apr 19, 2012 at 7:37 PM, Robert Muir rcm...@gmail.com wrote:

 On Thu, Apr 19, 2012 at 6:36 PM, Benson Margulies bimargul...@gmail.com
 wrote:
  I see why I'm so confused, but I think I need to construct a simpler
 test case.
 
  My top-level BooleanQuery, which has disableCoord=false, has 22
  clauses. All but three are ordinary SHOULD TermQueries. the remainder
  are a spanNear and a nested BooleanQuery, and an empty PhraseQuery
  (that's a bug).
 
  However, at the end of the explain trace, I see:
 
  0.45 = coord(9/20) I think that my nested Boolean, for which I've been
  flipping coord on and off to see what happens, is somehow not
  participating at all. So switching it's coord on and off has no
  effect.
 
  Why 20? Why not 22? Is this just an explain quirk?

 I am not sure (also not sure I understand your example totally), but
 at the same time it could be as simple as the fact you have 2 prohibited
 (MUST_NOT) clauses. These don't count towards coord().

 I think it's hard to tell from your description (just since it doesn't
 have all the details). An explain or test case or something like that
 might be more efficient if it's still not making sense...

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Repeatability of results

2012-04-02 Thread Benson Margulies
We've observed something that, in some ways, is not surprising.

If you take a set of documents that are close in 'score' to some query,

 and shuffle them in different orders

 and then see what results you get in what order from the reference query,

the scores will vary according to the insertion order.

I can't see any way to argue that it's wrong, but we find it
inconvenient when we are testing something and we want to multithread
the test to speed it up, thus making the insertion order
nondeterministic.

It occurred to me that perhaps you all have some similar concerns in
testing lucene itself, and might have some advice about how to get
around it, thus this email.

We currently observe this with 2.9.1 and 3.5.0.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Repeatability of results

2012-04-02 Thread Benson Margulies
On Mon, Apr 2, 2012 at 5:33 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 Hmm that's odd.

 If the scores were identical I'd expect different sort order, since we
 tie-break by internal docID.

 But if the scores are different... the insertion order shouldn't
 matter.  And, the score should not change as a function of insertion
 order...

Well, I assumed that TF-IDF would wiggle.


 Do you have a small test case?

Since this surprises you, I will build a test case.



 Mike McCandless

 http://blog.mikemccandless.com

 On Mon, Apr 2, 2012 at 5:28 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 We've observed something that, in some ways, is not surprising.

 If you take a set of documents that are close in 'score' to some query,

  and shuffle them in different orders

  and then see what results you get in what order from the reference query,

 the scores will vary according to the insertion order.

 I can't see any way to argue that it's wrong, but we find it
 inconvenient when we are testing something and we want to multithread
 the test to speed it up, thus making the insertion order
 nondeterministic.

 It occurred to me that perhaps you all have some similar concerns in
 testing lucene itself, and might have some advice about how to get
 around it, thus this email.

 We currently observe this with 2.9.1 and 3.5.0.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Benson Margulies
fileformat.info

On Mar 30, 2012, at 1:04 PM, Denis Brodeur denisbrod...@gmail.com wrote:

 Thanks Robert.  That makes sense.  Do you have a link handy where I can
 find this information? i.e. word boundary/punctuation for any unicode
 character set?

 On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir rcm...@gmail.com wrote:

 On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur denisbrod...@gmail.com
 wrote:
 Hello, I'm currently working out some problems when searching for Tibetan
 Characters.  More specifically: /u0f10-/u0f19.  We are using the

 unicode doesn't consider most of these characters part of a word: most
 are punctuation and symbols
 (except 0f18 and 0f19 which are combining characters that combine with
 digits).

 for example 0f14 is a text delimiter.

 in general StandardTokenizer discards punctuation and is geared at
 word boundaries, just like
 you would have trouble searching on characters like '(', etc. in
 English. So I think it's totally expected.

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
I've posted a self-contained test case to github of a mystery.

git://github.com/bimargulies/lucene-4-update-case.git

The code can be seen at
https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java.

I write a doc to an index, close the index, then reopen and do a
delete/add on the doc to add a field. If I iterate the docs in the
index, all looks well, but when I try to query for the doc, it isn't
found.

To be a bit more specific, the doc has a field field1 which is a
StringField.TYPE_STORED, and it is a query on that field which comes
up empty.

I expect to learn that I've missed something obvious, and I offer
thanks and apologies in advance.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt
appears to be missing one critical hint. If you have existing code
that called IndexReader.terms(), where do you start to get a
FieldsEnum?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler u...@thetaphi.de wrote:
 AtomicReader.fields()

I went and read up on AtomicReader in CHANGES.txt. Should I call
SegmentReader.getReader(IOContext)?

I just posted a patch to CHANGES.txt to clarify before I read your
email; shall I improve it to use this instead of

MultiFields.getFields(indexReader).iterator();

which I came up with by fishing around for myself?


 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Tuesday, March 06, 2012 2:50 PM
 To: java-user@lucene.apache.org
 Subject: A little more CHANGES.txt help on terms(), please

 Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt appears
 to be missing one critical hint. If you have existing code that called
 IndexReader.terms(), where do you start to get a FieldsEnum?

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:09 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 I think MIGRATE.txt talks about this?

Yes it does, but it doesn't actually answer the specific question. See
LUCENE-3853 where I added what seems to be missing. If it's somewhere
else in the file I apologize.


 Mike McCandless

 http://blog.mikemccandless.com

 On Tue, Mar 6, 2012 at 8:50 AM, Benson Margulies bimargul...@gmail.com 
 wrote:
 Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt
 appears to be missing one critical hint. If you have existing code
 that called IndexReader.terms(), where do you start to get a
 FieldsEnum?

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Oh, I see, I didn't read far enough down. Well, the patch still
repairs a bug in the code fragment relative to the Term enumeration.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Oh, ouch, there's no SegmentReader.getReader, I was reading IndexWriter. Sorry.

On Tue, Mar 6, 2012 at 9:14 AM, Benson Margulies bimargul...@gmail.com wrote:
 On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler u...@thetaphi.de wrote:
 AtomicReader.fields()

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote:
 I think the issue is that your analyzer is standardanalyzer, yet field
 text value is value-1

Robert,

Why is this field analyzed at all? It's built with StringField.TYPE_STORED.

I'll push another copy that shows that it works fine when the doc is
first added, and gets bad after the 'update', when the field acquires
the 'tokenized' boolean mysteriously.

--benson



 So standardanalyzer will tokenize this into two terms: value and 1

 But later, you proceed to do TermQueries on value-1. This term won't
 exist... TermQuery etc that take Term don't analyze any text.

 Instead usually higher-level things like QueryParsers analyze text into Terms.

 On Tue, Mar 6, 2012 at 8:35 AM, Benson Margulies bimargul...@gmail.com 
 wrote:
 I've posted a self-contained test case to github of a mystery.

 git://github.com/bimargulies/lucene-4-update-case.git

 The code can be seen at
 https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java.

 I write a doc to an index, close the index, then reopen and do a
 delete/add on the doc to add a field. If I iterate the docs in the
 index, all looks well, but when I try to query for the doc, it isn't
 found.

 To be a bit more specific, the doc has a field field1 which is a
 StringField.TYPE_STORED, and it is a query on that field which comes
 up empty.

 I expect to learn that I've missed something obvious, and I offer
 thanks and apologies in advance.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies bimargul...@gmail.com wrote:
 On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote:
 I think the issue is that your analyzer is standardanalyzer, yet field
 text value is value-1

 Robert,

 Why is this field analyzed at all? It's built with StringField.TYPE_STORED.

 I'll push another copy that shows that it works fine when the doc is
 first added, and gets bad after the 'update', when the field acquires
 the 'tokenized' boolean mysteriously.

I pushed a new copy that runs the query successfully before the
'delete/add' sequence, and then fails afterwards.


 --benson



 So standardanalyzer will tokenize this into two terms: value and 1

 But later, you proceed to do TermQueries on value-1. This term won't
 exist... TermQuery etc that take Term don't analyze any text.

 Instead usually higher-level things like QueryParsers analyze text into 
 Terms.

 On Tue, Mar 6, 2012 at 8:35 AM, Benson Margulies bimargul...@gmail.com 
 wrote:
 I've posted a self-contained test case to github of a mystery.

 git://github.com/bimargulies/lucene-4-update-case.git

 The code can be seen at
 https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java.

 I write a doc to an index, close the index, then reopen and do a
 delete/add on the doc to add a field. If I iterate the docs in the
 index, all looks well, but when I try to query for the doc, it isn't
 found.

 To be a bit more specific, the doc has a field field1 which is a
 StringField.TYPE_STORED, and it is a query on that field which comes
 up empty.

 I expect to learn that I've missed something obvious, and I offer
 thanks and apologies in advance.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:34 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 MultiFields should only be used (as it is slow) if you exactly know what you 
 are doing and what the consequences are. There is a change in Lucene 4.0, so 
 you can no longer get terms and postings from a top-level (composite) reader.
 More info is also here: http://goo.gl/lMKTM

Uwe,

The 4.0 change is how I got here in the first place. Some code we have
here dumped all the terms using the old IndexReader.terms(), so I was
working on figuring out how to replace it. For my purposes, which are
a dev tool, I think that MultiFields will be fine.

--benson



 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Tuesday, March 06, 2012 3:15 PM
 To: java-user@lucene.apache.org
 Subject: Re: A little more CHANGES.txt help on terms(), please

 On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler u...@thetaphi.de wrote:
  AtomicReader.fields()

 I went and read up AtomicReader in CHANGES.txt. Should I call
 SegmentReader.getReader(IOContext)?

 I just posted a patch to CHANGES.txt to clarify before I read your email, 
 shall I
 improve it to use this instead of

     MultiFields.getFields(indexReader).iterator();

 which I came up with by fishing around for myself?

 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Benson Margulies [mailto:bimargul...@gmail.com]
  Sent: Tuesday, March 06, 2012 2:50 PM
  To: java-user@lucene.apache.org
  Subject: A little more CHANGES.txt help on terms(), please
 
  Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt
  appears to be missing one critical hint. If you have existing code
  that called IndexReader.terms(), where do you start to get a FieldsEnum?
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir rcm...@gmail.com wrote:
 On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies bimargul...@gmail.com 
 wrote:
 On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote:
 I think the issue is that your analyzer is standardanalyzer, yet field
 text value is value-1

 Robert,

 Why is this field analyzed at all? It's built with StringField.TYPE_STORED.


 thanks Benson, you are right!

So, should I attach this to a JIRA?

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:47 AM, Uwe Schindler u...@thetaphi.de wrote:
 StringField is analyzed, but with KeywordTokenizer, so all should be fine.

I filed LUCENE-3854.


 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, March 06, 2012 3:42 PM
 To: java-user@lucene.apache.org
 Subject: Re: Problem with updating a document or TermQuery with current
 trunk

 Hmm something is up here... I'll dig.  Seems like we are somehow analyzing
 StringField when we shouldn't...

 Mike McCandless

 http://blog.mikemccandless.com

 On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir rcm...@gmail.com wrote:
  On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies
 bimargul...@gmail.com wrote:
  On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote:
  I think the issue is that your analyzer is standardanalyzer, yet
  field text value is value-1
 
  Robert,
 
  Why is this field analyzed at all? It's built with
 StringField.TYPE_STORED.
 
 
  thanks Benson, you are right!
 
  --
  lucidimagination.com
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:46 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 The recommended way to get an atomic reader from a composite reader is to use 
 SlowCompositeReaderWrapper.wrap(reader). MultiFields is now purely internal. 
 I think it's only public because the codecs package may need it, otherwise it 
 should be pkg-private.

Oh! I'll rework the patch again, then. I might include some commentary
in MultiFields as well.
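
For a dev tool that just dumps every term, the 4.0-era replacement for the
old IndexReader.terms() loop looks roughly like the sketch below (an
illustration; it pays the SlowCompositeReaderWrapper cost that Uwe warns
about, which is acceptable for a one-off tool but not for hot paths):

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.SlowCompositeReaderWrapper;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    public class DumpAllTerms {
      static void dump(IndexReader reader) throws IOException {
        AtomicReader atomic = SlowCompositeReaderWrapper.wrap(reader);
        Fields fields = atomic.fields();
        if (fields == null) {
          return;                        // no postings at all
        }
        for (String field : fields) {    // Fields iterates field names
          Terms terms = fields.terms(field);
          if (terms == null) {
            continue;
          }
          TermsEnum termsEnum = terms.iterator(null);
          BytesRef term;
          while ((term = termsEnum.next()) != null) {
            System.out.println(field + " -> " + term.utf8ToString());
          }
        }
      }
    }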


 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Tuesday, March 06, 2012 3:40 PM
 To: java-user@lucene.apache.org
 Subject: Re: A little more CHANGES.txt help on terms(), please

 On Tue, Mar 6, 2012 at 9:34 AM, Uwe Schindler u...@thetaphi.de wrote:
  Hi,
 
  MultiFields should only be used (as it is slow) if you exactly know
  what you are doing and what the consequences are. There is a change in
  Lucene 4.0, so you can no longer terms and postings from a top-level
  (composite) reader. More info is also here: http://goo.gl/lMKTM

 Uwe,

 The 4.0 change is how I got here in the first place. Some code we have here
 dumped all the terms using the old IndexReader.terms(), so I was working on
 figuring out how to replace it. For my purposes, which are a dev tool, I 
 think
 that MultiFields will be fine.

 --benson


 
  Uwe
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Benson Margulies [mailto:bimargul...@gmail.com]
  Sent: Tuesday, March 06, 2012 3:15 PM
  To: java-user@lucene.apache.org
  Subject: Re: A little more CHANGES.txt help on terms(), please
 
  On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler u...@thetaphi.de wrote:
   AtomicReader.fields()
 
  I went and read up AtomicReader in CHANGES.txt. Should I call
  SegmentReader.getReader(IOContext)?
 
  I just posted a patch to CHANGES.txt to clarify before I read your
  email, shall I improve it to use this instead of
 
      MultiFields.getFields(indexReader).iterator();
 
  which I came up with by fishing around for myself?
 
  
   -
   Uwe Schindler
   H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de
   eMail: u...@thetaphi.de
  
  
   -Original Message-
   From: Benson Margulies [mailto:bimargul...@gmail.com]
   Sent: Tuesday, March 06, 2012 2:50 PM
   To: java-user@lucene.apache.org
   Subject: A little more CHANGES.txt help on terms(), please
  
   Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt
   appears to be missing one critical hint. If you have existing code
   that called IndexReader.terms(), where do you start to get a 
   FieldsEnum?
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 10:04 AM, Robert Muir rcm...@gmail.com wrote:
 Thanks Benson: looks like the problem revolves around indexing
 Document/Fields you get back from IR.document... this has always been
 'lossy', but I think this is a real API trap.

 Please keep testing :)

Got a suggestion for sneaking around this in the meantime?
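
One guess at an interim workaround (an assumption, not something confirmed in
this thread): rebuild the document with its original field types instead of
re-indexing the Document that came back from IndexReader.document(). The
field names mirror the test case; treating "field1" as the unique key is made
up for the example.

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class RebuildOnUpdate {
      // Recreates every field with its intended FieldType, then does an
      // atomic delete-by-term + add via updateDocument().
      static void addField(IndexWriter writer, String field1Value,
                           String newName, String newValue) throws IOException {
        Document doc = new Document();
        doc.add(new Field("field1", field1Value, StringField.TYPE_STORED));
        doc.add(new Field(newName, newValue, StringField.TYPE_STORED));
        writer.updateDocument(new Term("field1", field1Value), doc);
      }
    }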


 On Tue, Mar 6, 2012 at 9:58 AM, Benson Margulies bimargul...@gmail.com 
 wrote:
 On Tue, Mar 6, 2012 at 9:47 AM, Uwe Schindler u...@thetaphi.de wrote:
 String field is analyzed, but with KeywordTokenizer, so all should be fine.

 I filed LUCENE-3854.


 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, March 06, 2012 3:42 PM
 To: java-user@lucene.apache.org
 Subject: Re: Problem with updating a document or TermQuery with current
 trunk

 Hmm something is up here... I'll dig.  Seems like we are somehow analyzing
 StringField when we shouldn't...

 Mike McCandless

 http://blog.mikemccandless.com

 On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir rcm...@gmail.com wrote:
  On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies
 bimargul...@gmail.com wrote:
  On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote:
  I think the issue is that your analyzer is standardanalyzer, yet
  field text value is value-1
 
  Robert,
 
  Why is this field analyzed at all? It's built with
 StringField.TYPE_STORED.
 
 
  thanks Benson, you are right!
 
  --
  lucidimagination.com
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
Sorry, I'm coming up empty in Google here.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
To reduce noise slightly I'll stay on this thread.

I'm looking at this file, and not seeing a pointer to what to do about
QueryParser. Are jar file rearrangements supposed to be in that file?
I think that I don't have the right jar yet; all I'm seeing is the
'surround' package.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
OK, thanks.

On Mon, Mar 5, 2012 at 11:22 AM, Steven A Rowe sar...@syr.edu wrote:
 You want the lucene-queryparser jar.  From trunk MIGRATE.txt:

 * LUCENE-3283: Lucene's core o.a.l.queryParser QueryParsers have been 
 consolidated into module/queryparser,
  where other QueryParsers from the codebase will also be placed.  The 
 following classes were moved:
  - o.a.l.queryParser.CharStream - o.a.l.queryparser.classic.CharStream
  - o.a.l.queryParser.FastCharStream - 
 o.a.l.queryparser.classic.FastCharStream
  - o.a.l.queryParser.MultiFieldQueryParser - 
 o.a.l.queryparser.classic.MultiFieldQueryParser
  - o.a.l.queryParser.ParseException - 
 o.a.l.queryparser.classic.ParseException
  - o.a.l.queryParser.QueryParser - o.a.l.queryparser.classic.QueryParser
  - o.a.l.queryParser.QueryParserBase - 
 o.a.l.queryparser.classic.QueryParserBase
  - o.a.l.queryParser.QueryParserConstants - 
 o.a.l.queryparser.classic.QueryParserConstants
  - o.a.l.queryParser.QueryParserTokenManager - 
 o.a.l.queryparser.classic.QueryParserTokenManager
  - o.a.l.queryParser.QueryParserToken - o.a.l.queryparser.classic.Token
  - o.a.l.queryParser.QueryParserTokenMgrError - 
 o.a.l.queryparser.classic.TokenMgrError
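
 Once the lucene-queryparser jar is on the classpath, only the package name
 changes; usage stays the same. A minimal sketch, assuming the 4.0-era classic
 QueryParser API (the field name and analyzer choice are arbitrary):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;   // was o.a.l.queryParser.QueryParser
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class ClassicQueryParserExample {
      static Query parse(String userQuery) throws ParseException {
        QueryParser parser = new QueryParser(Version.LUCENE_40, "body",
            new StandardAnalyzer(Version.LUCENE_40));
        return parser.parse(userQuery);
      }
    }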


 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Monday, March 05, 2012 11:15 AM
 To: java-user@lucene.apache.org
 Subject: Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

 To reduce noise slightly I'll stay on this thread.

 I'm looking at this file, and not seeing a pointer to what to do about 
 QueryParser. Are jar file rearrangements supposed to be in that file?
 I think that I don't have the right jar yet; all I'm seeing is the 'surround' 
 package.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Updating a document.

2012-03-04 Thread Benson Margulies
I am walking down the documents in an index by number, and I find that
I want to update one. The updateDocument API only works on queries and
terms, not numbers.

So I can call remove and add, but, then, what's the document's number
after that? Or is that not a meaningful question until I make a new
reader?
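
A sketch of the usual pattern (assuming the application gives each document a
unique "id" StringField, which is an assumption, not something from this
thread): since internal document numbers are not stable across a delete/add
or a merge, updates are keyed on the id term instead.

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UpdateByIdExample {
      static void update(IndexWriter writer, String id, Document replacement) throws IOException {
        replacement.removeFields("id");                            // make sure the key field is current
        replacement.add(new Field("id", id, StringField.TYPE_STORED));
        writer.updateDocument(new Term("id", id), replacement);    // delete old doc(s) with this id, add the new one
      }
    }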

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


