Re: Problem with porter stemming
Stemming is an inherently limited process. It doesn't know about the word 'news'; it just has a rule about 's'. Some of us sell commercial products that do more complex linguistic processing and know which words are which. There may be open source implementations of similar technology. On Mon, Mar 14, 2016 at 12:13 PM, Ahmet Arslan wrote: > Hi Dwaipayan, > > Another way is to use KeywordMarkerFilter. Stemmer implementations respect > this attribute. > If you want to supply your own mappings, StemmerOverrideFilter could be > used as well. > > ahmet > > > On Monday, March 14, 2016 4:31 PM, Dwaipayan Roy > wrote: > > > > I am using EnglishAnalyzer with my own stopword list. EnglishAnalyzer uses > the Porter stemmer (Snowball) to stem the words. But using the > EnglishAnalyzer, I am getting an erroneous result for 'news': 'news' is > getting stemmed to 'new'. > > Any help would be appreciated.
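A minimal sketch of the KeywordMarkerFilter approach Ahmet describes, via EnglishAnalyzer's stem-exclusion constructor (which applies the keyword marking internally). The 5.x signatures without a Version argument are assumed; older 4.x releases take a leading Version parameter:

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;

// "news" is marked as a keyword, so the stemmer leaves it untouched.
CharArraySet stemExclusions = new CharArraySet(Arrays.asList("news"), true);
Analyzer analyzer = new EnglishAnalyzer(EnglishAnalyzer.getDefaultStopSet(), stemExclusions);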
Re: Text dependent analyzer
If you want tokenization to depend on sentences, and you insist on being inside Lucene, you have to be a Tokenizer. Your tokenizer can set an attribute on the token that ends a sentence. Then, downstream, filters can read ahead to get the full sentence and buffer tokens as needed. On Fri, Apr 17, 2015 at 1:00 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Hummel, There was an effort to bring OpenNLP capabilities to Lucene: https://issues.apache.org/jira/browse/LUCENE-2899 Lance was working on it to keep it up-to-date. But, it looks like it is not always best to accomplish all things inside Lucene. I personally would do the sentence detection outside of Lucene. By the way, I remember there was a way to consume all of the upstream token stream. I think it was consuming all input and injecting one concatenated huge term/token. KeywordTokenizer has similar behaviour. It injects a single token. http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html Ahmet On Wednesday, April 15, 2015 3:12 PM, Shay Hummel shay.hum...@gmail.com wrote: Hi Ahmet, Thank you for the reply. That's exactly what I am doing. At the moment, to index a document, I break it into sentences, and each sentence is analyzed (lemmatizing, stopword removal, etc.) Now, what I am looking for is a way to create an analyzer (a class which extends Lucene's Analyzer). This analyzer will be used for index and query processing. It (like the EnglishAnalyzer) will receive the text and produce tokens. The API of Analyzer requires implementing createComponents, which is not dependent on the text being analyzed. This fact is problematic since, as you know, the OpenNLP sentence breaking depends on the text it gets (OpenNLP uses the model files to provide spans of each sentence and then break them). Is there a way around it? Shay On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Hummel, You can perform sentence detection outside of Solr, using OpenNLP for instance, and then feed the sentences to Solr. https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect Ahmet On Tuesday, April 14, 2015 8:12 PM, Shay Hummel shay.hum...@gmail.com wrote: Hi I would like to create a text-dependent analyzer. That is, *given a string*, the analyzer will: 1. Read the entire text and break it into sentences. 2. Each sentence will then be tokenized, with possessive removal, lowercasing, term marking, and stemming. The second part is essentially what happens in the English analyzer (createComponents). However, it is not dependent on the text it receives - which is the first part of what I am trying to do. So ... How can it be achieved? Thank you, Shay Hummel
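A minimal sketch of the sentence-detection-outside-Lucene approach Ahmet recommends, using OpenNLP's SentenceDetectorME with a pre-trained model; the model filename and the documentText variable are assumptions:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

InputStream modelIn = new FileInputStream("en-sent.bin"); // pre-trained sentence model
SentenceModel model = new SentenceModel(modelIn);
modelIn.close();
SentenceDetectorME detector = new SentenceDetectorME(model);
String[] sentences = detector.sentDetect(documentText); // or sentPosDetect() for Spans
// analyze and index each sentence separately before handing it to Lucene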
Re: A codec moment or pickle
Based on reading the same comments you read, I'm pretty doubtful that Codec.getDefault() is going to work. It seems to me that this situation renders FilterCodec a bit hard to use, at least given the 'every release deprecates a codec' sort of pattern. On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, How about Codec.getDefault()? It does indeed not necessarily return the newest one (if somebody changes the default using Codec.setDefault()), but for your use case, wrapping the current default one, it should be fine. I have not tried this yet, but there might be a chicken-and-egg problem: - Your codec will have a separate name and be listed in META-INF as a service (I assume this). So it gets discovered by the Codec discovery process and is instantiated by that. - On loading the Codec framework, the call to Codec.getDefault() might get in at a time when the codecs are not yet fully initialized (because it will instantiate your codec while loading the META-INF). This happens before the Codec class is itself fully statically initialized, so the default codec might be null... So relying on Codec.getDefault() in constructors of filter codecs may not work as expected! Maybe try it out, it was just an idea :-) Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Thursday, February 12, 2015 2:11 AM To: java-user@lucene.apache.org Subject: A codec moment or pickle I have a class that extends FilterCodec. Written against Lucene 4.9, it uses the Lucene49Codec. Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is read-only in 4.10. Is there some way to code one of these to get 'the default codec' and not have to chase versions?
Re: A codec moment or pickle
Robert, Let me lay out the scenario. Hardware has .5T of Index is relatively small. Application profiling shows a significant amount of time spent codec-ing. Options as I see them: 1. Use DPF complete with the irritation of having to have this spurious codec name in the on-disk format that has nothing to do with the on-disk format. 2. 'Officially' use the standard codec, and then use something like AOP to intercept and encapsulate it with the DPF or something else like it -- essentially, a do-it-myself alternative to convincing the community here that this is a use case worthy of support. 3. Find some way to move a significant amount of the data in question out of Lucene altogether into something else which fits nicely together with filling memory with a cache so that the amount of codeccing drops below the threshold of interest.
Re: A codec moment or pickle
WHOOPS. First sentence was, until just before I clicked 'send', Hardware has .5T of RAM. Index is relatively small (20g) ... On Thu, Feb 12, 2015 at 4:51 PM, Benson Margulies ben...@basistech.com wrote: Robert, Let me lay out the scenario. Hardware has .5T of Index is relatively small. Application profiling shows a significant amount of time spent codec-ing. Options as I see them: 1. Use DPF complete with the irritation of having to have this spurious codec name in the on-disk format that has nothing to do with the on-disk format. 2. 'Officially' use the standard codec, and then use something like AOP to intercept and encapsulate it with the DPF or something else like it -- essentially, a do-it-myself alternative to convincing the community here that this is a use case worthy of support. 3. Find some way to move a significant amount of the data in question out of Lucene altogether into something else which fits nicely together with filling memory with a cache so that the amount of codeccing drops below the threshold of interest.
Re: A codec moment or pickle
On Thu, Feb 12, 2015 at 8:43 AM, Robert Muir rcm...@gmail.com wrote: Honestly, I don't agree. I don't know what you are trying to do, but if you want file format backwards compat working, then you need a different FilterCodec to match each Lucene codec. Otherwise your codec is broken from a back compat standpoint. Wrapping the latest is an antipattern here. I understand this logic. It leaves me wandering between: 1: My old desire to convince you that there should be a way to do DirectPostingsFormat's caching without being a codec at all. Unfortunately, I got dragged away from the benchmarking that might have been persuasive. 2: The problem of deprecation. I give someone a jar-of-code that works fine with Lucene 4.9. It does not work with 4.10. Now, maybe the answer here is that the codec deprecation is fundamental to the definition of moving from 4.9 to 4.10, so having a codec means that I'm really married to a process of making releases that mirror Lucene releases. On Thu, Feb 12, 2015 at 5:33 AM, Benson Margulies ben...@basistech.com wrote: Based on reading the same comments you read, I'm pretty doubtful that Codec.getDefault() is going to work. It seems to me that this situation renders FilterCodec a bit hard to use, at least given the 'every release deprecates a codec' sort of pattern. On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, How about Codec.getDefault()? It does indeed not necessarily return the newest one (if somebody changes the default using Codec.setDefault()), but for your use case, wrapping the current default one, it should be fine. I have not tried this yet, but there might be a chicken-and-egg problem: - Your codec will have a separate name and be listed in META-INF as a service (I assume this). So it gets discovered by the Codec discovery process and is instantiated by that. - On loading the Codec framework, the call to Codec.getDefault() might get in at a time when the codecs are not yet fully initialized (because it will instantiate your codec while loading the META-INF). This happens before the Codec class is itself fully statically initialized, so the default codec might be null... So relying on Codec.getDefault() in constructors of filter codecs may not work as expected! Maybe try it out, it was just an idea :-) Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Thursday, February 12, 2015 2:11 AM To: java-user@lucene.apache.org Subject: A codec moment or pickle I have a class that extends FilterCodec. Written against Lucene 4.9, it uses the Lucene49Codec. Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is read-only in 4.10. Is there some way to code one of these to get 'the default codec' and not have to chase versions?
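A sketch of the pattern Robert advocates: pin the FilterCodec to one concrete codec generation instead of chasing a default. The class and codec names here are invented, and the class still needs a META-INF/services/org.apache.lucene.codecs.Codec entry to be discovered:

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene410.Lucene410Codec;
import org.apache.lucene.codecs.memory.DirectPostingsFormat;

public final class MyWrappedCodec extends FilterCodec {
    public MyWrappedCodec() {
        super("MyWrapped", new Lucene410Codec()); // pinned explicitly, not Codec.getDefault()
    }

    @Override
    public PostingsFormat postingsFormat() {
        return new DirectPostingsFormat(); // whatever behavior the wrapper exists to add
    }
}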
A codec moment or pickle
I have a class that extends FilterCodec. Written against Lucene 4.9, it uses the Lucene49Codec. Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is read-only in 4.10. Is there some way to code one of these to get 'the default codec' and not have to chase versions?
A really hairy token graph case
Consider a case where we have a token which can be subdivided in several ways. This can happen in German. We'd like to represent this with positionIncrement/positionLength, but it does not seem possible. Once the position has moved out from one set of 'subtokens', we see no way to move it back for the second set of alternatives. Is this something that was considered?
Re: A really hairy token graph case
I don't think so ... Let me be specific: First, consider the case of one 'analysis': an input token maps to a lemma and a sequence of components. So, we produce:

surface form
  lemma   PI 0
  comp1   PI 0
  comp2   PI 1

with PL set appropriately to cover the pieces. All the information is there. Now, if we have another analysis, we want to 'rewind' position, and deliver another lemma and another set of components, but, of course, we can't do that. The best we could do is something like:

surface form
  lemma1    PI 0
  lemma2    PI 0
  ...
  lemmaN    PI 0
  comp0-1   PI 0
  comp1-1   PI 0
  ...
  comp0-N   ... compM-N

That is, group all the first-components, and all the second-components. But now the bits and pieces of the compounds are interspersed. Maybe that's OK. On Fri, Oct 24, 2014 at 5:44 PM, Will Martin wmartin...@gmail.com wrote: HI Benson: This is the case with n-gramming (though you have a more complicated start chooser than most, I imagine). Does that help get your ideas unblocked? Will -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Friday, October 24, 2014 4:43 PM To: java-user@lucene.apache.org Subject: A really hairy token graph case Consider a case where we have a token which can be subdivided in several ways. This can happen in German. We'd like to represent this with positionIncrement/positionLength, but it does not seem possible. Once the position has moved out from one set of 'subtokens', we see no way to move it back for the second set of alternatives. Is this something that was considered?
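To make the PI/PL mechanics concrete, here is a hedged, self-contained filter that stacks one alternative token at the same position (posInc=0). The class name and the "alt" term are invented; this illustrates the machinery, not a solution to the multi-decomposition problem above:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

final class StackAltFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
    private State pending;

    StackAltFilter(TokenStream in) { super(in); }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {
            restoreState(pending);             // copy offsets etc. from the original token
            pending = null;
            termAtt.setEmpty().append("alt");  // the alternative surface form
            posIncAtt.setPositionIncrement(0); // same start position as the original
            posLenAtt.setPositionLength(1);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        pending = captureState();              // emit the alternative on the next call
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending = null;
    }
}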
Re: Why does this search fail?
Does Google actually support *? On Wed, Aug 27, 2014 at 9:54 AM, Milind mili...@gmail.com wrote: I see. This is going to be extremely difficult to explain to end users. It doesn't work as they would expect. Some of the tokenizing rules are already somewhat confusing. Their expectation is that it should work the way their searches work in Google. It's difficult enough to recognize that because the period is surrounded by a digit and a letter (as opposed to 2 digits or 2 letters), it gets tokenized. So I'd have expected that C0001.DevNm00* would effectively become a search for C0001 OR DevNm00*. But now, because of the presence of the wildcard, it's considered as 1 term and the period does not tokenize. That's actually good, but now the fact that it's still considered as 2 terms for wildcard searches makes it very unintuitive. I don't suppose that I can do anything about making wildcard search use multiple terms if joined together with a tokenizing character. But is there any way that I can force it to go through an analyzer prior to doing the search? On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky j...@basetechnology.com wrote: Sorry, but you can only use a wildcard on a single term. C0001.DevNm001 gets indexed as two terms, c0001 and devnm001, so your wildcard won't match any term (at least in this case.) Also, if your query term includes a wildcard, it will not be fully analyzed. Some filters, such as lower case, are defined as multi-term, so they will be performed, but the standard tokenizer is not being called, so the dot remains and this whole term is treated as one term, unlike the index analysis. -- Jack Krupansky -Original Message- From: Milind Sent: Tuesday, August 26, 2014 12:24 PM To: java-user@lucene.apache.org Subject: Why does this search fail? I have a field with the value C0001.DevNm001. If I search for:

C0001.DevNm001 -- Get Hit
DevNm00* -- Get Hit
C0001.DevNm00* -- Get No Hit

The field gets tokenized on the period since it's surrounded by a letter and a number. The query gets evaluated as a prefix query. I'd have thought that this should have found the document. Any clues on why this doesn't work? The full code is below.

Directory theDirectory = new RAMDirectory();
Version theVersion = Version.LUCENE_47;
Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
IndexWriterConfig theConfig = new IndexWriterConfig(theVersion, theAnalyzer);
IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);
String theFieldName = "Name";
String theFieldValue = "C0001.DevNm001";
Document theDocument = new Document();
theDocument.add(new TextField(theFieldName, theFieldValue, Field.Store.YES));
theWriter.addDocument(theDocument);
theWriter.close();
String theQueryStr = theFieldName + ":C0001.DevNm00*";
Query theQuery = new QueryParser(theVersion, theFieldName, theAnalyzer).parse(theQueryStr);
System.out.println(theQuery.getClass() + ", " + theQuery);
IndexReader theIndexReader = DirectoryReader.open(theDirectory);
IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
theSearcher.search(theQuery, collector);
ScoreDoc[] theHits = collector.topDocs().scoreDocs;
System.out.println("Hits found: " + theHits.length);

Output:
class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
Hits found: 0

-- Regards Milind -- Regards Milind
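One common workaround, not from the thread: index a second, untokenized copy of the field so a prefix query can run against the whole value. The "NameExact" field name is hypothetical, and the application lowercases the value itself to match query-side lowercasing:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

Document theDocument = new Document();
theDocument.add(new TextField("Name", "C0001.DevNm001", Field.Store.YES));
theDocument.add(new StringField("NameExact", "c0001.devnm001", Field.Store.NO)); // not tokenized
// ... index as before, then query the untokenized field:
Query theQuery = new PrefixQuery(new Term("NameExact", "c0001.devnm00"));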
Re: searching with stemming
You should construct an analysis chain that does what you need. Read the source of the relevant analyzer and pick the tokenizer and filter(s) that you need, and don't include stemming. On Mon, Jun 9, 2014 at 5:57 AM, Jamie ja...@mailarchiva.com wrote: Greetings Our app currently uses language specific analysers (e.g. EnglishAnalyzer, GermanAnalyzer, etc.). We need an option to disable stemming. What's the recommended way to do this? These analyzers do not include an option to disable stemming, only a parameter to specify a list of words for which stemming should not apply. Furthermore, my understanding is that the StandardAnalyzer is tied to English specifically. I am trying to avoid having to override each of these analyzers with an option to disable stemming. Is there a better alternative? Much appreciate your consideration. Jamie
Re: searching with stemming
Are you using Solr? If so, you are on the wrong mailing list. If not, why do you need a non-anonymous analyzer at all? On Jun 9, 2014 6:55 AM, Jamie ja...@mailarchiva.com wrote: To me, it seems strange that these default analyzers don't provide constructors that enable one to override stemming, etc.? On 2014/06/09, 12:39 PM, Trejkaz wrote: On Mon, Jun 9, 2014 at 7:57 PM, Jamie ja...@mailarchiva.com wrote: Greetings Our app currently uses language specific analysers (e.g. EnglishAnalyzer, GermanAnalyzer, etc.). We need an option to disable stemming. What's the recommended way to do this? These analyzers do not include an option to disable stemming, only a parameter to specify a list of words for which stemming should not apply. Furthermore, my understanding is that the StandardAnalyzer is tied to English specifically.
Re: searching with stemming
Analyzer classes are optional; an analyzer is just a factory for a set of token stream components. You can usually do just fine with an anonymous class. Or, in your case, the only thing different for each language will be the stop words, so you can have one analyzer class with a language parameter. On Jun 9, 2014 7:02 AM, Jamie ja...@mailarchiva.com wrote: I am not using Solr. I am using the default analyzers... On 2014/06/09, 12:59 PM, Benson Margulies wrote: Are you using Solr? If so, you are on the wrong mailing list. If not, why do you need a non-anonymous analyzer at all? On Jun 9, 2014 6:55 AM, Jamie ja...@mailarchiva.com wrote: To me, it seems strange that these default analyzers don't provide constructors that enable one to override stemming, etc.? On 2014/06/09, 12:39 PM, Trejkaz wrote: On Mon, Jun 9, 2014 at 7:57 PM, Jamie ja...@mailarchiva.com wrote: Greetings Our app currently uses language specific analysers (e.g. EnglishAnalyzer, GermanAnalyzer, etc.). We need an option to disable stemming. What's the recommended way to do this? These analyzers do not include an option to disable stemming, only a parameter to specify a list of words for which stemming should not apply. Furthermore, my understanding is that the StandardAnalyzer is tied to English specifically.
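A hedged sketch of the "one analyzer class with a language parameter" idea: tokenize, lowercase, and stop, with no stemming stage. The class name is invented and 4.x constructor signatures (with Version) are assumed:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

final class NoStemAnalyzer extends Analyzer {
    private final Version version;
    private final CharArraySet stopWords;

    NoStemAnalyzer(Version version, CharArraySet stopWords) {
        this.version = version;
        this.stopWords = stopWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(version, reader);
        TokenStream result = new LowerCaseFilter(version, source);
        result = new StopFilter(version, result, stopWords);
        // no stemming filter: that is the one stage being left out
        return new TokenStreamComponents(source, result);
    }
}
// usage: new NoStemAnalyzer(Version.LUCENE_47, GermanAnalyzer.getDefaultStopSet())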
Re: Confuse with Kuromoji
You must know what language each text is in, and use an appropriate analyzer. Some people do this by using a separate field (text_eng, text_spa, text_jpn). Other people put some extra information at the beginning of the field, and then make an analyzer that peeks at it in order to dispatch to the correct tokenizer. On Sat, Apr 5, 2014 at 9:59 PM, j7a42e4fd7...@softbank.ne.jp wrote: I am pretty new to Lucene, but I have no problem understanding what it is about. My big problem is trying to understand how Kuromoji works. I need to implement search functionality that initially supports English, Spanish and Japanese. The first two don't seem to be a problem, as I can just use analyzers-common to index both languages' content, but when it comes to Japanese it has its own analyzer. I couldn't find any clues about combining analyzers, so I still don't know if I can combine all languages under the same index (which would be ideal, as I expect mixed searches in the context of my project) or whether I have to detect the language first and then index Japanese texts separately (which would be a big disadvantage when it comes to mixed searches and future localization expansion). I found out about Lucene through Kuromoji; it would be great to find a solution that lets me use all the greatness that Lucene offers.
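A minimal sketch of the field-per-language approach using PerFieldAnalyzerWrapper from analyzers-common; the field names and the Version constant are assumptions:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.es.SpanishAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

Version v = Version.LUCENE_47;
Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
perField.put("text_eng", new EnglishAnalyzer(v));
perField.put("text_spa", new SpanishAnalyzer(v));
perField.put("text_jpn", new JapaneseAnalyzer(v)); // from the kuromoji module
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(v), perField);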
Re: Confuse with Kuromoji
On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat herb.roitb...@orcatec.com wrote: Just curious, what are some of the things that people do to properly tokenize the queries with mixed language collections? What do you do with mixed language queries? You can either force the user to tell you the language, or ... you can run a language detector. They are less accurate for short strings, or ... you can process it in _all_ of the languages and OR up the results. On 4/6/2014 4:51 AM, Benson Margulies wrote: You must know what language each text is in, and use an appropriate analyzer. Some people do this by using a separate field (text_eng, text_spa, text_jpn). Other people put some extra information at the beginning of the field, and then make an analyzer that peeks at it in order to dispatch to the correct tokenizer. On Sat, Apr 5, 2014 at 9:59 PM, j7a42e4fd7...@softbank.ne.jp wrote: I am pretty new to Lucene, but I have no problem understanding what it is about. My big problem is trying to understand how Kuromoji works. I need to implement search functionality that initially supports English, Spanish and Japanese. The first two don't seem to be a problem, as I can just use analyzers-common to index both languages' content, but when it comes to Japanese it has its own analyzer. I couldn't find any clues about combining analyzers, so I still don't know if I can combine all languages under the same index (which would be ideal, as I expect mixed searches in the context of my project) or whether I have to detect the language first and then index Japanese texts separately (which would be a big disadvantage when it comes to mixed searches and future localization expansion). I found out about Lucene through Kuromoji; it would be great to find a solution that lets me use all the greatness that Lucene offers.
Re: Custom Tokenizer/Analyzer
It sounds like you've been asked to implement Named Entity Recognition. OpenNLP has some capability here. There are also, um, commercial alternatives. On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio ye.pe...@gmail.com wrote: On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar geetgang...@gmail.com wrote: Hi, My requirement is that it should have the capability to match multiple words as one token. For example, when the user passes the string International Business Machine logo or IBM logo, it should return International Business Machine as one token and logo as one token. This is an interesting problem. I suppose that if the user enters International Business Machines, possibly with some misspelling, you want to find all documents containing IBM - and that if he enters the string IBM, you want to find documents which contain the string International Business Machines, or even only parts of it. So this means you need some kind of map relating some acronyms with their content parts. There really are two directions here: acronym to content and content to acronym. One cannot find what an acronym means without some kind of acronym dictionary. This means that whatever approach you intend to use, there should be an external dictionary involved, which, for each acronym, would map a list of possible phrases. Retrieving all phrases matching the inputted acronym, you'd inject each part of each phrase as a token (removing possible duplicates between phrase parts). That's basically it for the direction acronym to content. The direction content to acronym is trickier, I believe. One way is to generate a second (reversed) map, matching each acronym content part to a list of acronyms containing that part. You'd simply inject acronyms (and possibly other things) if one part of their content is matched (or more than one part, if you want to increase relevance). This could however possibly require the definition of a specific hashing mechanism, if you want to find approximate (distanced) keys (e.g. intenational, with the lacking r, would still find IBM). A second way (more coupled to the concept of acronym, so less generic) could be to consider that every word starting with a capital letter is part of an acronym, buffering sequences of words starting with a capital letter, and eventually injecting the resulting acronym, if found in the acronym dictionary. This might not be safe, though - the user may not have the discipline to capitalize the words being part of an acronym (or may even misspell the first letter), or concatenated first letters could match an irrelevant acronym (many word sequences can give the acronym IBM). I do not know whether there already exists some Lucene module which processes acronyms, or if someone is working on one. It's definitely worth a search though, because writing a good one from scratch could mean a few days of work, or more. HTH.
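As a lighter-weight alternative to full NER, Lucene's SynonymFilter can make an acronym and its expansion match; a hedged sketch of building the map with the 4.x API, where multi-word expansions are joined with SynonymMap.Builder.join:

import java.io.IOException;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries
CharsRef expansion = SynonymMap.Builder.join(
        new String[] {"international", "business", "machines"}, new CharsRef());
builder.add(new CharsRef("ibm"), expansion, true); // true: keep the original token too
SynonymMap map = builder.build(); // throws IOException
// in createComponents: result = new SynonymFilter(result, map, true /* ignoreCase */);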
Re: LUCENE-5388 AbstractMethodError
If you are sensitive to things being committed to trunk, that suggests that you are building your own jars and using the trunk. Are you perfectly sure that you have built, and are using, a consistent set of jars? It looks as if you've got some trunk-y stuff and some 4.6.1 stuff. On Thu, Jan 30, 2014 at 6:51 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Uwe, The bug occurred only after LUCENE-5388 was committed to trunk; it looks like it's the changes to Analyzer and friends. The full stack trace is not much more helpful:

java.lang.AbstractMethodError
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:140)
    at io.openindex.lucene.analysis.util.QueryDigest.unigrams(QueryDigest.java:196)
    at io.openindex.lucene.analysis.util.QueryDigest.calculate(QueryDigest.java:135)
    at io.openindex.solr.handler.QueryDigestRequestHandler.handleRequestBody(QueryDigestRequestHandler.java:56)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:785)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:724)

Here's the consumer code, where the exception begins:

TokenStream stream = analyzer.tokenStream(null, new StringReader(input));
We test trunk with our custom stuff as well, but all our custom stuff is nicely built with Maven against the most recent release of Solr and/or Lucene. If that stays a problem we may have to build stuff against branch_4x instead. Thanks, Markus -Original message- From: Uwe Schindler u...@thetaphi.de Sent: Thursday 30th January 2014 11:18 To: java-user@lucene.apache.org Subject: RE: LUCENE-5388 AbstractMethodError Hi, Can you please post your complete stack trace? I have no idea what LUCENE-5388 has to do with that error. Please make sure that all your Analyzers and all of your Solr installation use only *one set* of Lucene/Solr JAR files from *one* version. Mixing Lucene/Solr JARs and mixing with Factories compiled against older versions does not work. You have to keep all in sync, and then all should be fine. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent:
Re: How is incrementToken supposed to detect the lack of reset()?
If you'd like to join in on the doc, see https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to grant you access to push to my fork. On Wed, Jan 8, 2014 at 5:37 AM, Mindaugas Žakšauskas min...@gmail.com wrote: Just for the interest, I had a similar problem too, as have other people [1]. In my project, I am extending the Tokenizer class and have another tokenizer (e.g. ClassicTokenizer) as a delegate. Unfortunately, properly overriding all public/protected methods is *not* enough, e.g.:

public void reset() throws IOException {
    super.reset();
    delegate.reset();
}

I was still getting the exception about a broken read()/close() contract. Half a day and *lots* of debugging later, I realized that the exception is thrown only when indexing the second document, as the delegate's reader internally gets replaced with ILLEGAL_STATE_READER after .close() is called. My solution to this problem was to make the reset() method like this:

public void reset() throws IOException {
    super.reset();
    delegate.setReader(input);
    delegate.reset();
}

Another thing worth mentioning is that it's crucial to have super.method() before delegate.method() in all overridden methods. Would be nice if all of this was somewhere in the Tokenizer Javadoc, or even nicer if the base class was designed with delegation in mind (Effective Java (2nd edition), Item 16). Hope this helps somebody. [1] http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673 Regards, Mindaugas On Tue, Jan 7, 2014 at 9:45 PM, Benson Margulies ben...@basistech.com wrote: Yes I Do. On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir rcm...@gmail.com wrote: Benson, do you want to open an issue to fix this constructor to not take Reader? (there might be one already, but let's make a new one). These things are supposed to be reused, and have setReader for that purpose. I think it's confusing and contributes to bugs that you have to have logic in e.g. the ctor THEN ALSO in reset(). If someone does it correctly in the ctor, but they only test one time, they might think everything is working.. On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies ben...@basistech.com wrote: For the record of other people who implement tokenizers: Say that your tokenizer has a constructor, like:

public MyTokenizer(Reader reader, ...) {
    super(reader);
    myWrappedInputDevice = new MyWrappedInputDevice(reader);
}

Not a good idea. Tokenizer carefully manages the data flow from the constructor arg to the 'input' field. The correct form is:

public MyTokenizer(Reader reader, ...) {
    super(reader);
    myWrappedInputDevice = new MyWrappedInputDevice(this.input);
}

On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir rcm...@gmail.com wrote: See Tokenizer.java for the state machine logic. In general you should not have to do anything if the tokenizer is well-behaved (e.g. close calls super.close() and so on). On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies bimargul...@gmail.com wrote: In 4.6.0, org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException fails if incrementToken fails to throw if there's a missing reset. How am I supposed to organize this in a Tokenizer? A quick look at CharTokenizer did not reveal any code for the purpose.
Re: How is incrementToken supposed to detect the lack of reset()?
I'm not in the delegate business, just a straight subclass. So I think they are complementary. Gimme your github identity, and you are, as far as I am concerned, more than welcome to add a section on delegates. On Wed, Jan 8, 2014 at 7:38 AM, Mindaugas Žakšauskas min...@gmail.com wrote: Hi, Sure, why not - I'm just not sure if my approach (of setting the reader in reset()) is preferred over yours (using this.input instead of input in the ctor)? Or are they both equally good? m. On Wed, Jan 8, 2014 at 12:18 PM, Benson Margulies ben...@basistech.com wrote: If you'd like to join in on the doc, see https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to grant you access to push to my fork.
How is incrementToken supposed to detect the lack of reset()?
In 4.6.0, org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException fails if incrementToken fails to throw if there's a missing reset. How am I supposed to organize this in a Tokenizer? A quick look at CharTokenizer did not reveal any code for the purpose.
Re: How is incrementToken supposed to detect the lack of reset()?
For the record of other people who implement tokenizers: Say that your tokenizer has a constructor, like:

public MyTokenizer(Reader reader, ...) {
    super(reader);
    myWrappedInputDevice = new MyWrappedInputDevice(reader);
}

Not a good idea. Tokenizer carefully manages the data flow from the constructor arg to the 'input' field. The correct form is:

public MyTokenizer(Reader reader, ...) {
    super(reader);
    myWrappedInputDevice = new MyWrappedInputDevice(this.input);
}

On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir rcm...@gmail.com wrote: See Tokenizer.java for the state machine logic. In general you should not have to do anything if the tokenizer is well-behaved (e.g. close calls super.close() and so on). On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies bimargul...@gmail.com wrote: In 4.6.0, org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException fails if incrementToken fails to throw if there's a missing reset. How am I supposed to organize this in a Tokenizer? A quick look at CharTokenizer did not reveal any code for the purpose.
Re: How is incrementToken supposed to detect the lack of reset()?
Yes I Do. On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir rcm...@gmail.com wrote: Benson, do you want to open an issue to fix this constructor to not take Reader? (there might be one already, but let's make a new one). These things are supposed to be reused, and have setReader for that purpose. I think it's confusing and contributes to bugs that you have to have logic in e.g. the ctor THEN ALSO in reset(). If someone does it correctly in the ctor, but they only test one time, they might think everything is working.. On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies ben...@basistech.com wrote: For the record of other people who implement tokenizers: Say that your tokenizer has a constructor, like:

public MyTokenizer(Reader reader, ...) {
    super(reader);
    myWrappedInputDevice = new MyWrappedInputDevice(reader);
}

Not a good idea. Tokenizer carefully manages the data flow from the constructor arg to the 'input' field. The correct form is:

public MyTokenizer(Reader reader, ...) {
    super(reader);
    myWrappedInputDevice = new MyWrappedInputDevice(this.input);
}

On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir rcm...@gmail.com wrote: See Tokenizer.java for the state machine logic. In general you should not have to do anything if the tokenizer is well-behaved (e.g. close calls super.close() and so on). On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies bimargul...@gmail.com wrote: In 4.6.0, org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException fails if incrementToken fails to throw if there's a missing reset. How am I supposed to organize this in a Tokenizer? A quick look at CharTokenizer did not reveal any code for the purpose.
Where is the source for the .dat files in Kuromoji?
There are a handful of binary files in ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames ending in .dat. Trawling around in the source, it seems as if at least one of these derives from a source file named unk.def. In turn, this file comes from a dependency. Should the build generate the file rather than having it in the tree and shipped as part of the source release?
Re: Where is the source for the .dat files in Kuromoji?
Thanks. On Mon, Dec 2, 2013 at 12:21 PM, Uwe Schindler u...@thetaphi.de wrote: Hi Benson, If you run ant regenerate, it downloads the source files (which is ant download-dict) and then rebuilds (ant build-dict) the FSTs and other binary stuff stored in the .dat files. See also the ivy.xml. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:ben...@basistech.com] Sent: Monday, December 02, 2013 6:12 PM To: java-user@lucene.apache.org; Christian Moen Subject: Where is the source for the .dat files in Kuromoji? There are a handful of binary files in ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames ending in .dat. Trawling around in the source, it seems as if at least one of these derives from a source file named unk.def. In turn, this file comes from a dependency. Should the build generate the file rather than having it in the tree and shipped as part of the source release?
Re: Where is the source for the .dat files in Kuromoji?
On Mon, Dec 2, 2013 at 6:27 PM, Christian Moen c...@atilika.com wrote: Hello Benson, The sources for the .dat files are available from https://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz http://atilika.com/releases/mecab-ipadic/mecab-ipadic-2.7.0-20070801.tar.gz and a range of other places. I’m not sure I follow what you’re saying regarding unk.def -- it’s to my knowledge used as-is from the above sources when the binary .dat files are made. (See lucene/analysis/kuromoji/src/tools in the Lucene code tree.) Perhaps I’m missing something. Could you clarify how you think things should be done? I'm not clear that there's anything that anyone would complain of. The question is, are the .dat files part of the source bundle that is the 'official release'? I just fetched from git, not from the official release, so I don't know. Many thanks, Christian Moen アティリカ株式会社 http://www.atilika.com On Dec 3, 2013, at 2:11 AM, Benson Margulies ben...@basistech.com wrote: There are a handful of binary files in ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames ending in .dat. Trawling around in the source, it seems as if at least one of these derives from a source file named unk.def. In turn, this file comes from a dependency. Should the build generate the file rather than having it in the tree and shipped as part of the source release?
Re: Modify the StandardTokenizerFactory to concatenate all words
How would you expect to recognize that 'Toy Story' is a thing? On Tue, Nov 5, 2013 at 6:32 PM, Kevin glidekensing...@gmail.com wrote: Currently I'm using StandardTokenizerFactory, which tokenizes the words based on spaces. For Toy Story it will create the tokens toy and story. Ideally, I would want to extend the functionality of StandardTokenizerFactory to create the tokens toy, story, and toy story. How do I do that?
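If the goal is just to emit adjacent-word pairs as extra tokens, without knowing which phrases are meaningful, ShingleFilter from analyzers-common does that; a hedged sketch, with the Version constant assumed:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

Analyzer shingles = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_45, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_45, source);
        result = new ShingleFilter(result, 2, 2); // bigrams; unigrams are kept by default
        return new TokenStreamComponents(source, result);
    }
};
// "Toy Story" now yields the tokens "toy", "story", and "toy story"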
Threads and LuceneTestCase in 3.6.0
I just backported some code to 3.6.0, and it includes tests that use org.apache.lucene.analysis.BaseTokenStreamTestCase#checkRandomData(java.util.Random, org.apache.lucene.analysis.Analyzer, int, int) The tests that use this method fail in 3.6.0 in ways that suggest that multiple threads are hitting my token filter, which it's not intended to support. I've never had a failure like that with 4.1 - 4.5. Does anyone recall if anything changed here?
Re: new consistency check for token filters in 4.5.1
OK, thanks. For some reason the test of my tokenizer didn't fail, but the test of my token filter with my tokenizer hit the problem. All fixed. On Wed, Oct 30, 2013 at 2:23 AM, Uwe Schindler u...@thetaphi.de wrote: I think this is more a result of the Tokenizer on top not correctly implementing end(). In Lucene 4.6 you will get much better error messages (IllegalStateException) because we improved this detection, also during runtime. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:ben...@basistech.com] Sent: Wednesday, October 30, 2013 12:30 AM To: java-user@lucene.apache.org Subject: new consistency check for token filters in 4.5.1 My token filter has no end() method at all. Am I required to have an end() method? BaseLinguisticsTokenFilterTest.testSegmentationReadings:175- Assert.assertTrue:41-Assert.fail:88 super.end()/clearAttributes() was not called correctly in end() BaseLinguisticsTokenFilterTest.testSpacesInLemma:189- Assert.assertTrue:41-Assert.fail:88 super.end()/clearAttributes() was not called correctly in end()
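For reference, a hedged sketch of a well-behaved end() in a TokenFilter: call super.end() first (which propagates end() down the chain), then make any filter-specific final adjustments. The class name is invented:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

final class WellBehavedFilter extends TokenFilter {
    WellBehavedFilter(TokenStream in) { super(in); }

    @Override
    public boolean incrementToken() throws IOException {
        return input.incrementToken(); // pass-through, for illustration only
    }

    @Override
    public void end() throws IOException {
        super.end(); // required: propagates end() to the upstream tokenizer
        // filter-specific end-of-stream attribute adjustments would go here
    }
}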
new consistency check for token filters in 4.5.1
My token filter has no end() method at all. Am I required to have an end() method?

BaseLinguisticsTokenFilterTest.testSegmentationReadings:175-Assert.assertTrue:41-Assert.fail:88 super.end()/clearAttributes() was not called correctly in end()
BaseLinguisticsTokenFilterTest.testSpacesInLemma:189-Assert.assertTrue:41-Assert.fail:88 super.end()/clearAttributes() was not called correctly in end()
Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?
I'm working on a tool that wants to construct analyzers 'at arm's length' -- a bit like from a Solr schema -- so that multiple dueling analyzers could be in their own class loaders at one time. I want to just define a simple configuration for char filters, a tokenizer, and token filters. So it would be, well, convenient if there were a tokenizer factory at the Lucene level, as there is a token filter factory. I can use Solr easily enough for now, but I'd consider it cleaner if I could define this entirely at the Lucene level.
Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?
OK, so, here I go again making a public idiot of myself. Could it be that the tokenizer factory is 'relatively recent', as in since 4.1? On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies ben...@basistech.com wrote: I'm working on a tool that wants to construct analyzers 'at arm's length' -- a bit like from a Solr schema -- so that multiple dueling analyzers could be in their own class loaders at one time. I want to just define a simple configuration for char filters, a tokenizer, and token filters. So it would be, well, convenient if there were a tokenizer factory at the Lucene level, as there is a token filter factory. I can use Solr easily enough for now, but I'd consider it cleaner if I could define this entirely at the Lucene level.
Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?
Just how 'experimental' is the SPI system at this point, if that's a reasonable question? On Mon, Oct 28, 2013 at 8:41 AM, Uwe Schindler u...@thetaphi.de wrote: Hi Benson, the base factory class and the abstract Tokenizer, TokenFilter and CharFilter factory classes are all in Lucene's analyzers-common module (since 4.0). They are no longer part of Solr. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:ben...@basistech.com] Sent: Monday, October 28, 2013 12:41 PM To: java-user@lucene.apache.org Subject: Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene? OK, so, here I go again making a public idiot of myself. Could it be that the tokenizer factory is 'relatively recent', as in since 4.1? On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies ben...@basistech.com wrote: I'm working on a tool that wants to construct analyzers 'at arm's length' -- a bit like from a Solr schema -- so that multiple dueling analyzers could be in their own class loaders at one time. I want to just define a simple configuration for char filters, a tokenizer, and token filters. So it would be, well, convenient if there were a tokenizer factory at the Lucene level, as there is a token filter factory. I can use Solr easily enough for now, but I'd consider it cleaner if I could define this entirely at the Lucene level.
Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?
We have been in the habit of naming classes on the theory that Java packages are doing work in the namespace. So, we'd name a class: com.basistech.something.BaseLinguisticsTokenFilterFactory So that means that our name in the SPI system is just 'BaseLinguistics'. That seems a bit problematic. I don't suppose there are some guidelines? On Mon, Oct 28, 2013 at 9:43 AM, Benson Margulies ben...@basistech.com wrote: Just how 'experimental' is the SPI system at this point, if that's a reasonable question? On Mon, Oct 28, 2013 at 8:41 AM, Uwe Schindler u...@thetaphi.de wrote: Hi Benson, the base factory class and the abstract Tokenizer, TokenFilter and CharFilter factory classes are all in Lucene's analyzers-common module (since 4.0). They are no longer part of Solr. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:ben...@basistech.com] Sent: Monday, October 28, 2013 12:41 PM To: java-user@lucene.apache.org Subject: Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene? OK, so, here I go again making a public idiot of myself. Could it be that the tokenizer factory is 'relatively recent', as in since 4.1? On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies ben...@basistech.com wrote: I'm working on a tool that wants to construct analyzers 'at arm's length' -- a bit like from a Solr schema -- so that multiple dueling analyzers could be in their own class loaders at one time. I want to just define a simple configuration for char filters, a tokenizer, and token filters. So it would be, well, convenient if there were a tokenizer factory at the Lucene level, as there is a token filter factory. I can use Solr easily enough for now, but I'd consider it cleaner if I could define this entirely at the Lucene level.
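For the record, the analysis SPI derives the lookup key by stripping the factory suffix from the simple class name, so registration and lookup look roughly like this; the luceneMatchVersion value is an assumption (most 4.x factories require it):

// in META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory:
//   com.basistech.something.BaseLinguisticsTokenFilterFactory

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.util.TokenFilterFactory;

Map<String, String> args = new HashMap<String, String>();
args.put("luceneMatchVersion", "4.5");
TokenFilterFactory factory = TokenFilterFactory.forName("BaseLinguistics", args);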
Anyone interested in a worked-out example of the SPIs for analyzer components?
I just built myself a sort of Solr-schema-in-a-test-tube. It's a class that builds a classloader on some JAR files and then uses the SPI mechanism to manufacture Analyzer objects made out of tokenizers and filters. I can make this visible in github, or even attach it to a JIRA, if anyone is interested. For my own nefarious reasons, this acquires the JAR files from Maven repositories via Aether, but it wouldn't be hard to adjust for use with plain old pathnames or something.
Re: Handling special characters in Lucene 4.0
It might be helpful if you would explain, at a higher level, what you are trying to accomplish. Where do these things come from? What higher-level problem are you trying to solve? On Sun, Oct 20, 2013 at 7:12 PM, saisantoshi saisantosh...@gmail.com wrote: Thanks. So, if I understand correctly, StandardAnalyzer won't work for the following below, as it strips out the special characters and searches only on searchText (in this case). queryText = *searchText* If we want to do a search like *** then we need to use WhitespaceAnalyzer. Please let me know if my understanding is correct. Also, I am not sure, as the following is mentioned in the Lucene docs. Is the below not for StandardAnalyzer then? It is not mentioned that it won't work for StandardAnalyzer. /* Escaping Special Characters Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is + - && || ! ( ) { } [ ] ^ " ~ * ? : \ / To escape these characters use the \ before the character. For example, to search for (1+1):2 use the query: \(1\+1\)\:2 */ Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/Handling-special-characters-in-Lucene-4-0-tp4096674p4096727.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Exploiting a whole lot of memory
On Wed, Oct 9, 2013 at 7:18 PM, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Oct 9, 2013 at 7:13 PM, Benson Margulies ben...@basistech.com wrote: On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless luc...@mikemccandless.com wrote: DirectPostingsFormat? It stores all terms + postings as simple java arrays, uncompressed. This definitely speeded things up in my benchmark, but I'm greedy for more. I just made a codec that returns it as the postings guy; is that the whole recipe? Does it make sense to extend it any further to any of the other codec pieces? Yes, that's all you should need to do (you should have seen RAM usage go up too, to confirm :) ). Really this just addressed one hotspot (decoding terms/postings from the index); the query matching + scoring is also costly, and if you do other stuff (highlighting, spell correction) that can be costly too ... what kind of queries are you running / where are the hotspots in profiling? Profiling shows a lot of time in org.apache.lucene.search.BooleanScorer$BooleanScorerCollector.collect(int). We know that a typical query inspects about 1/2 of the documents in the index. Mike McCandless http://blog.mikemccandless.com
Re: Exploiting a whole lot of memory
On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless luc...@mikemccandless.com wrote: DirectPostingsFormat? It stores all terms + postings as simple java arrays, uncompressed. This definitely speeded things up in my benchmark, but I'm greedy for more. I just made a codec that returns it as the postings guy; is that the whole recipe? Does it make sense to extend it any further to any of the other codec pieces? Mike McCandless http://blog.mikemccandless.com On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies ben...@basistech.com wrote: Consider a Lucene index consisting of 10m documents with a total disk footprint of 3G. Consider an application that treats this index as read-only, and runs very complex queries over it. Queries with many terms, some of them 'fuzzy' and 'should' terms, and a dismax. And, finally, consider doing all this on a box with over 100G of physical memory, some cores, and nothing else to do with its time. I should probably just stop here and see what thoughts come back, but I'll go out on a limb and type the word 'codec'. The MMapDirectory, of course, cheerfully gets to keep every single bit in memory. And then each query runs, exercising the codec, building up a flurry of Java objects, all of which turn into garbage, and we start all over. So, I find myself wondering, is there some sort of an opportunity for a codec-that-caches in here? In other words, I'd like to sell some of my space to buy some time.
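A hedged sketch of "a codec that returns it as the postings guy": override the per-field hook on the current default codec and set it on the writer. The index must be written (or rewritten) with this codec for readers to see DirectPostingsFormat, and the codec class and the version/analyzer variables are assumptions for a 4.5-era setup:

import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene45.Lucene45Codec;
import org.apache.lucene.codecs.memory.DirectPostingsFormat;
import org.apache.lucene.index.IndexWriterConfig;

IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
config.setCodec(new Lucene45Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
        return new DirectPostingsFormat(); // terms + postings as plain arrays in RAM
    }
});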
Re: Exploiting a whole lot of memory
On Wed, Oct 9, 2013 at 7:18 PM, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Oct 9, 2013 at 7:13 PM, Benson Margulies ben...@basistech.com wrote: On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless luc...@mikemccandless.com wrote: DirectPostingsFormat? It stores all terms + postings as simple java arrays, uncompressed. This definitely speeded things up in my benchmark, but I'm greedy for more. I just made a codec that returns it as the postings guy; is that the whole recipe? Does it make sense to extend it any further to any of the other codec pieces? Yes, that's all you should need to do (you should have seen RAM usage go up too, to confirm :) ). Yes I did that and saw that. Really this just addressed one hotspot (decoding terms/postings from the index); the query matching + scoring is also costly, and if you do other stuff (highlighting, spell correction) that can be costly too ... what kind of queries are you running / where are the hotspots in profiling? No 'other stuff', just matching and scoring -- of an embarrassingly complex query. I will post some results of profiling tomorrow. I had profiled extensively with Lucene 3; we just got the code moved to Lucene 4.3, and the very first thing I did was run this. In Lucene 3 there was a very busy PriorityQueue in there somewhere; but I don't want to waste time and bandwidth on details until they are 4.x details. Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Analyzer classes versus the constituent components
Is there some advice around about when it's appropriate to create an Analyzer class, as opposed to just Tokenizer and TokenFilter classes? The advantage of the constituent elements is that they allow the consuming application to add more filters. The only disadvantage I see is that the following is a bit on the verbose side. Is there some advantage or use of an Analyzer class that I'm missing?

private Analyzer newAnalyzer() {
    return new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = tokenizerFactory.create(reader, LanguageCode.JAPANESE);
            com.basistech.rosette.bl.Analyzer rblAnalyzer;
            try {
                rblAnalyzer = analyzerFactory.create(LanguageCode.JAPANESE);
            } catch (IOException e) {
                throw new RuntimeException("Error creating RBL analyzer", e);
            }
            BaseLinguisticsTokenFilter filter = new BaseLinguisticsTokenFilter(source, rblAnalyzer);
            return new TokenStreamComponents(source, filter);
        }
    };
}

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Exploiting a whole lot of memory
Consider a Lucene index consisting of 10m documents with a total disk footprint of 3G. Consider an application that treats this index as read-only, and runs very complex queries over it. Queries with many terms, some of them 'fuzzy' and 'should' terms and a dismax. And, finally, consider doing all this on a box with over 100G of physical memory, some cores, and nothing else to do with its time. I should probably just stop here and see what thoughts come back, but I'll go out on a limb and type the word 'codec'. The MMapDirectory, of course, cheerfully gets to keep every single bit in memory. And then each query runs, exercising the codec, building up a flurry of Java objects, all of which turn into garbage and we start all over. So, I find myself wondering, is there some sort of an opportunity for a codec-that-caches in here? In other words, I'd like to sell some of my space to buy some time.
Re: Exploiting a whole lot of memory
Mike, where do I find DirectPostingFormat? On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless luc...@mikemccandless.com wrote: DirectPostingsFormat? It stores all terms + postings as simple java arrays, uncompressed. Mike McCandless http://blog.mikemccandless.com On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies ben...@basistech.com wrote: Consider a Lucene index consisting of 10m documents with a total disk footprint of 3G. Consider an application that treats this index as read-only, and runs very complex queries over it. Queries with many terms, some of them 'fuzzy' and 'should' terms and a dismax. And, finally, consider doing all this on a box with over 100G of physical memory, some cores, and nothing else to do with its time. I should probably just stop here and see what thoughts come back, but I'll go out on a limb and type the word 'codec'. The MMapDirectory, of course, cheerfully gets to keep every single bit in memory. And then each query runs, exercising the codec, building up a flurry of Java objects, all of which turn into garbage and we start all over. So, I find myself wondering, is there some sort of an opportunity for a codec-that-caches in here? In other words, I'd like to sell some of my space to buy some time. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Exploiting a whole lot of memory
Oh, drat, I left out an 's'. I got it now. On Tue, Oct 8, 2013 at 7:40 PM, Benson Margulies ben...@basistech.com wrote: Mike, where do I find DirectPostingFormat? On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless luc...@mikemccandless.com wrote: DirectPostingsFormat? It stores all terms + postings as simple java arrays, uncompressed. Mike McCandless http://blog.mikemccandless.com On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies ben...@basistech.com wrote: Consider a Lucene index consisting of 10m documents with a total disk footprint of 3G. Consider an application that treats this index as read-only, and runs very complex queries over it. Queries with many terms, some of them 'fuzzy' and 'should' terms and a dismax. And, finally, consider doing all this on a box with over 100G of physical memory, some cores, and nothing else to do with its time. I should probably just stop here and see what thoughts come back, but I'll go out on a limb and type the word 'codec'. The MMapDirectory, of course, cheerfully gets to keep every single bit in memory. And then each query runs, exercising the codec, building up a flurry of Java objects, all of which turn into garbage and we start all over. So, I find myself wondering, is there some sort of an opportunity for a codec-that-caches in here? In other words, I'd like to sell some of my space to buy some time. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to make good use of the multithreaded IndexSearcher?
On Tue, Oct 1, 2013 at 3:58 PM, Desidero desid...@gmail.com wrote: Benson, Rather than forcing a random number of small segments into the index using maxMergedSegmentMB, it might be better to split your index into multiple shards. You can create a specific number of balanced shards to control the parallelism and then forceMerge each shard down to 1 segment to avoid spawning extra threads per shard. Once that's done, you just open all of the shards with a MultiReader and use that with the IndexSearcher and an ExecutorService. The downside to this is that it doesn't play nicely with near real-time search, but if you have a relatively static index that gets pushed to slaves periodically it gets the job done. As Mike said, it'd be nicer if there was a way to split the docID space into virtual shards, but it's not currently available. I'm not sure if anyone is even looking into it. Thanks, folks, for all the help. I'm musing about the top-level issue here, which is whether the important case is many independent queries or latency of just one. In the case where it's just one, we'll follow the shard-related advice. Regards, Matt On Tue, Oct 1, 2013 at 7:09 AM, Michael McCandless luc...@mikemccandless.com wrote: You might want to set a smallish maxMergedSegmentMB in TieredMergePolicy to force enough segments in the index ... sort of the opposite of optimizing. Really, IndexSearcher's approach to using one thread per segment is rather silly, and, it's annoying/bad to expose change in behavior due to segment structure. I think it'd be better to carve up the overall docID space into N virtual shards. Ie, if you have 100M docs, then one thread searches docs 0-10M, another 10M-20M, etc. Nobody has created such a searcher impl but it should not be hard and it would be agnostic to the segment structure. But then again, this need (using concurrent hardware to reduce latency of a single query) is somewhat rare; most apps are fine using the concurrency across queries rather than within one query. Mike McCandless http://blog.mikemccandless.com On Tue, Oct 1, 2013 at 7:09 AM, Adrien Grand jpou...@gmail.com wrote: Hi Benson, On Mon, Sep 30, 2013 at 5:21 PM, Benson Margulies ben...@basistech.com wrote: The multithreaded index searcher fans out across segments. How aggressively does 'optimize' reduce the number of segments? If the segment count goes way down, is there some other way to exploit multiple cores? forceMerge[1], formerly known as optimize, takes a parameter to configure how many segments should remain in the index. Regarding multi-core usage, if your query load is high enough to use all your CPUs (there are always #cores queries running in parallel), there is generally no need to use the multi-threaded IndexSearcher. The multi-threaded index searcher can however help in case all CPU power is not in use or if you care more about latency than throughput. It indeed leverages the fact that the index is split into segments to parallelize query execution, so a fully merged index will actually run the query in a single thread in any case. There is no way to make query execution efficiently use several cores on a single-segment index, so if you really want to parallelize query execution, you will have to shard the index to do at the index level what the multi-threaded IndexSearcher does at the segment level. Side notes: - A single-segment index only runs terms-dictionary-intensive queries more efficiently; it is generally discouraged to run forceMerge on an index unless this index is read-only.
- The multi-threaded index searcher only parallelizes query execution in certain cases. In particular, it never parallelizes execution when the method takes a collector. This means that if you want to use TotalHitCountCollector to count matches, you will have to do the parallelization by yourself. [1] http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html#forceMerge%28int%29 -- Adrien - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
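A minimal sketch of the shard-and-MultiReader recipe described above, assuming 4.x APIs; shardDirs (paths to the pre-built, force-merged shard indexes) and query are stand-ins built elsewhere:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

// Open each shard, wrap them all in a MultiReader, and give IndexSearcher
// an executor so the TopDocs-returning search methods fan out in parallel.
IndexReader[] readers = new IndexReader[shardDirs.length];
for (int i = 0; i < shardDirs.length; i++) {
  readers[i] = DirectoryReader.open(FSDirectory.open(new File(shardDirs[i])));
}
ExecutorService pool = Executors.newFixedThreadPool(shardDirs.length);
IndexSearcher searcher = new IndexSearcher(new MultiReader(readers), pool);
TopDocs top = searcher.search(query, 10); // one task per shard segment

Note Adrien's caveat: only the search methods that return TopDocs use the executor; the Collector-taking variants (e.g. with TotalHitCountCollector) stay single-threaded.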
How to make good use of the multithreaded IndexSearcher?
The multithreaded index searcher fans out across segments. How aggressively does 'optimize' reduce the number of segments? If the segment count goes way down, is there some other way to exploit multiple cores?
Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?
Thanks, I might pitch in. On Mon, Sep 16, 2013 at 12:58 PM, Robert Muir rcm...@gmail.com wrote: Mostly because our tokenizers like StandardTokenizer will tokenize the same way regardless of normalization form or whether its normalized at all? But for other tokenizers, such a charfilter should be useful: there is a JIRA for it, but it has some unresolved issues https://issues.apache.org/jira/browse/LUCENE-4072 On Sun, Sep 15, 2013 at 7:05 PM, Benson Margulies bimargul...@gmail.com wrote: Can anyone shed light as to why this is a token filter and not a char filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the tokenizer's lookups in its dictionaries are seeing normalized contents. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
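For anyone following along, a sketch of the current token-filter arrangement (4.x analysis APIs; the Version constant is illustrative). The normalization runs after tokenization, which is exactly the complaint here; the char filter tracked in LUCENE-4072 would run before the tokenizer instead:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // StandardTokenizer sees the raw, un-normalized characters ...
    Tokenizer source = new StandardTokenizer(Version.LUCENE_44, reader);
    // ... and normalization (NFKC plus case folding by default) happens after.
    TokenStream normalized = new ICUNormalizer2Filter(source);
    return new TokenStreamComponents(source, normalized);
  }
};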
org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?
Can anyone shed light as to why this is a token filter and not a char filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the tokenizer's lookups in its dictionaries are seeing normalized contents.
Re: PositionLengthAttribute
In Japanese, compounds are just decompositions of the input string. In other languages, compounds can manufacture entire tokens from thin air. In those cases, it's something of a question how to decide on the offsets. I think that you're right, eventually, insofar as there's some offset in the original that might as well be blamed for any given component. On Fri, Sep 6, 2013 at 9:37 PM, Robert Muir rcm...@gmail.com wrote: On Fri, Sep 6, 2013 at 9:32 PM, Benson Margulies ben...@basistech.com wrote: On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir rcm...@gmail.com wrote: its the latter. the way its designed to work i think is illustrated best in kuromoji analyzer where it heuristically decompounds nouns: if it decompounds ABCD into AB + CD, then the tokens are AB and CD. these both have posinc=1. however (to compensate for precision issue you mentioned on the other thread), it keeps the full compound as a synonym too (there are some papers benchmarking this approach for decompounding, just think of IDF etc sorting things out). so that ABCD synonym has position increment 0, and it sits at the same position as the first token (AB). but it has positionLength=2, which basically keeps the information in the chain that this synonym spans across both AB and CD. so the output is like this: AB(posinc=1,posLength=1), ABCD(posinc=0,posLength=2), CD(posinc=1, posLength=1) I suppose this works best if you actually know the offsets of the pieces. In disassembling German, this is not always straightforward. i dont really see how it has anything to do with natural languages? its just the way you represent the compound components in the tokenstream. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: PositionLengthAttribute
On Sat, Sep 7, 2013 at 8:39 AM, Robert Muir rcm...@gmail.com wrote: On Sat, Sep 7, 2013 at 7:44 AM, Benson Margulies ben...@basistech.com wrote: In Japanese, compounds are just decompositions of the input string. In other languages, compounds can manufacture entire tokens from thin air. In those cases, it's something of a question how to decide on the offsets. I think that you're right, eventually, insofar as there's some offset in the original that might as well be blamed for any given component. Why change the offsets then? Offsets are for highlighting. Let the whole compound be highlighted when it's a match in search results. It's transparent and totally accurate as to what is happening: this is why we do highlighting, to help the user make a relevance assessment about the document, not to try to assist the end user to debug the analysis chain or anything like that. Thanks, that's very helpful. I spend all my time crawling around the underside of this stuff and I lack perspective. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: LookaheadTokenFilter
nextToken() calls peekToken(). That seems to prevent my lookahead processing from seeing that item later. Am I missing something? On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies ben...@basistech.com wrote: I think that the penny just dropped, and I should not be using this class. If I call peekToken 10 times while sitting at token 0, this class will stack up all 10 of these _at token position 0_. That's not really very helpful for what I'm doing. I need to borrow code from this class and not use it. On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies ben...@basistech.com wrote: Michael, I'm apparently not fully deconfused yet. I've got a very simple incrementToken function. It calls peekToken to stack up the tokens. afterPosition is never called; I expected it to be called as each of the peeked tokens gets next-ed back out. I assume that I'm missing something simple. public boolean incrementToken() throws IOException { if (positions.getMaxPos() < 0) { peekSentence(); } return nextToken(); } On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com wrote: On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com wrote: I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis to all of the tokens of the sentence, and then changing some attributes of each token to reflect the results. The queue of tokens for a position is just a State, so there isn't an API there to set any values. So do I need to subclass Position for myself, store the additional information in there, and set the attributes as each token comes by on the output side? Yes, that sounds right. Either that or, on emitting the eventual Tokens, apply your logic there (because at that point, after restoreState, you have access to all the attr values for that token). I would be grateful for a bit more explanation of afterPosition versus incrementToken; some of the mock classes call peek from afterPosition, and I expected to see peek called in incrementToken based on the javadoc. afterPosition is where your subclass can insert new tokens. I think (it's been a while here...) you are allowed to call peekToken in afterPosition; this is necessary if your logic about inserting additional tokens leaving a given position depends on future tokens. But: are you doing any new token insertion? Or are you just tweaking the attributes of the tokens that pass through the filter? If it's the latter then this class may be overkill ... you could make a simple TokenFilter.incrementToken that just enumerates & saves all input tokens, does its processing, then returns those tokens one by one, instead. I'm not adding tokens yet, but I will be soon, so all of this isn't entirely crazy. The underlying capability here includes decompounding. (I have mixed feelings about just adding all the fragments to the token stream, as it can reduce precision, but there isn't an obvious alternative (except perhaps to suppress the super-common ones)). So, to summarize, logic might be: in incrementToken: If positions.getMaxPos() > -1, just return nextToken(). If not, loop calling peekToken to acquire a sentence, process the sentence, and attach the lemmas and compound-pieces to the Position subclass objects. in afterPosition, as each token comes 'into focus', splat the lemma from the Position into the char term attribute, and insert new tokens as needed for the compound components. Thanks, benson Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: LookaheadTokenFilter
I think I had better build you a test case for this situation, and attach it to a JIRA. On Sat, Sep 7, 2013 at 3:33 PM, Michael McCandless luc...@mikemccandless.com wrote: Something is wrong; I'm not sure what offhand, but calling peekToken 10 times should not stack all tokens @ position 0; it should stack the tokens at the positions where they occurred. Are you sure the posIncr att is sometimes 1 (i.e., the position is in fact moving forward for some tokens)? nextToken() only calls peekToken() once the lookahead buffer is exhausted. afterPosition() should be called within nextToken(), for each position, once all tokens leaving that position are done. Your use case *should* be working: inside your incrementToken() you call peekToken() over and over until you've seen the full sentence (saving away any state in your subclass of Position), then nextToken() to emit the buffered tokens, and to insert your own tokens when afterPosition() is called ... Mike McCandless http://blog.mikemccandless.com On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies ben...@basistech.com wrote: nextToken() calls peekToken(). That seems to prevent my lookahead processing from seeing that item later. Am I missing something? On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies ben...@basistech.com wrote: I think that the penny just dropped, and I should not be using this class. If I call peekToken 10 times while sitting at token 0, this class will stack up all 10 of these _at token position 0_. That's not really very helpful for what I'm doing. I need to borrow code from this class and not use it. On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies ben...@basistech.com wrote: Michael, I'm apparently not fully deconfused yet. I've got a very simple incrementToken function. It calls peekToken to stack up the tokens. afterPosition is never called; I expected it to be called as each of the peeked tokens gets next-ed back out. I assume that I'm missing something simple. public boolean incrementToken() throws IOException { if (positions.getMaxPos() < 0) { peekSentence(); } return nextToken(); } On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com wrote: On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com wrote: I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis to all of the tokens of the sentence, and then changing some attributes of each token to reflect the results. The queue of tokens for a position is just a State, so there isn't an API there to set any values. So do I need to subclass Position for myself, store the additional information in there, and set the attributes as each token comes by on the output side? Yes, that sounds right. Either that or, on emitting the eventual Tokens, apply your logic there (because at that point, after restoreState, you have access to all the attr values for that token). I would be grateful for a bit more explanation of afterPosition versus incrementToken; some of the mock classes call peek from afterPosition, and I expected to see peek called in incrementToken based on the javadoc. afterPosition is where your subclass can insert new tokens. I think (it's been a while here...) you are allowed to call peekToken in afterPosition; this is necessary if your logic about inserting additional tokens leaving a given position depends on future tokens. But: are you doing any new token insertion? Or are you just tweaking the attributes of the tokens that pass through the filter? If it's the latter then this class may be overkill ... you could make a simple TokenFilter.incrementToken that just enumerates & saves all input tokens, does its processing, then returns those tokens one by one, instead. I'm not adding tokens yet, but I will be soon, so all of this isn't entirely crazy. The underlying capability here includes decompounding. (I have mixed feelings about just adding all the fragments to the token stream, as it can reduce precision, but there isn't an obvious alternative (except perhaps to suppress the super-common ones)). So, to summarize, logic might be: in incrementToken: If positions.getMaxPos() > -1, just return nextToken(). If not, loop calling peekToken to acquire a sentence, process the sentence, and attach the lemmas and compound-pieces to the Position subclass objects. in afterPosition, as each token comes 'into focus', splat the lemma from the Position into the char term attribute, and insert new tokens as needed for the compound components. Thanks, benson Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: LookaheadTokenFilter
LUCENE-5202. It seems to show the problem of the extra peek. I'm still struggling to make sense of the 'problem' of not always calling afterPosition(); that may be entirely my own confusion. On Sat, Sep 7, 2013 at 4:21 PM, Michael McCandless luc...@mikemccandless.com wrote: That would be awesome, thanks! Mike McCandless http://blog.mikemccandless.com On Sat, Sep 7, 2013 at 3:40 PM, Benson Margulies ben...@basistech.com wrote: I think I had better build you a test case for this situation, and attach it to a JIRA. On Sat, Sep 7, 2013 at 3:33 PM, Michael McCandless luc...@mikemccandless.com wrote: Something is wrong; I'm not sure what offhand, but calling peekToken 10 times should not stack all tokens @ position 0; it should stack the tokens at the positions where they occurred. Are you sure the posIncr att is sometimes 1 (i.e., the position is in fact moving forward for some tokens)? nextToken() only calls peekToken() once the lookahead buffer is exhausted. afterPosition() should be called within nextToken(), for each position, once all tokens leaving that position are done. Your use case *should* be working: inside your incrementToken() you call peekToken() over and over until you've seen the full sentence (saving away any state in your subclass of Position), then nextToken() to emit the buffered tokens, and to insert your own tokens when afterPosition() is called ... Mike McCandless http://blog.mikemccandless.com On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies ben...@basistech.com wrote: nextToken() calls peekToken(). That seems to prevent my lookahead processing from seeing that item later. Am I missing something? On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies ben...@basistech.com wrote: I think that the penny just dropped, and I should not be using this class. If I call peekToken 10 times while sitting at token 0, this class will stack up all 10 of these _at token position 0_. That's not really very helpful for what I'm doing. I need to borrow code from this class and not use it. On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies ben...@basistech.com wrote: Michael, I'm apparently not fully deconfused yet. I've got a very simple incrementToken function. It calls peekToken to stack up the tokens. afterPosition is never called; I expected it to be called as each of the peeked tokens gets next-ed back out. I assume that I'm missing something simple. public boolean incrementToken() throws IOException { if (positions.getMaxPos() < 0) { peekSentence(); } return nextToken(); } On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com wrote: On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com wrote: I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis to all of the tokens of the sentence, and then changing some attributes of each token to reflect the results. The queue of tokens for a position is just a State, so there isn't an API there to set any values. So do I need to subclass Position for myself, store the additional information in there, and set the attributes as each token comes by on the output side? Yes, that sounds right. Either that or, on emitting the eventual Tokens, apply your logic there (because at that point, after restoreState, you have access to all the attr values for that token). I would be grateful for a bit more explanation of afterPosition versus incrementToken; some of the mock classes call peek from afterPosition, and I expected to see peek called in incrementToken based on the javadoc. afterPosition is where your subclass can insert new tokens. I think (it's been a while here...) you are allowed to call peekToken in afterPosition; this is necessary if your logic about inserting additional tokens leaving a given position depends on future tokens. But: are you doing any new token insertion? Or are you just tweaking the attributes of the tokens that pass through the filter? If it's the latter then this class may be overkill ... you could make a simple TokenFilter.incrementToken that just enumerates & saves all input tokens, does its processing, then returns those tokens one by one, instead. I'm not adding tokens yet, but I will be soon, so all of this isn't entirely crazy. The underlying capability here includes decompounding. (I have mixed feelings about just adding all the fragments to the token stream, as it can reduce precision, but there isn't an obvious alternative (except perhaps to suppress the super-common ones)). So, to summarize, logic might be: in incrementToken: If positions.getMaxPos() > -1, just return nextToken(). If not, loop calling peekToken to acquire a sentence, process the sentence, and attach the lemmas and compound-pieces to the Position subclass objects.
Re: LookaheadTokenFilter
On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com wrote: I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis to all of the tokens of the sentence, and then changing some attributes of each token to reflect the results. The queue of tokens for a position is just a State, so there isn't an API there to set any values. So do I need to subclass Position for myself, store the additional information in there, and set the attributes as each token comes by on the output side? Yes, that sounds right. Either that or, on emitting the eventual Tokens, apply your logic there (because at that point, after restoreState, you have access to all the attr values for that token). I would be grateful for a bit more explanation of afterPosition versus incrementToken; some of the mock classes call peek from afterPosition, and I expected to see peek called in incrementToken based on the javadoc. afterPosition is where your subclass can insert new tokens. I think (it's been a while here...) you are allowed to call peekToken in afterPosition; this is necessary if your logic about inserting additional tokens leaving a given position depends on future tokens. But: are you doing any new token insertion? Or are you just tweaking the attributes of the tokens that pass through the filter? If it's the latter then this class may be overkill ... you could make a simple TokenFilter.incrementToken that just enumerates & saves all input tokens, does its processing, then returns those tokens one by one, instead. I'm not adding tokens yet, but I will be soon, so all of this isn't entirely crazy. The underlying capability here includes decompounding. (I have mixed feelings about just adding all the fragments to the token stream, as it can reduce precision, but there isn't an obvious alternative (except perhaps to suppress the super-common ones)). So, to summarize, logic might be: in incrementToken: If positions.getMaxPos() > -1, just return nextToken(). If not, loop calling peekToken to acquire a sentence, process the sentence, and attach the lemmas and compound-pieces to the Position subclass objects. in afterPosition, as each token comes 'into focus', splat the lemma from the Position into the char term attribute, and insert new tokens as needed for the compound components. Thanks, benson Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
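To make the plan above concrete, a hypothetical sketch against the test-framework LookaheadTokenFilter API (Position subclass, peekToken/nextToken/afterPosition); SentencePosition, peekSentence, and the sentence-end detection are stand-ins, and the per-sentence refill logic is elided:

import java.io.IOException;
import org.apache.lucene.analysis.LookaheadTokenFilter; // test-framework jar
import org.apache.lucene.analysis.TokenStream;

public final class SentenceLookaheadFilter
    extends LookaheadTokenFilter<SentenceLookaheadFilter.SentencePosition> {

  // Position subclass carrying per-token results of the sentence analysis.
  public static final class SentencePosition extends LookaheadTokenFilter.Position {
    String lemma;

    @Override
    public void reset() {
      super.reset();
      lemma = null;
    }
  }

  public SentenceLookaheadFilter(TokenStream input) {
    super(input);
  }

  @Override
  protected SentencePosition newPosition() {
    return new SentencePosition();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (positions.getMaxPos() < 0) {
      peekSentence(); // fill the buffer up to the end-of-sentence marker
    }
    return nextToken(); // replays buffered tokens; calls afterPosition()
  }

  private void peekSentence() throws IOException {
    // Loop on peekToken() until a sentence-boundary attribute is seen, run
    // the sentence-level analysis, and store a lemma on each SentencePosition.
  }

  @Override
  protected void afterPosition() throws IOException {
    // Called as each buffered position comes into focus: overwrite the term
    // attribute with the stored lemma, and insert new tokens (insertToken())
    // for compound components as needed.
  }
}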
PositionLengthAttribute
I'm confused by the comment about compound components here. If a single token fissions into multiple tokens, then what belongs in the PositionLengthAttribute? I'm wanting to store a fraction in here! Or is the idea to store N in the 'mother' token and then '1' in each of the babies? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: LookaheadTokenFilter
Michael, I'm apparently not fully deconfused yet. I've got a very simple incrementToken function. It calls peekToken to stack up the tokens. afterPosition is never called; I expected it to be called as each of the peeked tokens gets next-ed back out. I assume that I'm missing something simple. public boolean incrementToken() throws IOException { if (positions.getMaxPos() < 0) { peekSentence(); } return nextToken(); } On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com wrote: On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com wrote: I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis to all of the tokens of the sentence, and then changing some attributes of each token to reflect the results. The queue of tokens for a position is just a State, so there isn't an API there to set any values. So do I need to subclass Position for myself, store the additional information in there, and set the attributes as each token comes by on the output side? Yes, that sounds right. Either that or, on emitting the eventual Tokens, apply your logic there (because at that point, after restoreState, you have access to all the attr values for that token). I would be grateful for a bit more explanation of afterPosition versus incrementToken; some of the mock classes call peek from afterPosition, and I expected to see peek called in incrementToken based on the javadoc. afterPosition is where your subclass can insert new tokens. I think (it's been a while here...) you are allowed to call peekToken in afterPosition; this is necessary if your logic about inserting additional tokens leaving a given position depends on future tokens. But: are you doing any new token insertion? Or are you just tweaking the attributes of the tokens that pass through the filter? If it's the latter then this class may be overkill ... you could make a simple TokenFilter.incrementToken that just enumerates & saves all input tokens, does its processing, then returns those tokens one by one, instead. I'm not adding tokens yet, but I will be soon, so all of this isn't entirely crazy. The underlying capability here includes decompounding. (I have mixed feelings about just adding all the fragments to the token stream, as it can reduce precision, but there isn't an obvious alternative (except perhaps to suppress the super-common ones)). So, to summarize, logic might be: in incrementToken: If positions.getMaxPos() > -1, just return nextToken(). If not, loop calling peekToken to acquire a sentence, process the sentence, and attach the lemmas and compound-pieces to the Position subclass objects. in afterPosition, as each token comes 'into focus', splat the lemma from the Position into the char term attribute, and insert new tokens as needed for the compound components. Thanks, benson Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: LookaheadTokenFilter
I think that the penny just dropped, and I should not be using this class. If I call peekToken 10 times while sitting at token 0, this class will stack up all 10 of these _at token position 0_. That's not really very helpful for what I'm doing. I need to borrow code from this class and not use it. On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies ben...@basistech.com wrote: Michael, I'm apparently not fully deconfused yet. I've got a very simple incrementToken function. It calls peekToken to stack up the tokens. afterPosition is never called; I expected it to be called as each of the peeked tokens gets next-ed back out. I assume that I'm missing something simple. public boolean incrementToken() throws IOException { if (positions.getMaxPos() < 0) { peekSentence(); } return nextToken(); } On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com wrote: On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com wrote: I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis to all of the tokens of the sentence, and then changing some attributes of each token to reflect the results. The queue of tokens for a position is just a State, so there isn't an API there to set any values. So do I need to subclass Position for myself, store the additional information in there, and set the attributes as each token comes by on the output side? Yes, that sounds right. Either that or, on emitting the eventual Tokens, apply your logic there (because at that point, after restoreState, you have access to all the attr values for that token). I would be grateful for a bit more explanation of afterPosition versus incrementToken; some of the mock classes call peek from afterPosition, and I expected to see peek called in incrementToken based on the javadoc. afterPosition is where your subclass can insert new tokens. I think (it's been a while here...) you are allowed to call peekToken in afterPosition; this is necessary if your logic about inserting additional tokens leaving a given position depends on future tokens. But: are you doing any new token insertion? Or are you just tweaking the attributes of the tokens that pass through the filter? If it's the latter then this class may be overkill ... you could make a simple TokenFilter.incrementToken that just enumerates & saves all input tokens, does its processing, then returns those tokens one by one, instead. I'm not adding tokens yet, but I will be soon, so all of this isn't entirely crazy. The underlying capability here includes decompounding. (I have mixed feelings about just adding all the fragments to the token stream, as it can reduce precision, but there isn't an obvious alternative (except perhaps to suppress the super-common ones)). So, to summarize, logic might be: in incrementToken: If positions.getMaxPos() > -1, just return nextToken(). If not, loop calling peekToken to acquire a sentence, process the sentence, and attach the lemmas and compound-pieces to the Position subclass objects. in afterPosition, as each token comes 'into focus', splat the lemma from the Position into the char term attribute, and insert new tokens as needed for the compound components. Thanks, benson Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: PositionLengthAttribute
On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir rcm...@gmail.com wrote: its the latter. the way its designed to work i think is illustrated best in kuromoji analyzer where it heuristically decompounds nouns: if it decompounds ABCD into AB + CD, then the tokens are AB and CD. these both have posinc=1. however (to compensate for precision issue you mentioned on the other thread), it keeps the full compound as a synonym too (there are some papers benchmarking this approach for decompounding, just think of IDF etc sorting things out). so that ABCD synonym has position increment 0, and it sits at the same position as the first token (AB). but it has positionLength=2, which basically keeps the information in the chain that this synonym spans across both AB and CD. so the output is like this: AB(posinc=1,posLength=1), ABCD(posinc=0,posLength=2), CD(posinc=1, posLength=1) I suppose this works best if you actually know the offsets of the pieces. In disassembling German, this is not always straightforward. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
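A small illustration of those attribute values, using the 4.x Token convenience class; the compound "ABCD" and its offsets are hypothetical:

import org.apache.lucene.analysis.Token;

// The three tokens for a compound ABCD decompounded into AB + CD, with the
// whole compound kept as a synonym that spans both component positions.
Token ab = new Token("AB", 0, 2);
ab.setPositionIncrement(1);
ab.setPositionLength(1);

Token abcd = new Token("ABCD", 0, 4); // same position as AB
abcd.setPositionIncrement(0);
abcd.setPositionLength(2);            // spans AB and CD

Token cd = new Token("CD", 2, 4);
cd.setPositionIncrement(1);
cd.setPositionLength(1);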
LookaheadTokenFilter
This useful-looking item is in the test-framework jar. Is there some subtle reason that it isn't in the common analyzer jar? Some reason why I'd regret using it?
LookaheadTokenFilter
I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis to all of the tokens of the sentence, and then changing some attributes of each token to reflect the results. The queue of tokens for a position is just a State, so there isn't an API there to set any values. So do I need to subclass Position for myself, store the additional information in there, and set the attributes as each token comes by on the output side? I would be grateful for a bit more explanation of afterPosition versus incrementToken; some of the mock classes call peek from afterPosition, and I expected to see peek called in incrementToken based on the javadoc.
Re: Issue with documentation for org.apache.lucene.analysis.synonym.SynonymMap.Builder.add() method
On Thu, Sep 6, 2012 at 1:59 PM, Robert Muir rcm...@gmail.com wrote: Thanks for reporting this Mark. I think it was not intended to have actual null characters here (or probably anywhere in javadocs). Our javadocs checkers should be failing on stuff like this... On Thu, Sep 6, 2012 at 1:52 PM, Mark Parker godef...@gmail.com wrote: I'm building documentation from the Lucene 4.0.0-BETA source (though this was also an issue with the ALPHA source), and the output has null characters in it. I believe that this is because the source looks like this: /** * Add a phrase-phrase synonym mapping. * Phrases are character sequences where words are * separated with character zero (\u0000). Empty words * (two \u0000s in a row) are not allowed in the input nor * the output! * * @param input input phrase * @param output output phrase * @param includeOrig true if the original should be included */ These \u0000 characters are converted to null (\0) characters in the output, which are invalid in XML (I'm outputting XML). Indeed, this is a problem in the built documentation at the Apache Lucene site ( http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/synonym/SynonymMap.Builder.html ) where the documentation looks like this (in my browser): Converted to U+0000 by what, I wonder? Javadoc shouldn't be doing that. If it does, I wonder if we need \\u0000 instead? Add a phrase-phrase synonym mapping. Phrases are character sequences where words are separated with character zero (). Empty words (two s in a row) are not allowed in the input nor the output! The actual HTML file does have null characters at the two locations, which may be technically correct, but not very helpful. I believe the \u0000 in the source ought to be escaped in some way, so that something more meaningful than \0 ends up in the output. I'd submit a patch, just for the prestige of it, but I don't have the slightest idea what the change should be, not being a Java guy at all. For those interested in why I'm messing with this, then, I'm using IKVM to convert the Java Lucene libraries to .NET assemblies (well, one assembly) and converting the javadoc comments to XML documentation for good IntelliSense in Visual Studio. It works wonderfully, and we use it in very successful commercial software! Note that I'm not subscribed to the list, so please CC me if there are questions. Mark -- lucidworks.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
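A small sketch of the API in question, assuming Lucene 4.x: in practice you rarely type the separator yourself, because SynonymMap.Builder.join assembles multi-word phrases with the \u0000 word separator for you:

import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

// Map the phrase "domain name system" to "dns". join() inserts the U+0000
// word separators the add() javadoc describes; build() throws IOException.
SynonymMap.Builder builder = new SynonymMap.Builder(true); // true = dedup
CharsRef input = SynonymMap.Builder.join(
    new String[] {"domain", "name", "system"}, new CharsRef());
CharsRef output = SynonymMap.Builder.join(new String[] {"dns"}, new CharsRef());
builder.add(input, output, true); // true = keep the original phrase too
SynonymMap map = builder.build();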
Payload class
I'm failing to find advice in MIGRATE.txt on how to replace 'new Payload(...)' in migrating to 4.0. What am I missing?
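For the record, the 4.0 replacement is a plain BytesRef on PayloadAttribute; a sketch, assuming code inside a TokenFilter that previously built new Payload(bytes):

import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Inside a TokenFilter: attach a payload to the current token.
PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
byte[] bytes = {0x01, 0x02}; // whatever you used to wrap in a Payload
payloadAtt.setPayload(new BytesRef(bytes)); // 3.x: new Payload(bytes)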
ResourceLoader?
Our Solr 3.x code used init(ResourceLoader) and then called the loader to read a file. What's the new approach to reading content from files in the 'usual place'?
Re: ResourceLoader?
That's what I meant, thanks. On Wed, Aug 29, 2012 at 10:20 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 29, 2012 at 10:10 AM, Benson Margulies ben...@basistech.com wrote: Our Solr 3.x code used init(ResourceLoader) and then called the loader to read a file. What's the new approach to reading content from files in the 'usual place'? I'm not aware of init(ResourceLoader), only inform(ResourceLoader). is that what you meant? I added some javadocs on the lifecycle of these factories the other day (please review, possible doc bugs!): https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html Here are some examples: Parses a tab-separated file (using getLines: UTF-8): http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilterFactory.java Parses a file of its own format (using specified encoding): http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizerFactory.java -- lucidworks.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
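A minimal sketch of the pattern those examples use, assuming the 4.0-era factory API (init(Map) for the arguments, inform(ResourceLoader) for file loading); the "mapping" argument name and MyMappingFilter are made up:

import java.io.IOException;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoaderAware;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class MyMappingFilterFactory extends TokenFilterFactory
    implements ResourceLoaderAware {

  private String mappingFile;
  private List<String> lines;

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    mappingFile = args.get("mapping"); // hypothetical parameter name
  }

  @Override
  public void inform(ResourceLoader loader) throws IOException {
    // getLines reads the resource as UTF-8 lines, skipping comments.
    lines = getLines(loader, mappingFile);
  }

  @Override
  public TokenStream create(TokenStream input) {
    return new MyMappingFilter(input, lines); // hypothetical filter class
  }
}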
Re: ResourceLoader?
I'm confused. Isn't inform/ResourceLoader deprecated? But your examples use it? On Wed, Aug 29, 2012 at 10:20 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 29, 2012 at 10:10 AM, Benson Margulies ben...@basistech.com wrote: Our Solr 3.x code used init(ResourceLoader) and then called the loader to read a file. What's the new approach to reading content from files in the 'usual place'? I'm not aware of init(ResourceLoader), only inform(ResourceLoader). is that what you meant? I added some javadocs on the lifecycle of these factories the other day (please review, possible doc bugs!): https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html Here are some examples: Parses a tab-separated file (using getLines: UTF-8): http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilterFactory.java Parses a file of its own format (using specified encoding): http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizerFactory.java -- lucidworks.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Using a char filter in solr createComponents
I'm close to the bottom of my list here. I've got an Analyzer that, in 3.1, set up a CharFilter in the tokenStream method. So now I have to migrate that to createComponents. Can someone give me a shove in the right direction?
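In 4.x the Analyzer base class gives char filters their own hook: initReader wraps the Reader before it reaches the Tokenizer. A migration sketch, with StandardTokenizer and a trivial NormalizeCharMap standing in for the real components:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

NormalizeCharMap.Builder mapBuilder = new NormalizeCharMap.Builder();
mapBuilder.add("\u00e9", "e"); // example mapping
final NormalizeCharMap charMap = mapBuilder.build();

Analyzer analyzer = new Analyzer() {
  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // The char filter goes here now, not in createComponents.
    return new MappingCharFilter(charMap, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
    return new TokenStreamComponents(source);
  }
};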
Re: ResourceLoader?
On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies ben...@basistech.com wrote: I'm confused. Isn't inform/ResourceLoader deprecated? But your example use it? Where is it deprecated? What does the deprecation message say? I see. It moved from one package to another. Sorry for the noise. -- lucidworks.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: ResourceLoader?
Hang on: [deprecation] org.apache.solr.util.plugin.ResourceLoaderAware in org.apache.solr.util.plugin has been deprecated On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies ben...@basistech.com wrote: I'm confused. Isn't inform/ResourceLoader deprecated? But your example use it? Where is it deprecated? What does the deprecation message say? -- lucidworks.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: ResourceLoader?
On Wed, Aug 29, 2012 at 10:42 AM, Robert Muir rcm...@gmail.com wrote: Right and what does the @deprecated message say :) Yes, indeed, sorry. I got caught in a maze of twisty passages and my brain turned off. I'm better now. On Wed, Aug 29, 2012 at 10:40 AM, Benson Margulies ben...@basistech.com wrote: Hang on: [deprecation] org.apache.solr.util.plugin.ResourceLoaderAware in org.apache.solr.util.plugin has been deprecated On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies ben...@basistech.com wrote: I'm confused. Isn't inform/ResourceLoader deprecated? But your examples use it? Where is it deprecated? What does the deprecation message say? -- lucidworks.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
reset versus setReader on TokenStream
I've read the javadoc through a few times, but I confess that I'm still feeling dense. Are all tokenizers responsible for implementing some way of retaining the contents of their reader, so that a call to reset without a call to setReader rewinds? I note that CharTokenizer doesn't implement #reset, which leads me to suspect that I'm not responsible for the rewind behavior.
Re: reset versus setReader on TokenStream
On Wed, Aug 29, 2012 at 3:37 PM, Robert Muir rcm...@gmail.com wrote: ok, lets help improve it: I think these have likely always been confusing. before they were both reset: reset() and reset(Reader), even though they are unrelated. I thought the rename would help this :) Does the TokenStream workflow here help? http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html Basically reset() is a mandatory thing the consumer must call. it just means 'reset any mutable state so you can be reused for processing again'. I really did read this. setReader I get; I don't understand what reset accomplishes. What does it mean to reuse a TokenStream without calling setReader to supply a new input? If it means reuse the old input, who does the rewinding? This is something on any TokenStream: Tokenizers, TokenFilters, or even some direct descendent you make that parses byte arrays, or whatever. This means if you are keeping some state across tokens (like stopfilter's #skippedTokens). here is where you would set that = 0 again. setReader(Reader) is only on Tokenizer, it means replace the Reader with a different one to be processed. The fact that CharTokenizer is doing 'reset()-like-stuff' in here is bogus IMO, but I dont think it will cause any bugs. Don't emulate it :) On Wed, Aug 29, 2012 at 3:29 PM, Benson Margulies ben...@basistech.com wrote: I've read the javadoc through a few times, but I confess that I'm still feeling dense. Are all tokenizers responsible for implementing some way of retaining the contents of their reader, so that a call to reset without a call to setReader rewinds? I note that CharTokenizer doesn't implement #reset, which leads me to suspect that I'm not responsible for the rewind behavior. -- lucidworks.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
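Concretely, the contract amounts to something like this minimal filter; the token counter is an arbitrary example of per-stream mutable state:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// A filter with mutable per-stream state: reset() puts the state back to
// its initial values so the same instance can be reused on new input.
public final class CountingFilter extends TokenFilter {
  private int tokensSeen;

  public CountingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      tokensSeen++;
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset(); // always chain to the wrapped stream
    tokensSeen = 0;
  }
}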
Re: reset versus setReader on TokenStream
Some interlinear commentary on the doc. * Resets this stream to the beginning. To me this implies a rewind. As previously noted, I don't see how this works for the existing implementations. * As all TokenStreams must be reusable, * any implementations which have state that needs to be reset between usages * of the TokenStream, must implement this method. Note that if your TokenStream * caches tokens and feeds them back again after a reset, What's the alternative? What happens with all the existing Tokenizers that have no special implementation of #reset()? * it is imperative * that you clone the tokens when you store them away (on the first pass) as * well as when you return them (on future passes after {@link #reset()}).
Re: reset versus setReader on TokenStream
I think I'm beginning to get the idea. Is the following plausible? At the bottom of the stack, there's an actual source of data -- like a tokenizer. For one of those, reset() is a bit silly, and something like setReader is the brains of the operation. Some number of other components may be stacked up on top of the source of data, and these may have local state. Calling #reset prepares them for new data to emerge from the actual source of data.
Re: reset versus setReader on TokenStream
If I'm following, you've created a division of labor between setReader and reset. We have a tokenizer that has a good deal of state, since it has to split the input into chunks. If I'm following here, you'd recommend that we do nothing special in setReader, but have #reset fix up all the state on the assumption that we are starting from the beginning of something, and we'd reinitialize our chunker over what was sitting in the protected 'input'. If someone called #setReader and neglected to call #reset, awful things would happen, but you've warned them. To me, it seemed natural to overload #setReader so that our tokenizer was in a consistent state once it was called. It occurs to me to wonder about order: if #reset is called before #setReader, I'm up the creek unless I copy my reset implementation into a local override of #setReader.
Re: DisjunctionMaxQuery and scoring
Uwe and Robert, Thanks. David and I are two peas in one pod here at Basis. --benson On Fri, Apr 20, 2012 at 2:33 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, Ah sorry, I misunderstood, you wanted to score the duplicate match lower! To achieve this, you have to change the coord function in your similarity/BooleanWeight used for this query. Either way: If you want a group of terms that get only one score if at least one of the terms match (SQL IN), but not add them at all, DisjunctionMaxQuery is fine. I think this is what Benson asked for. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Friday, April 20, 2012 8:16 AM To: java-user@lucene.apache.org; david_murgatr...@hotmail.com Subject: RE: DisjunctionMaxQuery and scoring Hi, I think BooleanQuery bq = new BooleanQuery(false); doesn't quite accomplish the desired name IN (dick, rich) scoring behavior. This is because (name:dick | name:rich) with coord=false would score the 'document' Dick Rich higher than Rich because the former has two term matches and the latter only one. In contrast, I think the desire is that one and only one of the terms in the document match those in the BooleanQuery so that Rich would score higher than Dick Rich, given document length normalization. It's almost like a desire for BooleanQuery bq = new BooleanQuery(false); bq.set*Maximum*NumberShouldMatch(1); In that case DisjunctionMaxQuery is the way to go (it will only count the hit with the highest score and not add scores; coord or not coord doesn't matter here). - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
DisjunctionMaxQuery and scoring
I am trying to solve a problem using DisjunctionMaxQuery. Consider a query like: a:b OR c:d OR e:f OR ... name:richard OR name:dick OR name:dickie OR name:rich ... At most, one of the richard names matches. So the match score gets dragged down by the long list of things that don't match, as the list can get quite long. It seemed to me, upon reading the documentation, that I could cure this problem by creating a query tree that used DisjunctionMaxQuery around all those nicknames. However, when I built a boolean query that had, as a clause, a DisjunctionMaxQuery in the place of a pile of these individual Term queries, the score and the explanation did not change at all -- in particular, the coord term shows the same number of total terms. So it looks as if the children of the disjunction still count. Is there a way to control that term? Or a better way to express this? Thinking SQL for a moment, what I'm trying to express is name IN (richard, dick, dickie, rich) as a single term query. Reading the javadoc, I am seeing MultiTermQuery, and I'm not sure that it is what we want. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
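A sketch of the DisjunctionMaxQuery arrangement described here, against the 4.x query API; with a tie-breaker of 0, only the best-matching nickname contributes to the score (though, as observed above, the coord/maxCoord accounting is a separate question):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.TermQuery;

// One clause for the whole nickname group; tieBreakerMultiplier = 0 means
// only the highest-scoring disjunct counts, like SQL's name IN (...).
DisjunctionMaxQuery nicknames = new DisjunctionMaxQuery(0.0f);
for (String name : new String[] {"richard", "dick", "dickie", "rich"}) {
  nicknames.add(new TermQuery(new Term("name", name)));
}
BooleanQuery top = new BooleanQuery();
top.add(nicknames, BooleanClause.Occur.SHOULD);
// ... add the other clauses (a:b, c:d, e:f) as usual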
Re: DisjunctionMaxQuery and scoring
On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies bimargul...@gmail.com wrote: I am trying to solve a problem using DisjunctionMaxQuery. Consider a query like: a:b OR c:d OR e:f OR ... name:richard OR name:dick OR name:dickie OR name:rich ... At most, one of the richard names matches. So the match score gets dragged down by the long list of things that don't match, as the list can get quite long. It seemed to me, upon reading the documentation, that I could cure this problem by creating a query tree that used DisjunctionMaxQuery around all those nicknames. However, when I built a boolean query that had, as a clause, a DisjunctionMaxQuery in the place of a pile of these individual Term queries, the score and the explanation did not change at all -- in particular, the coord term shows the same number of total terms. So it looks as if the children of the disjunction still count. Is there a way to control that term? Or a better way to express this? Thinking SQL for a moment, what I'm trying to express is name IN (richard, dick, dickie, rich) I think you just want to disable coord() here? You can do this for that particular boolean query by passing true to the ctor: public BooleanQuery(boolean disableCoord) Rob, How do nested queries work with respect to this? If I build a boolean query one of whose clauses is a BooleanQuery with coord turned off, does just the inside of the nested query get left out of 'coord'? If so, then your answer certainly seems to be what the doctor ordered. --benson -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: DisjunctionMaxQuery and scoring
Turning on disableCoord for a nested boolean query does not seem to change the overall maxCoord term as displayed in explain. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: DisjunctionMaxQuery and scoring
On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies bimargul...@gmail.com wrote: On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies bimargul...@gmail.com wrote: I am trying to solve a problem using DisjunctionMaxQuery. Consider a query like: a:b OR c:d OR e:f OR ... name:richard OR name:dick OR name:dickie OR name:rich ... At most, one of the richard names matches. So the match score gets dragged down by the long list of things that don't match, as the list can get quite long. It seemed to me, upon reading the documentation, that I could cure this problem by creating a query tree that used DisjunctionMaxQuery around all those nicknames. However, when I built a boolean query that had, as a clause, a DisjunctionMaxQuery in the place of a pile of these individual Term queries, the score and the explanation did not change at all -- in particular, the coord term shows the same number of total terms. So it looks as if the children of the disjunction still count. Is there a way to control that term? Or a better way to express this? Thinking SQL for a moment, what I'm trying to express is name IN (richard, dick, dickie, rich) I think you just want to disable coord() here? You can do this for that particular boolean query by passing true to the ctor: public BooleanQuery(boolean disableCoord) Rob, How do nested queries work with respect to this? If I build a boolean query one of whose clauses is a BooleanQuery with coord turned off, does just the nested query insides get left out of 'coord'? If so, then your answer certainly seems to be what the doctor ordered. it applies only to that query itself. So if this BQ is a clause to another BQ that has coord enabled, that would not change the top-level BQ's coord. Note: if you don't want coord at all, then you can also plug in a Similarity that returns 1, or pick another Similarity like BM25: in trunk only the vector space impl even does anything for coord() Robert, I'm sorry that my density is approaching lead. My problem is that I want coord, but I want to control which terms are counted and which are not. I suppose I can accomplish this with my own scorer. My hope was that there was a way to express This group of terms counts as one for coord. In other words, for a subset of fields in the query, I want to scale the entire score by the fraction of them that match. Another way to think about this, which might be no use at all, is to wonder: is there a way to charge a score penalty for failure to match a particular query term? That would, from another direction, address the underlying effect I'm trying to get. -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: DisjunctionMaxQuery and scoring
On Thu, Apr 19, 2012 at 5:10 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 5:05 PM, Benson Margulies bimargul...@gmail.com wrote: On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies bimargul...@gmail.com wrote: On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies bimargul...@gmail.com wrote: I am trying to solve a problem using DisjunctionMaxQuery. Consider a query like: a:b OR c:d OR e:f OR ... name:richard OR name:dick OR name:dickie OR name:rich ... At most, one of the richard names matches. So the match score gets dragged down by the long list of things that don't match, as the list can get quite long. It seemed to me, upon reading the documentation, that I could cure this problem by creating a query tree that used DisjunctionMaxQuery around all those nicknames. However, when I built a boolean query that had, as a clause, a DisjunctionMaxQuery in the place of a pile of these individual Term queries, the score and the explanation did not change at all -- in particular, the coord term shows the same number of total terms. So it looks as if the children of the disjunction still count. Is there a way to control that term? Or a better way to express this? Thinking SQL for a moment, what I'm trying to express is name IN (richard, dick, dickie, rich) I think you just want to disable coord() here? You can do this for that particular boolean query by passing true to the ctor: public BooleanQuery(boolean disableCoord) Rob, How do nested queries work with respect to this? If I build a boolean query one of whose clauses is a BooleanQuery with coord turned off, does just the nested query insides get left out of 'coord'? If so, then your answer certainly seems to be what the doctor ordered. it applies only to that query itself. So if this BQ is a clause to another BQ that has coord enabled, that would not change the top-level BQ's coord. Note: if you don't want coord at all, then you can also plug in a Similarity that returns 1, or pick another Similarity like BM25: in trunk only the vector space impl even does anything for coord() Robert, I'm sorry that my density is approaching lead. My problem is that I want coord, but I want to control which terms are counted and which are not. I suppose I can accomplish this with my own scorer. My hope was that there was a way to express This group of terms counts as one for coord. So just structure your boolean query appropriately? BQ1(coord=true) BQ2(coord=false): 25 terms BQ3(coord=false): 87 terms BQ1's coord is based on how many subscorers match (out of 2, BQ2 and BQ3). If both match its 2/2 otherwise 1/2. But in this example BQ2 and BQ3 disable coord themselves, hiding the fact they accept 25 and 87 terms respectively and appearing as a single sub for coord(). Does this make sense? you can extend this idea to control this however you want by structuring the BQ appropriately so your BQ's with synonyms have coord=0 Robert, This makes perfect sense, it is what I thought you meant to begin with. I tried it and thought that it did not work. Or, perhaps, I am misreading the 'explain' output. Or, more likely, I goofed altogether. I'll go back and recheck my results and post some explain output if I can't find my mistake. 
--benson -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
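Concretely, the structure Robert describes looks like this (same imports as the sketch earlier in the thread; terms abbreviated):

    // Each synonym group disables coord internally, so it contributes at
    // most one 'match' to the enclosing query's coord fraction.
    BooleanQuery nicknames = new BooleanQuery(true); // true = disable coord
    nicknames.add(new TermQuery(new Term("name", "richard")), Occur.SHOULD);
    nicknames.add(new TermQuery(new Term("name", "dick")), Occur.SHOULD);
    nicknames.add(new TermQuery(new Term("name", "rich")), Occur.SHOULD);

    BooleanQuery top = new BooleanQuery(); // coord enabled (the default)
    top.add(new TermQuery(new Term("a", "b")), Occur.SHOULD);
    top.add(nicknames, Occur.SHOULD); // top-level coord sees 2 clauses: 1/2 or 2/2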
Re: DisjunctionMaxQuery and scoring
I see why I'm so confused, but I think I need to construct a simpler test case. My top-level BooleanQuery, which has disableCoord=false, has 22 clauses. All but three are ordinary SHOULD TermQueries. The remainder are a spanNear, a nested BooleanQuery, and an empty PhraseQuery (that's a bug). However, at the end of the explain trace, I see: 0.45 = coord(9/20) I think that my nested Boolean, for which I've been flipping coord on and off to see what happens, is somehow not participating at all. So switching its coord on and off has no effect. Why 20? Why not 22? Is this just an explain quirk? Should I shove all this code up to 3.6 from 2.9.3 before bugging you further? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: DisjunctionMaxQuery and scoring
FWIW, there seems to be an explain bug in 2.9.1 that is fixed in 3.6.0, so I'm no longer confused about the actual behavior. On Thu, Apr 19, 2012 at 8:32 PM, David Murgatroyd dmu...@gmail.com wrote: [apologies for the earlier errant send] I think BooleanQuery bq = new BooleanQuery(false); doesn't quite accomplish the desired name IN (dick, rich) scoring behavior. This is because (name:dick | name:rich) with coord=false would score the 'document' Dick Rich higher than Rich because the former has two term matches and the latter only one. In contrast, I think the desire is that one and only one of the terms in the document match those in the BooleanQuery so that Rich would score higher than Dick Rich, given document length normalization. It's almost like a desire for BooleanQuery bq = new BooleanQuery(false); bq.set*Maximum*NumberShouldMatch(1); Is there a good way to accomplish this? On Thu, Apr 19, 2012 at 7:37 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 6:36 PM, Benson Margulies bimargul...@gmail.com wrote: I see why I'm so confused, but I think I need to construct a simpler test case. My top-level BooleanQuery, which has disableCoord=false, has 22 clauses. All but three are ordinary SHOULD TermQueries. The remainder are a spanNear, a nested BooleanQuery, and an empty PhraseQuery (that's a bug). However, at the end of the explain trace, I see: 0.45 = coord(9/20) I think that my nested Boolean, for which I've been flipping coord on and off to see what happens, is somehow not participating at all. So switching its coord on and off has no effect. Why 20? Why not 22? Is this just an explain quirk? I am not sure (also not sure I understand your example totally), but at the same time it could be as simple as the fact that you have 2 prohibited (MUST_NOT) clauses. These don't count towards coord(). I think it's hard to tell from your description (just since it doesn't have all the details). An explain or test case or something like that would be more efficient if it's still not making sense... -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Repeatability of results
We've observed something that, in some ways, is not surprising. If you take a set of documents that are close in 'score' to some query, and shuffle them in different orders and then see what results you get in what order from the reference query, the scores will vary according to the insertion order. I can't see any way to argue that it's wrong, but we find it inconvenient when we are testing something and we want to multithread the test to speed it up, thus making the insertion order nondeterministic. It occurred to me that perhaps you all have some similar concerns in testing lucene itself, and might have some advice about how to get around it, thus this email. We currently observe this with 2.9.1 and 3.5.0. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Repeatability of results
On Mon, Apr 2, 2012 at 5:33 PM, Michael McCandless luc...@mikemccandless.com wrote: Hmm that's odd. If the scores were identical I'd expect different sort order, since we tie-break by internal docID. But if the scores are different... the insertion order shouldn't matter. And, the score should not change as a function of insertion order... Well, I assumed that TF-IDF would wiggle. Do you have a small test case? Since this surprises you, I will build a test case. Mike McCandless http://blog.mikemccandless.com On Mon, Apr 2, 2012 at 5:28 PM, Benson Margulies bimargul...@gmail.com wrote: We've observed something that, in some ways, is not surprising. If you take a set of documents that are close in 'score' to some query, and shuffle them in different orders and then see what results you get in what order from the reference query, the scores will vary according to the insertion order. I can't see any way to argue that it's wrong, but we find it inconvenient when we are testing something and we want to multithread the test to speed it up, thus making the insertion order nondeterministic. It occurred to me that perhaps you all have some similar concerns in testing lucene itself, and might have some advice about how to get around it, thus this email. We currently observe this with 2.9.1 and 3.5.0. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
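One hedged workaround for the multithreaded-test half of this, assuming each test document carries a unique, un-analyzed 'id' field and an open IndexSearcher named 'searcher': impose an explicit tie-break sort so that result order, at least, stops depending on internal docIDs (the score values themselves can still wiggle with segment structure). Sketched against the 3.x API:

    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;

    // Score first, then a stable application-level key, so shuffled
    // insertion order can no longer reorder equal-scoring hits.
    Sort deterministic = new Sort(
        SortField.FIELD_SCORE,
        new SortField("id", SortField.STRING)); // SortField.Type.STRING in 4.x
    TopDocs hits = searcher.search(query, null, 10, deterministic);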
Re: Problems Indexing/Parsing Tibetan Text
fileformat.info On Mar 30, 2012, at 1:04 PM, Denis Brodeur denisbrod...@gmail.com wrote: Thanks Robert. That makes sense. Do you have a link handy where I can find this information? i.e. word boundary/punctuation for any Unicode character set? On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir rcm...@gmail.com wrote: On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur denisbrod...@gmail.com wrote: Hello, I'm currently working out some problems when searching for Tibetan characters. More specifically: \u0F10-\u0F19. We are using the ... Unicode doesn't consider most of these characters part of a word: most are punctuation and symbols (except 0F18 and 0F19, which are combining characters that combine with digits). For example, 0F14 is a text delimiter. In general StandardTokenizer discards punctuation and is geared at word boundaries, just as you would have trouble searching on characters like '(' in English. So I think it's totally expected. -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
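A quick way to check this without Lucene at all: plain Java reports the Unicode general category of each code point, which is what the tokenizer's word-break rules key off (the category claims in the comment are from the charts; verify against fileformat.info):

    // Most of U+0F10..U+0F19 are punctuation (Po) or symbols (So);
    // U+0F18 and U+0F19 are combining marks (Mn). None start a word.
    for (char c = '\u0F10'; c <= '\u0F19'; c++) {
        System.out.printf("U+%04X category=%d%n", (int) c, Character.getType(c));
    }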
Problem with updating a document or TermQuery with current trunk
I've posted a self-contained test case to github of a mystery. git://github.com/bimargulies/lucene-4-update-case.git The code can be seen at https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java. I write a doc to an index, close the index, then reopen and do a delete/add on the doc to add a field. If I iterate the docs in the index, all looks well, but when I try to query for the doc, it isn't found. To be a bit more specific, the doc has a field field1 which is a StringField.TYPE_STORED, and it is a query on that field which comes up empty. I expect to learn that I've missed something obvious, and I offer thanks and apologies in advance. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
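Compressed to its shape, the failing sequence looks like this (trunk-era 4.0 API; the writer/reader/searcher setup is elided, and 'value-1' stands in for the real field value):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // 1) Index a StringField: stored, indexed, NOT tokenized.
    Document doc = new Document();
    doc.add(new StringField("field1", "value-1", Field.Store.YES));
    writer.addDocument(doc);
    writer.close();

    // 2) Reopen, fetch the stored document, add a field, delete/add.
    Document stored = reader.document(0);
    stored.add(new StringField("field2", "extra", Field.Store.YES));
    writer2.deleteDocuments(new Term("field1", "value-1"));
    writer2.addDocument(stored); // the round-tripped copy loses its field type

    // 3) The exact-term query that used to match now comes up empty.
    TopDocs td = searcher.search(new TermQuery(new Term("field1", "value-1")), 1);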
A little more CHANGES.txt help on terms(), please
Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt appears to be missing one critical hint. If you have existing code that called IndexReader.terms(), where do you start to get a FieldsEnum? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: A little more CHANGES.txt help on terms(), please
On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler u...@thetaphi.de wrote: AtomicReader.fields() I went and read up on AtomicReader in CHANGES.txt. Should I call SegmentReader.getReader(IOContext)? I just posted a patch to CHANGES.txt to clarify before I read your email; shall I improve it to use this instead of MultiFields.getFields(indexReader).iterator(), which I came up with by fishing around for myself? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Tuesday, March 06, 2012 2:50 PM To: java-user@lucene.apache.org Subject: A little more CHANGES.txt help on terms(), please Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt appears to be missing one critical hint. If you have existing code that called IndexReader.terms(), where do you start to get a FieldsEnum? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
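For reference, the dump-all-terms idiom under discussion, written against the released 4.0 API (where Fields is Iterable over field names; the trunk of the day went through FieldsEnum instead), assuming an open IndexReader named 'reader':

    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    // Slow but simple: merges postings across segments on the fly,
    // which is fine for a dev tool.
    Fields fields = MultiFields.getFields(reader); // null for an empty index
    for (String field : fields) {
        Terms terms = fields.terms(field);
        if (terms == null) continue;
        TermsEnum te = terms.iterator(null);
        for (BytesRef term = te.next(); term != null; term = te.next()) {
            System.out.println(field + ":" + term.utf8ToString());
        }
    }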
Re: A little more CHANGES.txt help on terms(), please
On Tue, Mar 6, 2012 at 9:09 AM, Michael McCandless luc...@mikemccandless.com wrote: I think MIGRATE.txt talks about this? Yes it does, but it doesn't actually answer the specific question. See LUCENE-3853 where I added what seems to be missing. If it's somewhere else in the file I apologize. Mike McCandless http://blog.mikemccandless.com On Tue, Mar 6, 2012 at 8:50 AM, Benson Margulies bimargul...@gmail.com wrote: Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt appears to be missing one critical hint. If you have existing code that called IndexReader.terms(), where do you start to get a FieldsEnum? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: A little more CHANGES.txt help on terms(), please
Oh, I see, I didn't read far enough down. Well, the patch still repairs a bug in the code fragment relative to the Term enumeration. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: A little more CHANGES.txt help on terms(), please
Oh, ouch, there's no SegmentReader.getReader, I was reading IndexWriter. Sorry. On Tue, Mar 6, 2012 at 9:14 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler u...@thetaphi.de wrote: AtomicReader.fields() - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Problem with updating a document or TermQuery with current trunk
On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote: I think the issue is that your analyzer is standardanalyzer, yet field text value is value-1 Robert, Why is this field analyzed at all? It's built with StringField.TYPE_STORED. I'll push another copy that shows that it works fine when the doc is first added, and gets bad after the 'update', when the field acquires the 'tokenized' boolean mysteriously. --benson So standardanalyzer will tokenize this into two terms: value and 1 But later, you proceed to do TermQueries on value-1. This term won't exist... TermQuery etc that take Term don't analyze any text. Instead usually higher-level things like QueryParsers analyze text into Terms. On Tue, Mar 6, 2012 at 8:35 AM, Benson Margulies bimargul...@gmail.com wrote: I've posted a self-contained test case to github of a mystery. git://github.com/bimargulies/lucene-4-update-case.git The code can be seen at https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java. I write a doc to an index, close the index, then reopen and do a delete/add on the doc to add a field. If I iterate the docs in the index, all looks well, but when I try to query for the doc, it isn't found. To be a bit more specific, the doc has a field field1 which is a StringField.TYPE_STORED, and it is a query on that field which comes up empty. I expect to learn that I've missed something obvious, and I offer thanks and apologies in advance. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
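Robert's distinction, made concrete (works the same against 3.x or 4.x; the Version constant is an assumption):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.util.Version;

    // A TermQuery matches the raw indexed term; nothing analyzes "value-1".
    Query exact = new TermQuery(new Term("field1", "value-1"));

    // A query parser runs the analyzer first, so StandardAnalyzer splits
    // "value-1" into two tokens and two term queries: field1:value field1:1
    QueryParser qp = new QueryParser(Version.LUCENE_40, "field1",
        new StandardAnalyzer(Version.LUCENE_40));
    Query analyzed = qp.parse("value-1"); // declares ParseException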
Re: Problem with updating a document or TermQuery with current trunk
On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote: I think the issue is that your analyzer is standardanalyzer, yet field text value is value-1 Robert, Why is this field analyzed at all? It's built with StringField.TYPE_STORED. I'll push another copy that shows that it works fine when the doc is first added, and gets bad after the 'update', when the field acquires the 'tokenized' boolean mysteriously. I pushed a new copy that runs the query successfully before the 'delete/add' sequence, and then fails afterwards. --benson So standardanalyzer will tokenize this into two terms: value and 1 But later, you proceed to do TermQueries on value-1. This term won't exist... TermQuery etc that take Term don't analyze any text. Instead usually higher-level things like QueryParsers analyze text into Terms. On Tue, Mar 6, 2012 at 8:35 AM, Benson Margulies bimargul...@gmail.com wrote: I've posted a self-contained test case to github of a mystery. git://github.com/bimargulies/lucene-4-update-case.git The code can be seen at https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java. I write a doc to an index, close the index, then reopen and do a delete/add on the doc to add a field. If I iterate the docs in the index, all looks well, but when I try to query for the doc, it isn't found. To be a bit more specific, the doc has a field field1 which is a StringField.TYPE_STORED, and it is a query on that field which comes up empty. I expect to learn that I've missed something obvious, and I offer thanks and apologies in advance. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: A little more CHANGES.txt help on terms(), please
On Tue, Mar 6, 2012 at 9:34 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, MultiFields should only be used (as it is slow) if you exactly know what you are doing and what the consequences are. There is a change in Lucene 4.0, so you can no longer get terms and postings from a top-level (composite) reader. More info is also here: http://goo.gl/lMKTM Uwe, The 4.0 change is how I got here in the first place. Some code we have here dumped all the terms using the old IndexReader.terms(), so I was working on figuring out how to replace it. For my purposes, which are a dev tool, I think that MultiFields will be fine. --benson Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Tuesday, March 06, 2012 3:15 PM To: java-user@lucene.apache.org Subject: Re: A little more CHANGES.txt help on terms(), please On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler u...@thetaphi.de wrote: AtomicReader.fields() I went and read up on AtomicReader in CHANGES.txt. Should I call SegmentReader.getReader(IOContext)? I just posted a patch to CHANGES.txt to clarify before I read your email; shall I improve it to use this instead of MultiFields.getFields(indexReader).iterator(), which I came up with by fishing around for myself? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Tuesday, March 06, 2012 2:50 PM To: java-user@lucene.apache.org Subject: A little more CHANGES.txt help on terms(), please Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt appears to be missing one critical hint. If you have existing code that called IndexReader.terms(), where do you start to get a FieldsEnum? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Problem with updating a document or TermQuery with current trunk
On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir rcm...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote: I think the issue is that your analyzer is standardanalyzer, yet field text value is value-1 Robert, Why is this field analyzed at all? It's built with StringField.TYPE_STORED. thanks Benson, you are right! So, should I attach this to a JIRA? -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Problem with updating a document or TermQuery with current trunk
On Tue, Mar 6, 2012 at 9:47 AM, Uwe Schindler u...@thetaphi.de wrote: String field is analyzed, but with KeywordTokenizer, so all should be fine. I filed LUCENE-3854. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Tuesday, March 06, 2012 3:42 PM To: java-user@lucene.apache.org Subject: Re: Problem with updating a document or TermQuery with current trunk Hmm something is up here... I'll dig. Seems like we are somehow analyzing StringField when we shouldn't... Mike McCandless http://blog.mikemccandless.com On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir rcm...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote: I think the issue is that your analyzer is standardanalyzer, yet field text value is value-1 Robert, Why is this field analyzed at all? It's built with StringField.TYPE_STORED. thanks Benson, you are right! -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: A little more CHANGES.txt help on terms(), please
On Tue, Mar 6, 2012 at 9:46 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, The recommended way to get an atomic reader from a composite reader is to use SlowCompositeReaderWrapper.wrap(reader). MultiFields is now purely internal. I think it's only public because the codecs package may need it, otherwise it should be pkg-private. Oh! I'll rework the patch again, then. I might include some commentary in MultiFields as well. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Tuesday, March 06, 2012 3:40 PM To: java-user@lucene.apache.org Subject: Re: A little more CHANGES.txt help on terms(), please On Tue, Mar 6, 2012 at 9:34 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, MultiFields should only be used (as it is slow) if you exactly know what you are doing and what the consequences are. There is a change in Lucene 4.0, so you can no longer get terms and postings from a top-level (composite) reader. More info is also here: http://goo.gl/lMKTM Uwe, The 4.0 change is how I got here in the first place. Some code we have here dumped all the terms using the old IndexReader.terms(), so I was working on figuring out how to replace it. For my purposes, which are a dev tool, I think that MultiFields will be fine. --benson Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Tuesday, March 06, 2012 3:15 PM To: java-user@lucene.apache.org Subject: Re: A little more CHANGES.txt help on terms(), please On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler u...@thetaphi.de wrote: AtomicReader.fields() I went and read up on AtomicReader in CHANGES.txt. Should I call SegmentReader.getReader(IOContext)? I just posted a patch to CHANGES.txt to clarify before I read your email; shall I improve it to use this instead of MultiFields.getFields(indexReader).iterator(), which I came up with by fishing around for myself? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Tuesday, March 06, 2012 2:50 PM To: java-user@lucene.apache.org Subject: A little more CHANGES.txt help on terms(), please Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt appears to be missing one critical hint. If you have existing code that called IndexReader.terms(), where do you start to get a FieldsEnum? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
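In code, Uwe's recommended route looks like this (4.0 API; 'dir' is an assumed open Directory):

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.SlowCompositeReaderWrapper;
    import org.apache.lucene.index.Terms;

    DirectoryReader composite = DirectoryReader.open(dir);
    // Wrap the composite reader in a single atomic view; 'Slow' is in the
    // name because the merging happens on the fly.
    AtomicReader atomic = SlowCompositeReaderWrapper.wrap(composite);
    Terms terms = atomic.terms("field1");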
Re: Problem with updating a document or TermQuery with current trunk
On Tue, Mar 6, 2012 at 10:04 AM, Robert Muir rcm...@gmail.com wrote: Thanks Benson: look like the problem revolves around indexing Document/Fields you get back from IR.document... this has always been 'lossy', but I think this is a real API trap. Please keep testing :) Got a suggestion for sneaking around this in the mean time? On Tue, Mar 6, 2012 at 9:58 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:47 AM, Uwe Schindler u...@thetaphi.de wrote: String field is analyzed, but with KeywordTokenizer, so all should be fine. I filed LUCENE-3854. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Tuesday, March 06, 2012 3:42 PM To: java-user@lucene.apache.org Subject: Re: Problem with updating a document or TermQuery with current trunk Hmm something is up here... I'll dig. Seems like we are somehow analyzing StringField when we shouldn't... Mike McCandless http://blog.mikemccandless.com On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir rcm...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote: I think the issue is that your analyzer is standardanalyzer, yet field text value is value-1 Robert, Why is this field analyzed at all? It's built with StringField.TYPE_STORED. thanks Benson, you are right! -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
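Pending the LUCENE-3854 fix, one hedged workaround (an approach assumed here, not confirmed in the thread): don't re-index the lossy Document that IndexReader.document() returns; rebuild it with the original field types and let updateDocument do the delete/add.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.Term;

    // Stored values survive the round trip; index-time types (for example
    // StringField's tokenized=false) do not -- so rebuild, don't recycle.
    Document stored = reader.document(docNum);
    Document rebuilt = new Document();
    rebuilt.add(new StringField("field1", stored.get("field1"), Field.Store.YES));
    rebuilt.add(new StringField("field2", "extra", Field.Store.YES));
    writer.updateDocument(new Term("field1", stored.get("field1")), rebuilt);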
What replaces IndexReader.openIfChanged in Lucene 4.0?
Sorry, I'm coming up empty in Google here. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
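For the archive: in 4.0 the reopen API moved to DirectoryReader. A minimal sketch ('dir' is an assumed open Directory):

    import org.apache.lucene.index.DirectoryReader;

    DirectoryReader reader = DirectoryReader.open(dir);
    // ... index changes happen elsewhere ...
    DirectoryReader newer = DirectoryReader.openIfChanged(reader);
    if (newer != null) { // null means nothing changed
        reader.close();
        reader = newer;
    }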
Re: What replaces IndexReader.openIfChanged in Lucene 4.0?
To reduce noise slightly I'll stay on this thread. I'm looking at this file, and not seeing a pointer to what to do about QueryParser. Are jar file rearrangements supposed to be in that file? I think that I don't have the right jar yet; all I'm seeing is the 'surround' package. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: What replaces IndexReader.openIfChanged in Lucene 4.0?
OK, thanks. On Mon, Mar 5, 2012 at 11:22 AM, Steven A Rowe sar...@syr.edu wrote: You want the lucene-queryparser jar. From trunk MIGRATE.txt: * LUCENE-3283: Lucene's core o.a.l.queryParser QueryParsers have been consolidated into module/queryparser, where other QueryParsers from the codebase will also be placed. The following classes were moved: - o.a.l.queryParser.CharStream -> o.a.l.queryparser.classic.CharStream - o.a.l.queryParser.FastCharStream -> o.a.l.queryparser.classic.FastCharStream - o.a.l.queryParser.MultiFieldQueryParser -> o.a.l.queryparser.classic.MultiFieldQueryParser - o.a.l.queryParser.ParseException -> o.a.l.queryparser.classic.ParseException - o.a.l.queryParser.QueryParser -> o.a.l.queryparser.classic.QueryParser - o.a.l.queryParser.QueryParserBase -> o.a.l.queryparser.classic.QueryParserBase - o.a.l.queryParser.QueryParserConstants -> o.a.l.queryparser.classic.QueryParserConstants - o.a.l.queryParser.QueryParserTokenManager -> o.a.l.queryparser.classic.QueryParserTokenManager - o.a.l.queryParser.QueryParserToken -> o.a.l.queryparser.classic.Token - o.a.l.queryParser.QueryParserTokenMgrError -> o.a.l.queryparser.classic.TokenMgrError -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Monday, March 05, 2012 11:15 AM To: java-user@lucene.apache.org Subject: Re: What replaces IndexReader.openIfChanged in Lucene 4.0? To reduce noise slightly I'll stay on this thread. I'm looking at this file, and not seeing a pointer to what to do about QueryParser. Are jar file rearrangements supposed to be in that file? I think that I don't have the right jar yet; all I'm seeing is the 'surround' package. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
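So for most code the practical diff is one dependency plus one import:

    // Lucene 3.x:
    //   import org.apache.lucene.queryParser.QueryParser;
    // Lucene 4.0, from the lucene-queryparser jar:
    import org.apache.lucene.queryparser.classic.QueryParser;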
Updating a document.
I am walking down the document in an index by number, and I find that I want to update one. The updateDocument API only works on queries and terms, not numbers. So I can call remove and add, but, then, what's the document's number after that? Or is that not a meaningful question until I make a new reader? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
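For the archive, the usual pattern, assuming each document carries a unique un-analyzed 'id' field: updateDocument is an atomic delete-by-term plus add, and the replacement gets a fresh internal number, so the old number is only meaningful relative to the reader that produced it.

    import org.apache.lucene.index.Term;

    // Atomically delete the old copy (by term) and add the new one.
    writer.updateDocument(new Term("id", "doc-42"), newVersionOfDoc);

    // The old number becomes a deleted slot until merges reclaim it; the
    // replacement is numbered after the existing documents. Reopen the
    // reader before trusting document numbers again.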