Re: Accent insensitive search for greek characters

2017-09-27 Thread Koji Sekiguchi
Hi Chitra, I don't have knowledge of the language, but can you solve the problem not at the TokenFilter level but at the CharFilter level, by setting up your own mapping definition using MappingCharFilter? Koji On 2017/09/27 21:39, Chitra wrote: Hi Ahmet, Thank you so much
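The CharFilter approach Koji suggests amounts to folding accented characters to their base forms before tokenization. As a minimal JDK-only sketch (not Lucene's MappingCharFilter itself; the class and method names here are hypothetical), the same folding can be done with Unicode decomposition:

```java
import java.text.Normalizer;

// Fold accented characters (Greek tonos, French acute, etc.) to their base
// letters: decompose to NFD, then drop the combining marks. Inside Lucene,
// a MappingCharFilter with a custom NormalizeCharMap would play this role
// ahead of the tokenizer.
public class AccentFolder {
    public static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks (the accents after NFD).
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```

Because the folding happens before tokenization, both the indexed text and the query text see the same accent-free characters, which is what makes the search accent-insensitive.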

[ANN] KEA-lucene (program that extracts keyphrases from Lucene index)

2016-07-14 Thread Koji Sekiguchi
Hello everyone! I've developed KEA-lucene [1]. It is an Apache Lucene implementation of KEA [2]. KEA is a program developed by the University of Waikato in New Zealand that automatically extracts key phrases (keywords) from natural language documents. KEA stands for Keyphrase Extraction Algo

Re: Grouping Lucene result

2016-02-25 Thread Koji Sekiguchi
Hi Taher, Solr has a result grouping function. I think it works in two steps. First, it finds how many groups there are in the result and chooses the top groups (say 10 groups) using a priority queue. Second, it provides 10 priority queues, one for each group, and searches again to collect second or a
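The two-step grouping Koji outlines can be sketched with plain JDK priority queues. This is an illustrative toy, not Solr's grouping classes (the `Doc` record and all names are hypothetical): pass 1 selects the top groups by their best-scoring hit, pass 2 keeps one queue per selected group to collect the top documents inside it.

```java
import java.util.*;

public class GroupingSketch {
    public record Doc(String group, float score) {}

    public static Map<String, List<Doc>> group(List<Doc> hits, int topGroups, int docsPerGroup) {
        // Pass 1: best score seen per group, then take the top N groups.
        Map<String, Float> best = new HashMap<>();
        for (Doc d : hits) best.merge(d.group(), d.score(), Math::max);
        List<String> selected = best.entrySet().stream()
            .sorted(Map.Entry.<String, Float>comparingByValue().reversed())
            .limit(topGroups).map(Map.Entry::getKey).toList();

        // Pass 2: one priority queue (min-heap) per selected group.
        Map<String, PriorityQueue<Doc>> queues = new HashMap<>();
        for (String g : selected)
            queues.put(g, new PriorityQueue<>(Comparator.comparing(Doc::score)));
        for (Doc d : hits) {
            PriorityQueue<Doc> q = queues.get(d.group());
            if (q == null) continue; // not a selected group
            q.offer(d);
            if (q.size() > docsPerGroup) q.poll(); // evict the weakest hit
        }
        Map<String, List<Doc>> result = new LinkedHashMap<>();
        for (String g : selected) {
            List<Doc> docs = new ArrayList<>(queues.get(g));
            docs.sort(Comparator.comparing(Doc::score).reversed());
            result.put(g, docs);
        }
        return result;
    }
}
```

In Solr the second pass is a real second search over the index; here both passes just iterate the same hit list to keep the sketch self-contained.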

Re: Learning to Rank algorithms in Lucene

2015-08-24 Thread Koji Sekiguchi
Hi ajinkya, Last week I gave a technical talk about NLP4L at the Lucene/Solr meetup: http://www.meetup.com/Downtown-SF-Apache-Lucene-Solr-Meetup/events/223899054/ In my talk, I discussed the implementation idea of Learning to Rank using Lucene. Please take a look at pages 48 to 50 of the follow

Re: Luke for Lucene 5.x?

2015-04-23 Thread Koji Sekiguchi
Hi Clemens, NLP4L, which stands for Natural Language Processing for Lucene, has a function for browsing a Lucene index aside from its NLP tools. It supports the 5.x index format. https://github.com/NLP4L/nlp4l#using-lucene-index-browser Thanks, Koji On 2015/04/24 15:10, Clemens Wyss DEV wrote: From ti

Re: Data structures used by Lucene

2015-04-07 Thread Koji Sekiguchi
Hi Prateek, Using Luke, a GUI-based browser tool for Lucene indexes, may be a good way for you to start examining the structure of a Lucene index. https://github.com/DmitryKey/luke/ NLP4L also provides a CUI-based index browser for Lucene users aside from its NLP functions. https://github.com/NLP4L/nlp

Re: Tokenizer for Brown Corpus?

2015-02-24 Thread Koji Sekiguchi
hub.com/INL/BlackLab/wiki/Blacklab-query-tool -- Jack Krupansky On Tue, Feb 24, 2015 at 1:40 AM, Koji Sekiguchi wrote: Hello, Doesn't Lucene have a Tokenizer/Analyzer for Brown Corpus? There doesn't seem to be such tokenizers/analyzers in Lucene. As I didn't want re-inventing th

Tokenizer for Brown Corpus?

2015-02-23 Thread Koji Sekiguchi
Hello, Doesn't Lucene have a Tokenizer/Analyzer for the Brown Corpus? There don't seem to be such tokenizers/analyzers in Lucene. As I didn't want to reinvent the wheel, I googled, and got a list of snippets that include "the quick brown fox..." :) Koji ---

Re: o.a.l.u.fst package's sample code might be outdated?

2014-12-13 Thread Koji Sekiguchi
Hi Tomoko, Please don't hesitate to open a JIRA issue and give your patch to fix the error you found. Koji -- http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html (2014/12/14 11:11), Tomoko Uchida wrote: Sorry again, I checked the o.a.l.u.fst.TestFSTs.j

Re: [ANN] word2vec for Lucene

2014-11-20 Thread Koji Sekiguchi
ays rather pleasant for the LSI/LSA-like approach, but precisely this is mathematically opaque. Maybe it's more a question of presentation. Paul On 20 nov. 2014, at 16:24, Koji Sekiguchi wrote: Hi Paul, I cannot compare it to SemanticVectors as I don't know SemanticVectors. But w

Re: [ANN] word2vec for Lucene

2014-11-20 Thread Koji Sekiguchi
At least I see more transparent math in the web-page. > Maybe this helps a bit? > > SemanticVectors has always been rather pleasant for the LSI/LSA-like approach, but > precisely this is mathematically opaque. > Maybe it's more a question of presentation. > > Paul > >

Re: [ANN] word2vec for Lucene

2014-11-20 Thread Koji Sekiguchi
Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') Thanks, Koji (2014/11/20 20:01), Paul Libbrecht wrote: > Hello Koji, > > how would you compare that to SemanticVectors? > > paul > > On

[ANN] word2vec for Lucene

2014-11-20 Thread Koji Sekiguchi
Hello, It's my pleasure to share that I have an interesting tool "word2vec for Lucene" available at https://github.com/kojisekig/word2vec-lucene . As you can imagine, you can use "word2vec for Lucene" to extract word vectors from Lucene index. Thank you, Koji -- http://soleami.com/blog/compar

Re: Finding words not followed by other words

2014-07-12 Thread Koji Sekiguchi
Hi Michael, I haven't executed this yet, but can you try this: SpanNotQuery(SpanNearQuery("George Washington"), SpanNearQuery("George Washington Carver"))? Koji -- http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html (2014/07/11 23:20), Michael Ryan wro
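Koji's SpanNotQuery composition excludes matches of the longer phrase from matches of the shorter one. Outside Lucene, the same "phrase not followed by" logic can be sketched with a regex negative lookahead (a plain-Java analogy for illustration, not the Span query API itself):

```java
import java.util.regex.Pattern;

public class NotFollowedBy {
    // Matches "George Washington" only when NOT immediately followed by
    // " Carver" - analogous to SpanNotQuery(include, exclude).
    static final Pattern P = Pattern.compile("George Washington(?! Carver)");

    public static boolean matches(String text) {
        return P.matcher(text).find();
    }
}
```

The Lucene version works on analyzed term positions rather than raw characters, so it survives tokenization, case folding, and position gaps that a regex on the raw text would not.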

Re: How to add machine learning to Apache lucene

2014-05-16 Thread Koji Sekiguchi
Hi Priyanka, > How can I add a Machine Learning part in Apache Lucene? I think your question is too broad to answer because machine learning covers a lot of things... Lucene already has a text categorization function, which is a well-known task of NLP, and NLP is a part of machine learning. I'v

Re: How to add machine learning to Apache lucene

2014-05-10 Thread Koji Sekiguchi
Hi Priyanka, > How can I add a Machine Learning part in Apache Lucene? I think your question is too broad to answer because machine learning covers a lot of things... Lucene already has a text categorization function, which is a well-known task of NLP, and NLP is a part of machine learning. I'v

Re: [blog post] Comparing Document Classification Functions of Lucene and Mahout

2014-03-07 Thread Koji Sekiguchi
ili wrote: cool Koji, thanks a lot for sharing. Some useful points / suggestions come out of it, let's see if we can follow up :) Regards, Tommaso 2014-03-07 3:30 GMT+01:00 Koji Sekiguchi : Hello, I just posted an article on Comparing Document Classification Functions of Lucene and Mahout.

[blog post] Comparing Document Classification Functions of Lucene and Mahout

2014-03-06 Thread Koji Sekiguchi
Hello, I just posted an article on Comparing Document Classification Functions of Lucene and Mahout. http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html Comments are welcome. :) Thanks! koji -- http://soleami.com/blog/comparing-document-classification

Re: Help using ShingleFilter/NGramTokenizer: Could not find implementing class for org.apache.lucene.analysis.tokenattributes.OffsetAttribute

2014-01-24 Thread Koji Sekiguchi
Hi Russell, It seems that the error message says the implementing class for OffsetAttribute cannot be found in your classpath in the (Pig?) environment. There seem to be implementing classes OffsetAttributeImpl and Token, according to the Javadoc: http://lucene.apache.org/core/4_6_0/core/org/a

Re: Phrase highlight

2013-11-26 Thread Koji Sekiguchi
(13/11/27 9:19), Scott Smith wrote: I'm doing some highlighting with the following code fragment: formatter = new SimpleHTMLFormatter("<b>", "</b>"); Scorer score = new QueryScorer(myQuery); ht = new Highlighter(formatter, score); ht.

Re: Synonym Search in Lucene..

2013-10-09 Thread Koji Sekiguchi
x for only English.. I need to create Dictionary Index for all languages.I want to know whether anything like wordnet which i can readily plugin in my application .. Please Kindly Guide me.. Thanks and Regards Vignesh Srinivasan. On Wed, Oct 9, 2013 at 5:56 PM, Koji Sekiguchi wrote: Hi VIGNESH,

Re: Synonym Search in Lucene..

2013-10-09 Thread Koji Sekiguchi
wikipedia is giving for all languages. Please kindly help. On Mon, Oct 7, 2013 at 8:06 PM, Koji Sekiguchi wrote: (13/10/07 18:33), VIGNESH S wrote: Hi, How to implement synonym Search for All languages.. As far as i know,Wordnet has only English Support..Is there any other we can use to get

Re: Synonym Search in Lucene..

2013-10-07 Thread Koji Sekiguchi
(13/10/07 18:33), VIGNESH S wrote: Hi, How to implement synonym search for all languages? As far as I know, Wordnet has only English support. Is there anything else we can use to get support for all languages? I think most people make synonym data manually... I've never explored Wordnet, but I t

Re: Lucene Text Similarity

2013-09-03 Thread Koji Sekiguchi
(13/09/04 2:33), David Miranda wrote: Is there any way to check the similarity of texts with Lucene? I have DBpedia indexed and wanted to get the texts most similar between the DBpedia abstract and another text. If I do a search in the abstract field with a particular text, the result is not

Re: Complete phrase Suggest Feature in Apache Lucene

2013-08-02 Thread Koji Sekiguchi
(13/08/02 17:16), Ankit Murarka wrote: Hello All, Just like the spellcheck feature, which after a lot of trouble was implemented, is it possible to implement a Complete Phrase Suggest feature in Lucene 4.3? So if I enter an incorrect phrase, it can suggest me a few possible valid phrases. One way could

Re: Adding BM25 in Lucene

2013-07-11 Thread Koji Sekiguchi
(13/07/11 22:56), gtkesh wrote: Hi everyone! I have two questions: 1. What are the cases where Lucene's default tf-idf outperforms BM25? What are the best use cases where I should use tf-idf or BM25? 2. Is there any user-friendly guide or something about how I can use the BM25 algorithm instead

Re: A Problem in Customizing DefaultSimilarity

2013-06-12 Thread Koji Sekiguchi
Hi Oliver, > My questions are: > > 1. Why are the overridden lengthNorm() (under Lucene410) or > computeNorm() (under Lucene350) methods not called during the search > process? Regardless of whether you override the method or not, the Lucene framework calls the method at index time only be

Re: [blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-28 Thread Koji Sekiguchi
e you shared source code / jar for the same so at it could be used ? Thanks, Rajesh On Mon, May 27, 2013 at 8:44 PM, Koji Sekiguchi wrote: Hello, Sorry for cross post. I just wanted to announce that I've written a blog post on how to create synonyms.txt file automatically from Wikiped

[blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-27 Thread Koji Sekiguchi
Hello, Sorry for cross post. I just wanted to announce that I've written a blog post on how to create synonyms.txt file automatically from Wikipedia: http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html Hope that the article gives someone a good experience! koji

Re: HTML tags and Lucene highlighting

2012-04-05 Thread Koji Sekiguchi
(12/04/06 2:34), okayndc wrote: Hello, I currently use Lucene version 3.0...probably need to upgrade to a more current version soon. The problem that I have is when I search for an HTML tag (ex. ), Lucene returns the highlighted HTML tag ~ which is what I DO NOT want. Is there a way to "

Re: Measuring document similarity

2012-03-12 Thread Koji Sekiguchi
(12/03/13 2:38), Hassane Cabir wrote: Hi guys, I'm using Lucene for my project and I need to calculate how similar two (or more) documents are, using TFIDF. How to get TFIDF with lucene? Any insights on this? Solr has TermVectorComponent which can return tf, df and tf-idf of each term in a docu
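The tf-idf figure mentioned here combines a term's frequency within a document with its rarity across the corpus. A minimal sketch of the textbook formula (Lucene and Solr use their own variants, e.g. sqrt(tf) and a smoothed log idf, so this is illustrative only; the class name is hypothetical):

```java
public class TfIdf {
    // Textbook tf-idf: term frequency in the document times the log of
    // (total docs / docs containing the term). A term in every document
    // gets idf = log(1) = 0, i.e. it carries no discriminating weight.
    public static double tfIdf(int tf, int docFreq, int numDocs) {
        return tf * Math.log((double) numDocs / docFreq);
    }
}
```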

Re: lucene gosen diff btn jars

2012-03-02 Thread Koji Sekiguchi
Hi Thushara, Please use lucene-gosen mailing list for lucene-gosen questions: http://groups.google.com/group/lucene-gosen Thanks, koji -- Query Log Visualizer for Apache Solr http://soleami.com/ (12/03/03 6:41), Thushara Wijeratna wrote: > I'm testing lucene-gosen for Japanese tokenization an

Re: background merge hit exception

2011-09-12 Thread Koji Sekiguchi
re they running? We added this check originally as a workaround for a JRE bug... but usually when that bug strikes the file size is very close (like off by just 1 byte or 8 bytes or something). Mike McCandless http://blog.mikemccandless.com 2011/9/9 Koji Sekiguchi: A user here hit the exception th

Re: background merge hit exception

2011-09-09 Thread Koji Sekiguchi
Also: what java version are they running? We added this check originally as a workaround for a JRE bug... but usually when that bug strikes the file size is very close (like off by just 1 byte or 8 bytes or something). I think they are using 1.6, but I should ask the minor number. Could you show

Re: background merge hit exception

2011-09-09 Thread Koji Sekiguchi
e is very close (like off by just 1 byte or 8 bytes or something). I think they are using 1.6, but I should ask the minor number. Could you show me the pointer of the JRE bug you mentioned? Thank you very much! koji Mike McCandless http://blog.mikemccandless.com 2011/9/9 Koji Sekiguchi:

background merge hit exception

2011-09-09 Thread Koji Sekiguchi
A user here hit the exception the title says when optimizing. They're using Solr 1.4 (Lucene 2.9) running on a server that mounts NFS for the index. I think I know the famous "Stale NFS File Handle IOException" problem, but I think it causes FileNotFoundException. Is there any chance to hit the exc

Re: Solution for FHV and NGram Max Min Gram Restriction

2011-06-21 Thread Koji Sekiguchi
(11/06/22 2:03), Anupam Tangri wrote: Hi, We are using lucene 3.2 for our project where I needed to highlight search matches. I earlier used the default highlighter which did not work correctly all the time. So, I started using FVH which worked beautifully till I started searching multiple t

Re: highlighting performance

2011-06-20 Thread Koji Sekiguchi
Mike, FVH used to be faster for large docs. I wrote the FVH section for Lucene in Action, and it said: In contrib/benchmark (covered in appendix C), there's an algorithm file called highlight-vs-vector-highlight.alg that lets you see the difference between the two highlighters in processing time. As of

Re: FastVectorHighlighter.getBestFragments returning null

2011-05-27 Thread Koji Sekiguchi
(11/05/27 19:57), Joel Halbert wrote: Hi, I'm using Lucene 3.0.3. I'm extracting snippets using FastVectorHighlighter, for some snippets (I think always when searching for exact matches, quoted) the fragment is null. Code looks like: query = QueryParser.escape(query);

Re: FastVectorHighlighter.getBestFragments returning null

2011-05-27 Thread Koji Sekiguchi
(11/05/27 20:56), Pierre GOSSE wrote: Hi, Maybe is it related to : https://issues.apache.org/jira/browse/LUCENE-3087 No, because Joel's problem is FastVectorHighlighter, but LUCENE-3087 is for Highlighter. koji -- http://www.rondhuit.com/en/ --

Re: FastVectorHighlighter - can FieldFragList expose fragInfo?

2011-05-24 Thread Koji Sekiguchi
(11/05/24 3:28), Sujit Pal wrote: > Hello, > > My version: Lucene 3.1.0 > > I've had to customize the snippet for highlighting based on our > application requirements. Specifically, instead of the snippet being a > set of relevant fragments in the text, I need it to be the first > sentence where

Re: FastVectorHighlighter StringIndexOutofBounds bug

2011-05-23 Thread Koji Sekiguchi
(11/05/23 14:36), Weiwei Wang wrote: > 1. source string: 7 > 2. WhitespaceTokenizer + EGramTokenFilter > 3. FastVectorHighlighter, > 4. debug info: subInfos=(777((8,11))777((5,8))777((2,5)))/3.0(2,102), > srcIndex is not correctly computed for the second loop of the outer for-loop > How

Re: The MoreLikeThisHandler could include highlighting ?

2011-05-03 Thread Koji Sekiguchi
(11/03/01 21:16), Amel Fraisse wrote: Hello, Could the MoreLikeThisHandler include highlighting? Is it true to define a MoreLikeThisHandler like this: ? true contenu Thank you for your help. Amel. Amel, 1. I think you shou

Re: Highlighting a phrase with "Single"

2011-04-06 Thread Koji Sekiguchi
(11/04/06 14:01), shrinath.m wrote: If there is a phrase in search, the highlighter highlights every word separately.. Like this : I love Lucene Instead what I want is like this : I love Lucene Not sure if it's my mailer's problem or not, but I don't see the difference between the above two. But reading t

Re: Difference between regular Highlighter and Fast Vector Highlighter ?

2011-04-01 Thread Koji Sekiguchi
(11/04/01 21:32), shrinath.m wrote: I was wondering what's the difference between Lucene's 2 implementations of highlighters... I saw the javadoc of FVH, but it only says "another implementation of Lucene Highlighter" ... The Description section in the javadoc shows the features of FVH: https://

Re: Regarding MoreLikeThis similarity Search

2011-03-18 Thread Koji Sekiguchi
(11/03/19 6:16), madhuri_1...@yahoo.com wrote: Hi, I am new to lucene ... I have a question while implementing similarity search using a MoreLikeThis query. I have written a small program but it is not giving any results. In my index file I have both stored and unstored (analyzed) fields. Sampl

Re: getting the number of updated documents

2011-03-10 Thread Koji Sekiguchi
Does IndexWriter (or somewhere else) have a method that returns the number of updated documents before commit? You have maxDocs, which gives you maxdocid-1, but this might not be super accurate since there might have been merges going on in the background. I am not sure if this number yo

getting the number of updated documents

2011-03-10 Thread Koji Sekiguchi
Hello, Does IndexWriter (or somewhere else) have a method that returns the number of updated documents before commit? I have an optimized index and I'm using iw.updateDocument(Term,Document) with the index, and before commit, I'd like to know the number of updated documents from IndexWrite

Re: FastVectorHighlighter and field compression

2011-03-07 Thread Koji Sekiguchi
(11/03/07 1:16), Joel Halbert wrote: Hi, I'm using FastVectorHighlighter for highlighting, 3.0.3. At the moment this is highlighting a field which is stored, but not compressed. It all works perfectly. I'd like to compress the field that is being highlighted, but it seems like the new way to c

Re: Trying to extend MappingCharFilter so that it only changes a token if the length of the token matches the length of singleMatch

2011-01-28 Thread Koji Sekiguchi
(11/01/25 2:14), Paul Taylor wrote: On 22/01/2011 15:43, Koji Sekiguchi wrote: (11/01/20 22:19), Paul Taylor wrote: Trying to extend MappingCharFilter so that it only changes a token if the length of the token matches the length of singleMatch in NormalizeCharMap (currently the singleMatch

Re: Trying to extend MappingCharFilter so that it only changes a token if the length of the token matches the length of singleMatch

2011-01-22 Thread Koji Sekiguchi
(11/01/20 22:19), Paul Taylor wrote: Trying to extend MappingCharFilter so that it only changes a token if the length of the token matches the length of singleMatch in NormalizeCharMap (currently the singleMatch just has to be found in the token; I want it to match the whole token). Can this be

Re: question about Scorer.freq()

2010-10-04 Thread Koji Sekiguchi
Hi Mike, Hmm are you only gathering the MUST_NOT TermScorers? (In which case I'd expect that the .docID() would not match the docID being collected). Or do you also see .docID() not matching for SHOULD and MUST sub queries? The snippet I copy-and-pasted in my previous mail was not appropriate. Sor

question about Scorer.freq()

2010-10-03 Thread Koji Sekiguchi
Hello, I'd like to know which field got hit in each doc in the hit results. To implement it, I thought I could use Scorer.freq(), which was introduced in 3.1/4.0: https://issues.apache.org/jira/browse/LUCENE-2590 But I haven't been successful so far. What I did is: - in each visit method in MockS

Re: Using FastVectorHighlighter for snippets

2010-09-21 Thread Koji Sekiguchi
(10/09/22 3:24), Devshree Sane wrote: I am using the FastVectorHighlighter for retrieving snippets from the index. I am a bit confused about the parameters that are passed to the FastVectorHighlighter.getBestFragments() method. One parameter is a document id and another is the maximum number o

Re: How to modify a document Field before the document is indexed?

2010-07-19 Thread Koji Sekiguchi
(10/07/20 7:31), Joe Hansen wrote: Hey All, I am using Apache Lucene (2.9.1) and it's fast and it works great! I have a question in connection with Apache PDFBox. The following command creates a Lucene Document from a PDF file: Document document = org.apache.pdfbox.searchengine.lucene.LucenePDFD

Re: scoring and index size

2010-07-09 Thread Koji Sekiguchi
(10/07/09 19:30), manjula wijewickrema wrote: Uwe, thanks for your comments. Following is the code I used in this case. Could you pls. let me know where I have to insert UNLIMITED field length? and how? Thanks again! Manjula Manjula, You can set UNLIMITED field length in the IW constructor: http

Re: phrase query highlighter spans matching

2010-05-31 Thread Koji Sekiguchi
(10/05/19 13:58), Li Li wrote: hi all, I read lucene in action 2nd Ed. It says SimpleSpanFragmenter will "make fragments that always include the spans matching each document". And also a SpanScorer existed for this use. But I can't find any class named SpanScorer in lucene 3.0.1. And the res

Re: Return Entire field from GetBestFragment in FastVectorHighlighter

2010-05-15 Thread Koji Sekiguchi
(10/05/12 20:32), Midhat Ali wrote: Is it possible to return the entire field contents instead of a fixed-size fragment? In Highlighter, there is a NullFragmenter. What's its counterpart in FastVectorHighlighter? Currently, FVH doesn't have such a function. I've opened a JIRA issue: https://iss

Re: FieldCache memory estimation - term values are interned?

2010-05-01 Thread Koji Sekiguchi
Yonik Seeley wrote: On Sat, May 1, 2010 at 8:23 PM, Koji Sekiguchi wrote: Yonik Seeley wrote: Values are not interned, but in a single field cache entry (String[]) the same String object is used for all docs with that same value. Yeah, you are right. Because I could see the

Re: FieldCache memory estimation - term values are interned?

2010-05-01 Thread Koji Sekiguchi
Yonik Seeley wrote: 2010/4/30 Koji Sekiguchi : Are Strings that are obtained via FieldCache.DEFAULT.getStrings( reader, field ) interned? Since I have a requirement for having FieldCaches of some fields in a 250M-doc index, I'd like to estimate the memory consumed by FieldCache. By looki

FieldCache memory estimation - term values are interned?

2010-04-30 Thread Koji Sekiguchi
Hello, Are Strings that are obtained via FieldCache.DEFAULT.getStrings( reader, field ) interned? Since I have a requirement for having FieldCaches of some fields in a 250M-doc index, I'd like to estimate the memory consumed by FieldCache. By looking at the FieldCacheImpl source code, it seems that field name

Re: Term offsets for highlighting

2010-04-26 Thread Koji Sekiguchi
Stephen Greene wrote: Hi Koji, Thank you. I implemented a solution based on the FieldTermStackTest.java and if I do a search like "iron ore" it matches iron or ore. The same is true if I specify iron AND ore. The termSetMap[0].value[0] = ore and termSetMap[0].value[1] = iron. What am I missing

Re: Term offsets for highlighting

2010-04-24 Thread Koji Sekiguchi
Hi Steve, > is there a way to access a TermVector containing only matched terms, > or is my previous approach still the So you want to access FieldTermStack, I understand. I described the way to access it in my previous mail: You cannot access FieldTermStack from FVH, but I think you can create i

Re: Term offsets for highlighting

2010-04-19 Thread Koji Sekiguchi
Stephen Greene wrote: Hi Koji, An additional question. Is it possible to access the FieldTermStack from the FastVectorHighlighter after the it has been populated with matching terms from the field? I think this would provide an ideal solution for this problem, as ultimately I am only concerned

Re: Term offsets for highlighting

2010-04-18 Thread Koji Sekiguchi
Stephen Greene wrote: Hi Koji, Thank you for your reply. I did try the QueryScorer without success, but I was using Lucene 2.4.x Hi Steve, I thought you were using 2.9 or later because you mentioned FastVectorHighlighter in your previous mail (FVH was first introduced in 2.9). If I remembe

Re: Term offsets for highlighting

2010-04-16 Thread Koji Sekiguchi
Stephen Greene wrote: Hello, I am trying to determine begin and end offsets for terms and phrases matching a query. Is there a way using either the highlighter or fast vector highlighter in contrib? I have already attempted extending the highlighter which would match terms but would not

Re: Trying to simplify MappingCharFilter to match whole field

2010-03-20 Thread Koji Sekiguchi
Paul Taylor wrote: I'm trying to create a CharFilter which works like MappingCharFilter but only changes the matchString if the match String matches the whole field rather than a portion of the field (this is to handle some exceptions without affecting other data). Trouble is the code in Mappin

Re: Highlighting large documents (Lucene 3.0.0)

2010-03-01 Thread Koji Sekiguchi
-Arne- wrote: Hi Koji, thanks for your answer. Can you help me once again? What exactly am I supposed to do? The concrete program in my mind here: public class TestHighlightTruncatedSearchQuery { static Directory dir = new RAMDirectory(); static Analyzer analyzer = new BiGramAnalyzer();

Re: Highlighting large documents (Lucene 3.0.0)

2010-03-01 Thread Koji Sekiguchi
-Arne- wrote: Hi, I'm using Lucene 3.0.0 and have large documents to search (logfiles 0,5-20MB). For better search results the query tokens are truncated left and right. A search for "user" is made to "*user*". The performance of searching even complex queries with more than one searchterm is qu

Re: FastVectorHighlighter truncated queries

2010-02-24 Thread Koji Sekiguchi
halbtuerderschwarze wrote: query.rewrite() didn't help, for queries like ipod* or *ipod I still didn't get fragments. Arne You're right. This is still an open issue: https://issues.apache.org/jira/browse/LUCENE-1889 Koji -- http://www.rondhuit.com/en/ --

Re: FastVectorHighlighter and query with multiple fields

2010-01-29 Thread Koji Sekiguchi
Marc Sturlese wrote: I have FastVectorHighlighter working with a query like: title:Ipod OR title:IPad but it's not working when (0 snippets are returned): title:Ipod OR content:IPad This is true when you are going to highlight IPad in title field and set fieldMatch to true at the FVH constr

Re: Looking for a MappingCharFilter that accepts regular expressions

2009-12-15 Thread Koji Sekiguchi
Paul Taylor wrote: CharStream. Found it at http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/PatternReplaceFilter.java?revision=804726&view=markup, BTW why not add this to the Lucene codebase rather than the Solr codebase. Unfortunately it doesn't address my problem be

Re: I need to implement a TokenFilter to break season07

2009-12-15 Thread Koji Sekiguchi
Weiwei Wang wrote: Hi, all I currently need a TokenFilter to break the token season07 into two tokens: season 07 I'd recommend you refer to WordDelimiterFilter in Solr. Koji -- http://www.rondhuit.com/en/ - To unsubscri
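The specific split WordDelimiterFilter would perform here, breaking at a letter/digit boundary, can be sketched with a JDK-only regex (the class name is hypothetical; Solr's filter handles many more cases such as case changes, delimiters, and catenation options):

```java
import java.util.Arrays;
import java.util.List;

public class LetterDigitSplitter {
    // Split a token at every zero-width position where a letter is
    // followed by a digit, or a digit by a letter: "season07" -> season, 07.
    public static List<String> split(String token) {
        return Arrays.asList(
            token.split("(?<=\\p{L})(?=\\p{N})|(?<=\\p{N})(?=\\p{L})"));
    }
}
```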

Re: Offset Problem

2009-12-14 Thread Koji Sekiguchi
Weiwei Wang wrote: The offset is incorrect for PatternReplaceCharFilter so the highlighting result is wrong. How to fix it? As I noted in the comment of the source, if you produce a phrase from a term and try to highlight a term in the produced phrase, the highlighted snippet will be undesira

Re: Looking for a MappingCharFilter that accepts regular expressions

2009-12-13 Thread Koji Sekiguchi
Koji Sekiguchi wrote: Paul Taylor wrote: I want my search to treat 'No. 1' and 'No.1' the same, because in our context it's one token. I want 'No. 1' to become 'No.1'. I need to do this before tokenizing because the tokenizer would split one value into

Re: Looking for a MappingCharFilter that accepts regular expressions

2009-12-13 Thread Koji Sekiguchi
Paul Taylor wrote: I want my search to treat 'No. 1' and 'No.1' the same, because in our context it's one token. I want 'No. 1' to become 'No.1'. I need to do this before tokenizing because the tokenizer would split one value into two terms and the other into just one term. I already use a NormalizeM

Re: Recover special terms from StandardTokenizer

2009-12-11 Thread Koji Sekiguchi
MappingCharFilter can be used to convert c++ to cplusplus. Koji -- http://www.rondhuit.com/en/ Anshum wrote: How about getting the original token stream and then converting c++ to cplusplus or anyother such transform. Or perhaps you might look at using/extending(in the non java sense) some ot

Re: Handling + as a special character in Lucene search

2009-10-22 Thread Koji Sekiguchi
Or you can use MappingCharFilter if you are using Lucene 2.9. You can convert "c++" into "cplusplus" prior to running Tokenizer. Koji -- http://www.rondhuit.com/en/ Ian Lea wrote: You need to make sure that these terms are getting indexed, by using an analyzer that won't drop them and using
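A JDK-only sketch of the pre-tokenization mapping Koji describes: rewrite special terms such as "c++" before the tokenizer strips the '+' characters. Inside Lucene this would be MappingCharFilter's job; the mapping table and class name here are hypothetical.

```java
import java.util.Map;

public class SpecialTermMapper {
    // Rewrites applied to the raw character stream BEFORE tokenization,
    // so symbols the tokenizer would drop survive as searchable terms.
    static final Map<String, String> MAP =
        Map.of("c++", "cplusplus", "c#", "csharp");

    public static String apply(String text) {
        for (Map.Entry<String, String> e : MAP.entrySet())
            text = text.replace(e.getKey(), e.getValue());
        return text;
    }
}
```

The same mapping has to be applied at both index time and query time, otherwise a query for "c++" would no longer match the indexed "cplusplus".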

Re: Filter before tokenize ?

2009-09-12 Thread Koji Sekiguchi
Hi Paul, CharFilter should work for this case. How about this? public class MappingAnd { static final String[] DOCS = { "R&B", "H&M", "Hennes & Mauritz", "cheeseburger and french fries" }; static final String F = "f"; static Directory dir = new RAMDirectory(); static Analyzer analyzer =

Re: Path Tokenizer?

2009-08-24 Thread Koji Sekiguchi
Hi Ryan, I've looked for it when I implemented SOLR-64 patch, but not there. So I implemented HierarchicalTokenFilterFactory. I've not looked into your patch yet, but my impression is that probably we can share such TokenFilter. Thanks, Koji Ryan McKinley wrote: Hello- I'm looking for a way

Re: SpanScorer problem?

2009-07-17 Thread Koji Sekiguchi
. Thanks a lot for > the test case - made this one fun. > > - Mark > > Koji Sekiguchi wrote: > >> Hello, >> >> This problem was reported by my customer. They are using Solr 1.3 >> and uni-gram, but it can be reproduced with Lucene 2.9 and >> White

SpanScorer problem?

2009-07-17 Thread Koji Sekiguchi
Hello, This problem was reported by my customer. They are using Solr 1.3 and uni-gram, but it can be reproduced with Lucene 2.9 and WhitespaceAnalyzer. The program for reproducing is at the end of this mail. Query: (f1:"a b c d" OR f2:"a b c d") AND (f1:"b c g" OR f2:"b c g") The snippet we expe

Re: Boolean retrieval

2009-07-13 Thread Koji Sekiguchi
tsuraan wrote: Make that "Collector" (new as of 2.9). HitCollector is the old (deprecated as of 2.9) way, which always pre-computed the score of each hit and passed the score to the collect method. Where can I find docs for 2.9? Do I just have to check out the lucene trunk and run javado

HitCollectorWrapper

2009-06-08 Thread Koji Sekiguchi
CHANGES.txt said that we can use HitCollectorWrapper: 12. LUCENE-1575: HitCollector is now deprecated in favor of a new Collector abstract class. For easy migration, people can use HitCollectorWrapper, which translates (wraps) HitCollector into Collector. But it looks package-private? Thank you,

Re: Possible bug in QueryParser when using CJKAnalyzer (lucene 2.4.1)

2009-06-01 Thread Koji Sekiguchi
I'm not sure this is the same case, but there is a report and patch for CJKTokenizer in JIRA: https://issues.apache.org/jira/browse/LUCENE-973 Koji Zhang, Lisheng wrote: Hi, When I use the lucene 2.4.1 QueryParser with CJKAnalyzer, somehow it always generates an extra space; for example, if the

Re: Why Lucene phrase searching fail?

2009-04-27 Thread Koji Sekiguchi
Another possible factor: if you are using the omitTf feature, phrase queries won't work. Koji Ian Lea wrote: What does query.toString() say? Are you using standard analyzers with standard lowercasing, stop words etc? Knocking up a very simple program/index that demonstrates the problem

Re: Different Analyzer for different fields in the same document

2009-04-10 Thread Koji Sekiguchi
John Seer wrote: Hello, Is there any way that a single document can have different analyzers for different fields? I think one way of doing it is to create a custom analyzer which will do field-specific analysis. Any other suggestions? There is PerFieldAnalyzerWrapper http://hudson.z
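PerFieldAnalyzerWrapper routes each field to its own analyzer, falling back to a default for unlisted fields. A JDK-only sketch of that dispatch idea (the functions below just stand in for real Analyzers; all names are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PerFieldSketch {
    final Map<String, Function<String, List<String>>> perField;
    final Function<String, List<String>> fallback;

    public PerFieldSketch(Map<String, Function<String, List<String>>> perField,
                          Function<String, List<String>> fallback) {
        this.perField = perField;
        this.fallback = fallback;
    }

    // Pick the analysis function registered for this field,
    // or fall back to the default - the PerFieldAnalyzerWrapper pattern.
    public List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, fallback).apply(text);
    }
}
```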

Re: Help to determine why an optimized index is proportionaly too big.

2009-04-09 Thread Koji Sekiguchi
Dan OConnor wrote: Thanks for the feedback Chris. Can you (or someone else on the list) tell me about the IndexMerge tool? Please see: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/misc/IndexMergeTool.html Koji

Re: Lucene help with query

2009-04-09 Thread Koji Sekiguchi
John Seer wrote: Koji Sekiguchi-2 wrote: If you omit norms when indexing the name field, you'll get same score back. Koji During building I set omit norms, but result doesn't change at all. I am still getting the same score I meant if you set nameField.setOmitN

Re: How can I make an analyzer that ignores the numbers of the text?

2009-04-08 Thread Koji Sekiguchi
Steven A Rowe wrote: Hi Ariel, As Koji mentioned, https://issues.apache.org/jira/browse/SOLR-448 contains a NumberFilter. It filters out tokens that successfully parse as Doubles. I'm not sure, since the examples you gave seem to use "," as the decimal character, how this interacts with the
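The NumberFilter idea from SOLR-448 can be sketched in stdlib Java: drop any token that successfully parses as a Double, the way a stop filter drops stop words. As the reply notes, this hinges on the decimal character: `Double.parseDouble` accepts "4.15" but rejects comma decimals like "4,33".

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a number-dropping token filter (cf. NumberFilter in SOLR-448):
// any token that parses as a Double is removed from the stream.
public class NumberFilterDemo {
    static boolean isNumber(String token) {
        try {
            Double.parseDouble(token);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) if (!isNumber(t)) kept.add(t);
        return kept;
    }

    public static void main(String[] args) {
        // "4,33" survives because a comma decimal is not a valid Double literal
        System.out.println(filter(List.of("price", "3.8", "100", "4,33")));
    }
}
```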

Re: Lucene help with query

2009-04-08 Thread Koji Sekiguchi
If you omit norms when indexing the name field, you'll get the same score back. Koji The Seer wrote: Hello, I have 5 Lucene documents: name: Apple, name: Apple martini, name: Apple drink, name: Apple sweet drink. I am using Lucene's default similarity and the standard analyzer. When I am searching for
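The reason "Apple" outscores "Apple sweet drink" for a query on "apple", and why omitting norms equalizes them: DefaultSimilarity's length norm is 1/sqrt(numTerms), stored per document and multiplied into the score, so shorter fields score higher; with norms omitted the factor is effectively 1.0 for every document. A small stdlib sketch of that formula:

```java
// DefaultSimilarity's lengthNorm formula, shown standalone: shorter fields
// get a larger norm, hence a higher score for the same matching term.
// (In the index the value is additionally byte-encoded, losing precision.)
public class LengthNormDemo {
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1)); // 1-term field, e.g. "Apple" -> 1.0
        System.out.println(lengthNorm(3)); // 3-term field, e.g. "Apple sweet drink"
        System.out.println(lengthNorm(4)); // 4-term field -> 0.5
    }
}
```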

Re: How can I make an analyzer that ignores the numbers of the text?

2009-04-08 Thread Koji Sekiguchi
Ariel wrote: Hi everybody: I would like to know how I can make an analyzer that ignores the numbers of the text, the way stop words are ignored. For example, the terms 3.8, 100, 4.15, 4,33 should not be added to the index. How can I do that? Regards Ariel There is a patch for filter

Re: Unexpected highlighted text

2009-04-06 Thread Koji Sekiguchi
This problem is filed at: https://issues.apache.org/jira/browse/LUCENE-1489 You may want to take a look at LUCENE-1522 for highlighting N-gram tokens: https://issues.apache.org/jira/browse/LUCENE-1522 Koji ito hayato wrote: > Hi All, > My name is Hayato. > > I have a question for Highlighter

Re: Term level boosting

2009-03-24 Thread Koji Sekiguchi
. :) Program snippets are there regarding Payload/BoostTermQuery/scorePayload(). Koji On 3/24/09, Koji Sekiguchi wrote: Seid Mohammed wrote: Hi All I want my lucene to index documents and making some terms to have more boost value. so, if I index the document "The quick fox jumps ove

Re: Term level boosting

2009-03-24 Thread Koji Sekiguchi
Seid Mohammed wrote: Hi All, I want my Lucene to index documents with some terms having a higher boost value. So, if I index the document "The quick fox jumps over the lazy dog", I want the terms fox and dog to have a greater boost value. How can I do that? Thanks a lot, seid M How about
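The payload approach behind BoostingTermQuery and Similarity.scorePayload() works by storing a per-occurrence boost next to the term at index time (e.g. written as "fox|2.0" with a delimiter) and multiplying it into the score at query time. A stdlib-only sketch of that flow; the field text and boost values are made up, and the real mechanism stores the boost bytes in each posting's payload:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of payload-based term boosting: parse "term|boost" tokens at index
// time, then multiply the stored boost into the score at query time,
// the way scorePayload() feeds into BoostingTermQuery's score.
public class PayloadBoostDemo {
    // Parse "term|boost" pairs; a token without a delimiter defaults to 1.0.
    static Map<String, Float> parse(String text) {
        Map<String, Float> boosts = new HashMap<>();
        for (String tok : text.split("\\s+")) {
            int bar = tok.indexOf('|');
            if (bar >= 0) {
                boosts.put(tok.substring(0, bar), Float.parseFloat(tok.substring(bar + 1)));
            } else {
                boosts.put(tok, 1.0f);
            }
        }
        return boosts;
    }

    static float score(Map<String, Float> boosts, String term, float baseScore) {
        return baseScore * boosts.getOrDefault(term, 0f); // scorePayload-like multiply
    }

    public static void main(String[] args) {
        Map<String, Float> doc = parse("the quick fox|2.0 jumps over the lazy dog|3.0");
        System.out.println(score(doc, "fox", 0.5f)); // boosted: 0.5 * 2.0 = 1.0
        System.out.println(score(doc, "the", 0.5f)); // unboosted: 0.5 * 1.0 = 0.5
    }
}
```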

Re: Sort by count?

2009-03-09 Thread Koji Sekiguchi
> first, I rewrote the Similarity (including lengthNorm), but it did not work..., so I modified the Lucene source by setting the norm table to 1.0 (for all); that works. If you override lengthNorm(), reindexing is needed for it to take effect. Koji

Re: Index Structure

2009-02-19 Thread Koji Sekiguchi
There is no additional setting for me... Koji Seid Mohammed wrote: I have tried Amharic fonts, it displays square-like characters; maybe there is a kind of setting for it? Seid M On 2/19/09, Koji Sekiguchi wrote: Seid Mohammed wrote: great, I have got it do luke support unicode

Re: Index Structure

2009-02-19 Thread Koji Sekiguchi
Seid Mohammed wrote: great, I have got it. Does Luke support Unicode? I am trying Lucene in a non-English language Of course. I can see Japanese terms without problems. Koji

Re: querying English conjugation of verbs and comparative and superlative of adjectives

2009-01-30 Thread Koji Sekiguchi
o investigate the stemmers would that work? I confess that I've never examined the output in detail, but they might help. I don't know of any synonym lists offhand, but then again I haven't looked. Best er...@miminallyhelpful.com On Mon, Jan 26, 2009 at 8:51 AM, Koji Sekiguchi wrot

querying English conjugation of verbs and comparative and superlative of adjectives

2009-01-26 Thread Koji Sekiguchi
Hello, I have a requirement to search English words with taking into account conjugation of verbs and comparative and superlative of adjectives. I googled but couldn't find solution so far. Do I have to have a synonym table to solve this problem or is there someone who have good solution in this l
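One common answer to this kind of requirement is stemming at index and query time (e.g. Lucene's PorterStemFilter or a Snowball stemmer), which folds conjugations ("jumps", "jumped", "jumping") and many comparatives/superlatives ("tallest") toward one indexable form; irregular forms still need a synonym table. A deliberately naive stdlib sketch of suffix stripping, not a real stemmer:

```java
// Naive suffix-stripping sketch of the stemming idea. Real solutions use
// PorterStemFilter or Snowball; this toy version mishandles many words
// (e.g. "bigger" -> "bigg") and exists only to show the mechanism.
public class NaiveStemDemo {
    static String stem(String word) {
        String[] suffixes = {"ing", "est", "ed", "er", "s"};
        for (String suf : suffixes) {
            // keep at least a 3-character stem so short words survive intact
            if (word.endsWith(suf) && word.length() - suf.length() >= 3) {
                return word.substring(0, word.length() - suf.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("jumps"));   // jump
        System.out.println(stem("jumping")); // jump
        System.out.println(stem("tallest")); // tall
    }
}
```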
