codec mismatch

2014-02-14 Thread Jason Wee
Hello, this is my first question to the Lucene mailing list, sorry if the question sounds funny. I have been experimenting with storing Lucene index files on Cassandra, but unfortunately I keep getting overwhelmed by exceptions. Below is the stack trace. org.apache.lucene.index.CorruptIndexException: codec mismatch:

Re: codec mismatch

2014-02-14 Thread Michael McCandless
This means Lucene was attempting to open _0.fnm but somehow got the contents of _0.cfs instead; it seems likely that this is a bug in the Cassandra Directory implementation. Somehow it's opening the wrong file name? Mike McCandless http://blog.mikemccandless.com On Fri, Feb 14, 2014 at 3:13 AM,
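
For background, every index file that Lucene 4.x writes through CodecUtil starts with a fixed magic int followed by the codec name, so one way to smoke-test a custom Directory is to dump that header for every file and check that each file reports the codec you expect (e.g. that _0.fnm does not report the compound-file codec). A minimal sketch, assuming a Lucene 4.x classpath; a few files such as segments.gen may legitimately lack the header:

    import org.apache.lucene.codecs.CodecUtil;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    public class DirectorySmokeTest {
        // Print the codec header of every file in the Directory. A wrong-file
        // bug (returning _0.cfs bytes for _0.fnm) shows up as the wrong codec
        // name on the mismatched file.
        public static void dumpHeaders(Directory dir) throws Exception {
            for (String name : dir.listAll()) {
                try (IndexInput in = dir.openInput(name, IOContext.READONCE)) {
                    if (in.length() >= 4 && in.readInt() == CodecUtil.CODEC_MAGIC) {
                        System.out.println(name + " -> codec " + in.readString());
                    } else {
                        System.out.println(name + " -> no codec header");
                    }
                }
            }
        }
    }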

Tokenization and PrefixQuery

2014-02-14 Thread Yann-Erwan Perio
Hello, I am designing a system with documents having one field containing values such as Ae1 Br2 Cy8 ..., i.e. a sequence of items made of letters and numbers (max 7 characters per item), all separated by a space, with possibly 200 items per field and no limit on the number of documents (although I would not

Re: Collector is collecting more than the specified hits

2014-02-14 Thread Michael McCandless
This is how Collector works: it is called for every document matching the query, and then its job is to choose which of those hits to keep. This is because in general the hits to keep can come at any time, not just the first N hits you see; e.g. the best scoring hit may be the very last one. But

Re: Tokenization and PrefixQuery

2014-02-14 Thread Michael McCandless
On Fri, Feb 14, 2014 at 6:17 AM, Yann-Erwan Perio ye.pe...@gmail.com wrote: Hello, I am designing a system with documents having one field containing values such as Ae1 Br2 Cy8 ..., i.e. a sequence of items made of letters and numbers (max=7 per item), all separated by a space, possibly 200
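
For readers following along, a minimal end-to-end sketch of the setup under discussion (the field name and the Lucene 4.5 API version are assumptions): tokenize on whitespace so each space-separated item becomes its own term, then match items with PrefixQuery:

    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class PrefixDemo {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_45,
                    new WhitespaceAnalyzer(Version.LUCENE_45));
            try (IndexWriter w = new IndexWriter(dir, cfg)) {
                Document doc = new Document();
                // each space-separated item ("Ae1", "Br2", ...) becomes one term
                doc.add(new TextField("items", "Ae1 Br2 Cy8", Field.Store.YES));
                w.addDocument(doc);
            }
            try (DirectoryReader r = DirectoryReader.open(dir)) {
                IndexSearcher s = new IndexSearcher(r);
                TopDocs hits = s.search(new PrefixQuery(new Term("items", "Ae")), 10);
                System.out.println("hits: " + hits.totalHits); // expect 1
            }
        }
    }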

Re: Tokenization and PrefixQuery

2014-02-14 Thread Yann-Erwan Perio
On Fri, Feb 14, 2014 at 12:33 PM, Michael McCandless luc...@mikemccandless.com wrote: This is similar to PathHierarchyTokenizer, I think. Ah, yes, very much. I'll check it out and see if I can make something of it. I am not sure to what extent it'll be reusable though, as my tokenizer also sets

Re: Tokenization and PrefixQuery

2014-02-14 Thread Yann-Erwan Perio
On Fri, Feb 14, 2014 at 1:11 PM, Yann-Erwan Perio ye.pe...@gmail.com wrote: On Fri, Feb 14, 2014 at 12:33 PM, Michael McCandless luc...@mikemccandless.com wrote: Hi again, That should not be the case: it should match all terms with that prefix regardless of the term's length. Try to boil it

Re: Tokenization and PrefixQuery

2014-02-14 Thread Michael McCandless
On Fri, Feb 14, 2014 at 8:21 AM, Yann-Erwan Perio ye.pe...@gmail.com wrote: I have written a test which demonstrates that the mistake is indeed on my side. It's probably due to inconsistent rules for indexing/searching content having special characters (namely the plus sign). OK, thanks for
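
Since the culprit was query-syntax characters such as the plus sign, one common remedy (a hedged sketch, not necessarily the fix used here) is to escape user input before parsing, so the query side produces the same term the index side did:

    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class EscapeDemo {
        // "Ae1+" would otherwise be read as QueryParser syntax; escaping
        // makes it parse as the literal term "Ae1+"
        public static Query parseUserInput(String raw) throws Exception {
            String escaped = QueryParser.escape(raw); // "Ae1+" -> "Ae1\+"
            QueryParser qp = new QueryParser(Version.LUCENE_45, "items",
                    new WhitespaceAnalyzer(Version.LUCENE_45));
            return qp.parse(escaped);
        }
    }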

Re: Collector is collecting more than the specified hits

2014-02-14 Thread saisantoshi
I am not interested in the scores at all. My requirement is simple: I only need the first 100 hits, or whatever numHits I specify (irrespective of their scores). The collector should stop after collecting the numHits specified. Is there a way to tell the collector to stop after collecting the
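
One common idiom for this (a sketch against the Lucene 4.x Collector API, not a built-in feature) is to abort the search with a private unchecked exception once numHits documents have been collected, and catch that exception at the call site, much as TimeLimitingCollector does for timeouts:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public class FirstNCollector extends Collector {
        // thrown internally to stop the search early
        public static final class Done extends RuntimeException {}

        private final int numHits;
        private final List<Integer> docIds = new ArrayList<Integer>();
        private int docBase;

        public FirstNCollector(int numHits) { this.numHits = numHits; }

        @Override public void setScorer(Scorer scorer) {} // scores not needed
        @Override public void setNextReader(AtomicReaderContext ctx) { docBase = ctx.docBase; }
        @Override public boolean acceptsDocsOutOfOrder() { return true; }

        @Override public void collect(int doc) throws IOException {
            docIds.add(docBase + doc);
            if (docIds.size() >= numHits) throw new Done();
        }

        public List<Integer> docIds() { return docIds; }
    }

At the call site, treat the exception as the normal stop signal:

    FirstNCollector c = new FirstNCollector(100);
    try {
        searcher.search(query, c);
    } catch (FirstNCollector.Done expected) {
        // collected enough hits
    }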

Reverse Matching

2014-02-14 Thread Siraj Haider
Hi There, Is there a way to do reverse matching, by indexing the queries in an index and then passing a document through to see how many queries match it? I know that I can keep the queries in memory, parse the document into a memory index, and then loop through trying to match each query. The

IndexWriter croaks on large file

2014-02-14 Thread John Cecere
I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file over 2GB in size, it dies with the following exception: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648,endOffset=-2147483647 Essentially,

Extending StandardTokenizer Jflex to not split on '/'

2014-02-14 Thread Diego Fernandez
Hi guys, this is my first time posting on the Lucene list, so hello everyone. I really like the way that the StandardTokenizer works, however I'd like for it to not split tokens on / (forward slash). I've been looking at http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to

Re: Extending StandardTokenizer Jflex to not split on '/'

2014-02-14 Thread Steve Rowe
Welcome Diego, I think you’re right about MidLetter - adding a char to it should disable splitting on that char, as long as there is a letter on one side or the other. (If you’d like that behavior to be extended to numeric digits, you should use MidNumLet instead.) I tested this by adding
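
After regenerating the tokenizer from the modified grammar, one quick way to verify the change (a minimal sketch, assuming the Lucene 4.x StandardTokenizer constructor) is to print the tokens for a slash-containing input; the stock tokenizer emits "foo" and "bar", while the modified one should keep "foo/bar" intact:

    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class SlashCheck {
        public static void main(String[] args) throws Exception {
            StandardTokenizer tok = new StandardTokenizer(Version.LUCENE_45,
                    new StringReader("foo/bar baz"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println(term.toString());
            }
            tok.end();
            tok.close();
        }
    }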

Re: IndexWriter croaks on large file

2014-02-14 Thread John Cecere
I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't have control over the size of the documents that go into my database. Sometimes my customer's log files end up really big. I'm willing to have huge indexes for these things. Wouldn't just changing from

Re: IndexWriter croaks on large file

2014-02-14 Thread Glen Newton
You should consider making each _line_ of the log file a (Lucene) document (assuming it is a one-log-entry-per-line file) -Glen On Fri, Feb 14, 2014 at 4:12 PM, John Cecere john.cec...@oracle.com wrote: I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't have
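
A minimal sketch of that suggestion (file name, field names, and the Lucene 4.5 API version are invented for illustration): stream the log and add one small document per line, instead of feeding the whole file to a single document:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class LogIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_45,
                    new StandardAnalyzer(Version.LUCENE_45));
            try (IndexWriter w = new IndexWriter(FSDirectory.open(new File("logidx")), cfg);
                 BufferedReader r = new BufferedReader(new FileReader("huge.log"))) {
                String line;
                long lineNo = 0;
                while ((line = r.readLine()) != null) {
                    Document doc = new Document();
                    doc.add(new LongField("line", ++lineNo, Field.Store.YES));
                    doc.add(new TextField("text", line, Field.Store.YES));
                    w.addDocument(doc); // one small document per log line
                }
            }
        }
    }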

Re: IndexWriter croaks on large file

2014-02-14 Thread Tri Cao
As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with that approach though :) I do agree that indexing huge documents doesn't seem to have a lot of value; even when you know a doc is a hit for a certain query, how are you going to display the results to

Only highlight terms that caused a search hit/match

2014-02-14 Thread Steve Davids
Hello, I have recently been given a requirement to improve document highlights within our system. Unfortunately, the current functionality gives more of a best guess at which terms to highlight, rather than highlighting the terms that actually produced the match. A couple of examples of issues

char mapping in lucene-icu

2014-02-14 Thread alxsss
Hello, I am trying to use the lucene-icu lib in solr-4.6.1. I need to change a char mapping in lucene-icu. I have made changes to lucene/analysis/icu/src/data/utr30/DiacriticFolding.txt and built the jar file using ant, but it did not help. I took a look at lucene/analysis/icu/build.xml and saw these

Re: Reverse Matching

2014-02-14 Thread Ahmet Arslan
Hi Siraj, MemoryIndex is used for exactly this use case. Here are a couple of pointers: http://www.slideshare.net/jdhok/diy-percolator http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html On Friday, February 14, 2014 8:21 PM, Siraj Haider si...@jobdiva.com
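
The heart of that approach, as a minimal sketch (the stored-query map and field name are invented; the Lucene 4.x lucene-memory module is assumed): build a throwaway MemoryIndex from the incoming document and run every stored query against it:

    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class Percolator {
        // run every stored query against a single incoming document
        public static void percolate(String body, Map<String, Query> storedQueries) {
            MemoryIndex mi = new MemoryIndex();
            mi.addField("body", body, new StandardAnalyzer(Version.LUCENE_45));
            for (Map.Entry<String, Query> e : storedQueries.entrySet()) {
                if (mi.search(e.getValue()) > 0.0f) {
                    System.out.println("matched stored query: " + e.getKey());
                }
            }
        }
    }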

Re: char mapping in lucene-icu

2014-02-14 Thread Jack Krupansky
Do you get the exception if you run ant before changing the data files? 'Header authentication failed, please check if you have a valid ICU data file' Check with the ICU project as to the proper format for THEIR files. I mean, this doesn't sound like a Lucene issue. Maybe it could be as

Re: Reverse Matching

2014-02-14 Thread Ahmet Arslan
Hi, Here are two more relevant links: https://github.com/flaxsearch/luwak http://www.lucenerevolution.org/2013/Turning-Search-Upside-Down-Using-Lucene-for-Very-Fast-Stored-Queries Ahmet On Saturday, February 15, 2014 3:01 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi Siraj, MemoryIndex is

Re: char mapping in lucene-icu

2014-02-14 Thread alxsss
Hi Jack, I do not get the exception before changing the data files, and I do not get the exception after changing the data files and creating the lucene-icu...jar with ant. But changing the data files and running ant does not change the output, so I decided to manually create the .nrm file using the steps outlined in the