Re: Fast way to get the start of document

2012-06-23 Thread Mike Sokolov
I got the sense from Paul's post that he wanted a solution that didn't require changing his index, although I'm not sure there is one. Paul, if you're willing to re-index, you could also store the length of the text as a numeric field, retrieve that, and use it to drive the decision about whether…
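A minimal stdlib-only sketch of Mike's idea, assuming a stored numeric length field is used to decide whether a document's body exceeds some display cutoff. Plain Maps stand in for Lucene documents here, and the 10K cutoff is an assumption borrowed from later in the thread:

```java
import java.util.HashMap;
import java.util.Map;

public class LengthGate {
    // Hypothetical cutoff: only the first 10K characters are needed for display.
    static final int PREFIX_LIMIT = 10_000;

    // Simulates indexing: store the body plus its length as a numeric field,
    // so the length can be read later without loading the (large) body.
    static Map<String, Object> makeDoc(String body) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("body", body);
        doc.put("length", body.length());
        return doc;
    }

    // At retrieval time, consult only the cheap stored length to decide
    // whether fetching the full body is worthwhile.
    static boolean exceedsLimit(Map<String, Object> doc) {
        return (int) doc.get("length") > PREFIX_LIMIT;
    }
}
```

In a real Lucene index the length would go into a numeric field on the document; the point is that reading one small stored value is much cheaper than loading the whole body just to measure it.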

Re: Fast way to get the start of document

2012-06-23 Thread Jack Krupansky
Simply have two fields, "full_body" and "limited_body". The former would index, but not store, the full document text from Tika (the "content" metadata). The latter would store, but not necessarily index, the first 10K or so characters of the full text. Do searches on the full body field and highlight…

Re: StandardTokenizer and split tokens

2012-06-23 Thread Mansour Al Akeel
Uwe, thank you for the advice. I updated my code.

On Sat, Jun 23, 2012 at 3:15 AM, Uwe Schindler wrote:
>> I found the main issue.
>> I was using BytesRef without the length. This fixed the problem.
>>
>> String word = new String(ref.bytes, ref.offset, ref.length);
>
> Pleas…

RE: StandardTokenizer and split tokens

2012-06-23 Thread Uwe Schindler
> I found the main issue.
> I was using BytesRef without the length. This fixed the problem.
>
> String word = new String(ref.bytes, ref.offset, ref.length);

Please see my other mail: using no character set here is the second problem with your code. This is the correct way to do it: …

RE: StandardTokenizer and split tokens

2012-06-23 Thread Uwe Schindler
Don't ever do this:

String word = new String(ref.bytes);

This has the following problems:
- It ignores the character set!!! (In general: never, ever use new String(byte[]) without specifying the charset as the 2nd parameter!) byte[] != String. Depending on the default charset on your computer, this would return garbled…
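The fix Uwe is driving at can be shown with plain JDK classes: decode only the valid slice of the byte array, and always name the charset explicitly. (In Lucene itself, BytesRef also offers utf8ToString() for exactly this; the buffer layout below, with padding around the term bytes, is a made-up illustration of why offset and length matter.)

```java
import java.nio.charset.StandardCharsets;

public class TermDecode {
    // Correct: decode exactly the [offset, offset + length) slice with an
    // explicit charset, mirroring
    //   new String(ref.bytes, ref.offset, ref.length, StandardCharsets.UTF_8)
    // Lucene term bytes are UTF-8, so the platform default charset must not
    // be allowed to sneak in via new String(byte[]).
    static String decode(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }
}
```

With a reused buffer that contains stale bytes before and after the term, new String(ref.bytes) would decode the whole array (and with whatever the platform default charset happens to be), while the slice-plus-charset form recovers just the term.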