Re: bytecount as String and prefix length

2005-11-01 Thread Marvin Humphrey
On Nov 1, 2005, at 9:51 AM, Doug Cutting wrote: Another approach might be to, instead of converting UTF-8 to strings right away, change things to convert lazily, if at all. During index merging such conversion should never be needed. !! There ought to be some gains possible there, then.

Re: bytecount as String and prefix length

2005-11-01 Thread Yonik Seeley
Thanks for looking into this Marvin... very interesting stuff! I haven't had a chance to review it in detail, but my gut tells me that it should be able to be faster. -Yonik

Re: bytecount as String and prefix length

2005-11-01 Thread Doug Cutting
Another approach might be to, instead of converting UTF-8 to strings right away, change things to convert lazily, if at all. During index merging such conversion should never be needed. You needn't do this systematically throughout Lucene, but only where it makes a big difference. For exa
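
The lazy-conversion idea in this message can be illustrated with a small sketch. The class below is hypothetical (LazyTermText and its methods are not Lucene classes, and this is not code from the thread): it keeps the raw UTF-8 bytes, compares terms byte-wise, and builds a String only when one is actually requested, so a merge-style loop that only compares and copies terms would never decode. It assumes standard UTF-8, for which unsigned byte order matches code point order.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical holder for term text; not part of Lucene.
final class LazyTermText {
    private final byte[] utf8;   // raw UTF-8 bytes as read from the index
    private String decoded;      // filled in on first toString() call only

    LazyTermText(byte[] utf8) {
        this.utf8 = utf8;
    }

    byte[] bytes() {
        return utf8;
    }

    boolean isDecoded() {
        return decoded != null;
    }

    // Compare as unsigned bytes, without decoding either term.
    int compareBytes(LazyTermText other) {
        int limit = Math.min(utf8.length, other.utf8.length);
        for (int i = 0; i < limit; i++) {
            int a = utf8[i] & 0xFF;
            int b = other.utf8[i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return utf8.length - other.utf8.length;
    }

    // Decode lazily and cache the result.
    @Override
    public String toString() {
        if (decoded == null) {
            decoded = new String(utf8, StandardCharsets.UTF_8);
        }
        return decoded;
    }
}
```

Whether this pays off in practice depends on how often callers end up asking for the String anyway, which is the "only where it makes a big difference" caveat in the message.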

Re: bytecount as String and prefix length

2005-11-01 Thread Doug Cutting
Marvin Humphrey wrote: I think it's time to throw in the towel. Please don't give up. I think you're quite close. I would be careful using CharBuffer instead of char[] unless you're sure all methods you call are very efficient. You could try avoiding CharBuffer by adding something (ugly) l

Re: bytecount as String and prefix length

2005-11-01 Thread Marvin Humphrey
I wrote: I've got one more idea... time to try overriding readString and writeString in BufferedIndexInput and BufferedIndexOutput, to take advantage of buffers that are already there. Too complicated to be worthwhile, it turns out. I think it's time to throw in the towel. Frustrating,

Re: bytecount as String and prefix length

2005-10-31 Thread Marvin Humphrey
On Oct 31, 2005, at 5:15 PM, Robert Engels wrote: All of the JDK source is available via download from Sun. Thanks. I believe the UTF-8 coding algos can be found in... j2se > src > share > classes > sun > nio > cs > UTF_8.java It looks like the translator methods have fairly high loop over

RE: bytecount as String and prefix length

2005-10-31 Thread Robert Engels
All of the JDK source is available via download from Sun. -Original Message- From: Marvin Humphrey [mailto:[EMAIL PROTECTED] Sent: Monday, October 31, 2005 6:31 PM To: java-dev@lucene.apache.org Subject: Re: bytecount as String and prefix length I wrote... > I think I'll take

Re: bytecount as String and prefix length

2005-10-31 Thread Marvin Humphrey
I wrote... I think I'll take a crack at a custom charsToUTF8 converter algo. Still no luck. Still 20% slower than the current implementation. The algo is below, for reference. It's entirely possible that my patches are doing something dumb that's causing this, given my limited experien
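
The algorithm from this message is not reproduced in the excerpt above. As a generic illustration only (not the author's code), a hand-rolled char[] to UTF-8 encoder might look like the sketch below. The class name CharsToUtf8 and its signature are assumptions, and it emits standard UTF-8, including 4-byte sequences for surrogate pairs.

```java
// Hypothetical helper; illustrative only, not the converter from the post.
final class CharsToUtf8 {
    // Encodes chars[0..len) into out and returns the number of bytes written.
    // The caller must size out for the worst case of 3 bytes per char.
    static int encode(char[] chars, int len, byte[] out) {
        int o = 0;
        for (int i = 0; i < len; i++) {
            char ch = chars[i];
            if (ch < 0x80) {
                // 1 byte: ASCII
                out[o++] = (byte) ch;
            } else if (ch < 0x800) {
                // 2 bytes
                out[o++] = (byte) (0xC0 | (ch >> 6));
                out[o++] = (byte) (0x80 | (ch & 0x3F));
            } else if (Character.isHighSurrogate(ch) && i + 1 < len
                    && Character.isLowSurrogate(chars[i + 1])) {
                // 4 bytes: a surrogate pair forms one supplementary code point
                int cp = Character.toCodePoint(ch, chars[++i]);
                out[o++] = (byte) (0xF0 | (cp >> 18));
                out[o++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                out[o++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out[o++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                // 3 bytes; an unpaired surrogate also lands here and is
                // encoded as-is, which is not valid standard UTF-8
                out[o++] = (byte) (0xE0 | (ch >> 12));
                out[o++] = (byte) (0x80 | ((ch >> 6) & 0x3F));
                out[o++] = (byte) (0x80 | (ch & 0x3F));
            }
        }
        return o;
    }
}
```

A converter like this avoids allocating a new byte[] per call when the output buffer is reused, but the thread's measurements suggest such savings do not automatically translate into faster indexing.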

Re: bytecount as String and prefix length

2005-10-31 Thread Marvin Humphrey
I wrote... Unfortunately, once the changes to TermBuffer, TermInfosWriter, and StringHelper are applied, execution speed at index-time suffers a slowdown of about 20%. Perhaps this can be blamed on all the calls to getBytes("UTF-8") in TermInfosWriter? Maybe alternative implementations
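
For context, here is a minimal sketch of what a byte-based shared-prefix computation could look like once term text is handled as UTF-8 bytes. BytePrefix and sharedPrefixLength are hypothetical names, not the patched StringHelper or TermInfosWriter code. The sketch also shows where a per-term getBytes call would sit, which is the kind of repeated allocation the message suspects.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the actual Lucene patch.
final class BytePrefix {
    // Number of leading bytes the two encoded terms have in common.
    static int sharedPrefixLength(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // Each getBytes call allocates a fresh array; doing this once per
        // term is one plausible source of the overhead discussed above.
        byte[] previous = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] current = "caf\u00e8".getBytes(StandardCharsets.UTF_8);
        System.out.println(sharedPrefixLength(previous, current));
    }
}
```

Note that a byte-level prefix can end in the middle of a multi-byte character: the two terms in main share three characters but four bytes. That is harmless as long as the suffix is also written and re-joined at the byte level.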

bytecount as String and prefix length

2005-10-30 Thread Marvin Humphrey
Greets, I've been experimenting with using the UTF-8 bytecount as the VInt count at the top of Lucene's string format, as was discussed back in the "Lucene does NOT use UTF-8" thread. Changes were made to IndexInput and IndexOutput as per some of Robert Engels's suggestions. Here's the i
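
To make the proposal concrete, here is a rough sketch of a string record whose leading VInt holds the UTF-8 byte count rather than the char count. It is not the patch from this message: ByteCountString and its methods are hypothetical, it writes to a plain ByteArrayOutputStream rather than IndexOutput, and it assumes standard UTF-8 via String.getBytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not Lucene's IndexInput/IndexOutput code.
final class ByteCountString {
    // VInt: 7 data bits per byte, low-order bits first; the high bit is set
    // on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Record layout: <VInt byte count><UTF-8 bytes>.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Reads one record that starts at offset 0 of buf.
    static String readString(byte[] buf) {
        int pos = 0;
        int shift = 0;
        int length = 0;
        byte b;
        do {
            b = buf[pos++];
            length |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        // With a byte count, the reader knows exactly how many bytes to
        // take (or skip) without decoding any characters first.
        return new String(buf, pos, length, StandardCharsets.UTF_8);
    }
}
```

The practical difference from a char-count prefix is that a reader can skip or bulk-copy a string's bytes without scanning them, at the cost of having to know the encoded length before writing the prefix.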