[jira] Assigned: (LUCENE-2215) paging collector

2010-01-19 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-2215:
---

Assignee: Grant Ingersoll

 paging collector
 

 Key: LUCENE-2215
 URL: https://issues.apache.org/jira/browse/LUCENE-2215
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.4, 3.0
Reporter: Adam Heinz
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: IterablePaging.java, PagingCollector.java, 
 TestingPagingCollector.java


 http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
 Somebody assign this to Aaron McCurry and we'll see if we can get enough 
 votes on this issue to convince him to upload his patch.  :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2227) separate chararrayset interface from impl

2010-01-19 Thread Robert Muir (JIRA)
separate chararrayset interface from impl
-

 Key: LUCENE-2227
 URL: https://issues.apache.org/jira/browse/LUCENE-2227
 Project: Lucene - Java
  Issue Type: Task
  Components: Analysis
Affects Versions: 3.0
Reporter: Robert Muir
Priority: Minor


CharArraySet should be abstract; the hashing implementation currently being 
used should instead be called CharArrayHashSet.

Currently our 'CharArrayHashSet' is hardcoded across Lucene, but others might 
want their own impl.
For example, implementing CharArraySet as a DFA with 
org.apache.lucene.util.automaton gives faster contains(char[], int, int) 
performance, as it can do a 'fast fail' and need not hash the entire string.

This is useful as it speeds up indexing in StopFilter.

I did not think this would be faster, but I did benchmarks over and over with 
the Reuters corpus, and it is, even with English text's weird average word 
length of 5.
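To illustrate the 'fast fail' idea outside of the automaton package, here is a hypothetical trie-backed set (not the proposed Lucene API or the DFA impl) whose contains(char[], int, int) can reject a term at the first character instead of hashing the whole string:

```java
import java.util.HashMap;
import java.util.Map;

public class CharTrieSet {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }
    private final Node root = new Node();

    public void add(String word) {
        Node n = root;
        for (int i = 0; i < word.length(); i++) {
            n = n.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        n.terminal = true;
    }

    public boolean contains(char[] text, int off, int len) {
        Node n = root;
        for (int i = 0; i < len; i++) {
            n = n.children.get(text[off + i]);
            if (n == null) {
                return false; // fast fail: no stored word starts this way
            }
        }
        return n.terminal;
    }
}
```

A hash-based set must consume all of the term's characters to compute the hash before it can answer; the trie (like a DFA) bails out as soon as no stored word shares the current prefix, which is why it can win on stop-word lookups.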





[jira] Commented: (LUCENE-1410) PFOR implementation

2010-01-19 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802235#action_12802235
 ] 

Renaud Delbru commented on LUCENE-1410:
---

On another aspect: why does the PFOR/FOR encode the number of compressed 
integers into the block header, since this information is already stored in the 
stream header (block size information written in 
FixedIntBlockIndexOutput#init())? Is there a particular use case for that?
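For readers following along: frame-of-reference (FOR) coding stores a block of integers as a base value plus fixed-width offsets. A minimal, hypothetical sketch of that idea (not the patch's actual code):

```java
import java.util.Arrays;

// A block is stored as a base value plus small non-negative offsets;
// decoding just adds the base back.
public final class ForSketch {
    // out[0] = base (the block minimum); out[1..] = offsets from base
    public static int[] toOffsets(int[] block) {
        int base = Arrays.stream(block).min().orElse(0);
        int[] out = new int[block.length + 1];
        out[0] = base;
        for (int i = 0; i < block.length; i++) {
            out[i + 1] = block[i] - base;
        }
        return out;
    }

    // bits needed per offset once the base has been subtracted
    public static int bitsRequired(int[] offsets) {
        int max = 0;
        for (int i = 1; i < offsets.length; i++) {
            max = Math.max(max, offsets[i]);
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }
}
```

The "patched" variant (PFOR) additionally stores outliers as exceptions so the common offsets can use fewer bits.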

 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, 
 LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, 
 LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, 
 TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.




[jira] Issue Comment Edited: (LUCENE-1410) PFOR implementation

2010-01-19 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802235#action_12802235
 ] 

Renaud Delbru edited comment on LUCENE-1410 at 1/19/10 1:10 PM:


On another aspect: why does the PFOR/FOR encode the number of compressed 
integers into the block header, since this information is already stored in the 
stream header (block size information written in 
FixedIntBlockIndexOutput#init())? Is there a particular use case for that? Is 
it for the special case when a block is complete (when the block encodes the 
remaining integers of the list)?

  was (Author: renaud.delbru):
On another aspect: why does the PFOR/FOR encode the number of compressed 
integers into the block header, since this information is already stored in the 
stream header (block size information written in 
FixedIntBlockIndexOutput#init())? Is there a particular use case for that?
  




[jira] Updated: (LUCENE-2111) Wrapup flexible indexing

2010-01-19 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2111:
---

Attachment: LUCENE-2111.patch

Attached patch w/ various fixes:

  - Switch over payloads to use BytesRef, in flex API

  - DocsEnum.positions now returns null if no positions were indexed
(ie omitTFAP was set for the field).  Also fixed Phrase/SpanQuery
to throw IllegalStateException when run against an omitTFAP
field.

  - Rename PositionsConsumer.addPosition -> PositionsConsumer.add


 Wrapup flexible indexing
 

 Key: LUCENE-2111
 URL: https://issues.apache.org/jira/browse/LUCENE-2111
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Flex Branch
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
 LUCENE-2111.patch, LUCENE-2111.patch


 Spinoff from LUCENE-1458.
 The flex branch is in fairly good shape -- all tests pass, initial search 
 performance testing looks good, it survived several visits from the Unicode 
 policeman ;)
 But it still has a number of nocommits, could use some more scrutiny 
 especially on the emulate old API on flex index and vice/versa code paths, 
 and still needs some more performance testing.  I'll do these under this 
 issue, and we should open separate issues for other self contained fixes.
 The end is in sight!




[jira] Commented: (LUCENE-2215) paging collector

2010-01-19 Thread Adam Heinz (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802276#action_12802276
 ] 

Adam Heinz commented on LUCENE-2215:


Awesome, thanks!  I'll schedule some time in the coming week to patch our dev 
installation and sic some QA guys on it.





[jira] Updated: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize

2010-01-19 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2213:
---

Attachment: LUCENE-2213.patch

New patch, just renaming to ArrayUtil.oversize.

 Small improvements to ArrayUtil.getNextSize
 ---

 Key: LUCENE-2213
 URL: https://issues.apache.org/jira/browse/LUCENE-2213
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2213.patch, LUCENE-2213.patch, LUCENE-2213.patch, 
 LUCENE-2213.patch


 Spinoff from the java-dev thread "Dynamic array reallocation algorithms" 
 started on Jan 12, 2010.
 Here's what I did:
   * Keep the +3 for small sizes
   * Added 2nd arg = number of bytes per element.
   * Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively)
   * Still grow by 1/8th
   * If 0 is passed in, return 0 back
 I also had to remove some asserts in tests that were checking the actual 
 values returned by this method -- I don't think we should test that (it's an 
 impl. detail).
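The growth policy listed above can be sketched roughly as follows; this is a hypothetical simplification for illustration, not the actual ArrayUtil.oversize source:

```java
public final class OversizeSketch {
    // minTargetSize: smallest acceptable array length;
    // bytesPerElement: 1, 2, 4, or 8
    public static int oversize(int minTargetSize, int bytesPerElement) {
        if (minTargetSize == 0) {
            return 0; // if 0 is passed in, return 0 back
        }
        // grow by 1/8th, but keep the +3 for small sizes
        int extra = minTargetSize >> 3;
        if (extra < 3) {
            extra = 3;
        }
        int newSize = minTargetSize + extra;
        // round the element count up so the byte size lands on an
        // 8-byte boundary (this sketch assumes a 64-bit JRE)
        switch (bytesPerElement) {
            case 4:  return (newSize + 1) & 0x7ffffffe; // multiple of 2
            case 2:  return (newSize + 3) & 0x7ffffffc; // multiple of 4
            case 1:  return (newSize + 7) & 0x7ffffff8; // multiple of 8
            default: return newSize; // 8 bytes/element: already aligned
        }
    }
}
```

The alignment step is why the issue notes that tests should not assert the exact returned values: the result depends on the element width and JRE word size, not just the requested length.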




[jira] Commented: (LUCENE-1410) PFOR implementation

2010-01-19 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802335#action_12802335
 ] 

Paul Elschot commented on LUCENE-1410:
--

The only reason the number of compressed integers is encoded in the block 
header here is that when I coded it I did not know that this was not necessary 
in Lucene indexes.

That also means that the header can be used for different compression methods, 
for example in the following way, with the cases encoded in the 1st byte:
 - 32 FrameOfRef cases (#frameBits), followed by 3 bytes for #exceptions (0 
   for BITS, 0 for PFOR)
 - 16-64 cases for a SimpleNN variant
 - 1-8 cases for run length encoding (for example followed by 3 bytes for 
   length and value)
Total #cases is 49-104, or 6-7 bits.

Run length encoding is good for terms that occur in every document and for the 
frequencies of primary keys.

The only concern I have is that the instruction cache might get filled up with 
the code for all these decoding cases.
At the moment I don't know how to deal with that other than by adding such 
cases slowly while doing performance tests all the time.






Re: Lucene memory consumption

2010-01-19 Thread Sanne Grinovero
Hello Frederic,
I'm CCing java-dev@lucene.apache.org as Michael McCandless has been
very helpful on IRC in discussing the ThreadLocal implication, and it
would be nice if you could provide first-hand information.

There's a good read to start from at
http://issues.apache.org/jira/browse/LUCENE-1383
Basically your proposal has a problem: when you close the ThreadLocal
it only cleans up the resources stored by the current thread, not by
others; setting the reference to null also won't help.
Quoting the TLocal source's comment:
* However, since reference queues are not
 * used, stale entries are guaranteed to be removed only when
 * the table starts running out of space.

About your issues:
 1. A ThreadLocal object should normally be a singleton used as key to the 
 thread map. Here it is repeatedly created and destroyed!
It's only built in the constructor, and destroyed on close. So its
lifecycle is linked to the Analyzer / FieldCache using it, probably a
long time, or the appropriate time to clean things up.

 2. Setting t = null; is not affecting the garbage collection of the 
 ThreadLocal map since t is the key (hard ref) of the thread map.
Well, t is unfortunately being reused as a variable name: t = null;
clears the reference to the ThreadLocal, which really is the key
of the map used by the ThreadLocal and referenced by the current
Thread instance, and TLocal uses weak *keys*, not values (and the key
is the TLocal itself).

 3. There are no calls to t.remove() which would really clean the Map entry.
You could add one, but it would only clean up the garbage from the
current thread, so it's ok but not enough. The current impl makes
sure all the stuff is collected by wrapping it all in weak values.
Actually some stuff is not collected: the WeakReferences themselves,
but they point to going-to-be-collected stuff. These WeakReferences
are going to be removed when the TLocal table is full, and should be
harmless (?).
As you pointed out, since Lucene 3 it releases what is possible to
release eagerly, but that's a very slight optimization: you still
need the weak/hard-ref trick to clean the other values.

 4. A ThreadLocal Map is already a WeakReference for the value.
No, it's on the keys: a collected ThreadLocal will be cleaned up,
eventually :-/

 5. Leaving objects on a ThreadLocal after it is out of your control is bad 
 practice. Another task may reuse the Thread and find dirty objects there.
Agreed, but having weak values means it's not a big issue. Also it's
not meant to be used by the faint-hearted; just people writing their
own Analyzer could get this wrong :)

 6. We found (in all our tests) the hardRef Map to be completely unnecessary 
 in Lucene 2.4.1, but here I'm lacking more in-depth knowledge of the 
 lifecycle of the objects added to this CloseableThreadLocal.
Well, as it's being used as a cache, functionality will be the same
but performance should be worse. AFAIK all TokenFilters are able to
rebuild what they need when get() returns null; you might have a
problem in the unlikely case of the assertion at
org.apache.lucene.util.CloseableThreadLocal:68 failing, but again that
does not affect functionality (assuming assertions are disabled).

A vanilla ThreadLocal is obviously faster than this, but then we'd end
up reverting LUCENE-1383 and so introducing more pressure on the GC.

It would be very interesting to find out why your implementation is
performing better. Maybe because in your case Analyzers are used by
one thread at a time, and so you're not leaking memory?
Could you tell more about this to lucene-dev directly?
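For reference, the weak-value / hard-ref pattern discussed above can be sketched like this (a hypothetical simplification modeled loosely on o.a.l.util.CloseableThreadLocal, not its actual source):

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

public class WeakValueThreadLocal<T> {
    // each thread sees its value only through a WeakReference...
    private ThreadLocal<WeakReference<T>> t = new ThreadLocal<>();
    // ...and the only strong reference lives here, keyed by thread
    private Map<Thread, T> hardRefs = new WeakHashMap<>();

    public T get() {
        WeakReference<T> ref = t.get();
        return ref == null ? null : ref.get();
    }

    public synchronized void set(T value) {
        t.set(new WeakReference<>(value));
        hardRefs.put(Thread.currentThread(), value);
    }

    // must not be used after close(); fields are nulled deliberately
    public synchronized void close() {
        // dropping the hard refs lets every thread's value be collected,
        // even though the per-thread ThreadLocalMap entries linger until
        // the table runs out of space (no reference queue is used)
        hardRefs = null;
        t = null;
    }
}
```

The point of the indirection is exactly what the thread discusses: close() can release values stored by *all* threads, which plain ThreadLocal.remove() (current thread only) cannot do.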

Regards,
Sanne

2010/1/6 Frederic Simon fr...@jfrog.org:
 Thanks Emmanuel,
 Yes, the main issue is that the hardRef map in this class was forcing all the
 objects to go to the old generation space in the JVM GC, instead of staying
 at a ThreadLocal level. So, all objects put in the CloseableThreadLocal were
 GC'd only on full GC. On heavy Lucene usage, it generated around 500 MB of
 heap every 5 secs until full GC kicked in. Our problem is that we rely a lot
 on SoftReference for our cache and so this Lucene behavior is really bad for
 us (customer feedback:
 http://old.nabble.com/What's-the-memory-requirements-for-2.1.3--to27026622.html#a27026622
 ).
 With my class all objects stay in young gen and so the performance boost for
 us was huge.

 The issues with the class:

 1. A ThreadLocal object should normally be a singleton used as key to the
 thread map. Here it is repeatedly created and destroyed!
 2. Setting t = null; is not affecting the garbage collection of the
 ThreadLocal map since t is the key (hard ref) of the thread map.
 3. There are no calls to t.remove() which would really clean the Map entry.
 4. A ThreadLocal Map is already a WeakReference for the value.
 5. Leaving objects on a ThreadLocal after it is out of your control is bad
 practice. Another task may reuse the Thread and find dirty objects there.
 6. We found (in all our tests) the hardRef Map to 

[jira] Commented: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-19 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802449#action_12802449
 ] 

Paul Elschot commented on LUCENE-2217:
--

Btw, shouldn't IndexInput.bytes also be reallocated using 
ArrayUtils.getNextSize()?
The growth factor there is a hardcoded 1.25.

 SortedVIntList allocation should use ArrayUtils.getNextSize()
 -

 Key: LUCENE-2217
 URL: https://issues.apache.org/jira/browse/LUCENE-2217
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Paul Elschot
Assignee: Michael McCandless
Priority: Trivial
 Attachments: LUCENE-2217.patch, LUCENE-2217.patch


 See recent discussion on ArrayUtils.getNextSize().




[jira] Commented: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802456#action_12802456
 ] 

Michael McCandless commented on LUCENE-2217:


bq. Btw. shouldn't IndexInput.bytes also be reallocated using 
ArrayUtils.getNextSize()

+1  Wanna fold it into this patch?  (And any others you find..?).





[jira] Commented: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-19 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802510#action_12802510
 ] 

Paul Elschot commented on LUCENE-2217:
--

Well, it's not that I'm actively searching for more, but I'll provide another 
patch that includes IndexInput for this.

Would you have any idea about test cases for that?
:)







[jira] Commented: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.

2010-01-19 Thread Deepak (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802535#action_12802535
 ] 

Deepak commented on LUCENE-2205:


Hi Aaron,

I hope you will be able to post the files today.

Regards,
D

 Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and 
 the index pointer long[] and create a more memory efficient data structure.
 ---

 Key: LUCENE-2205
 URL: https://issues.apache.org/jira/browse/LUCENE-2205
 Project: Lucene - Java
  Issue Type: Improvement
 Environment: Java5
Reporter: Aaron McCurry
 Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt


 Basically packing those three arrays into a byte array with an int array as 
 an index offset.  
 The performance benefits are staggering on my test index (of size 6.2 GB, with 
 ~1,000,000 documents and ~175,000,000 terms): the memory needed to load the 
 terminfos into memory was reduced to 17% of its original size, from 291.5 
 MB to 49.7 MB.  The random access speed has been made better by 1-2%, load 
 time of the segments is ~40% faster as well, and full GCs on my JVM were 
 made 7 times faster.
 I have already performed the work and am offering this code as a patch.  
 Currently all tests in the trunk pass with this new code enabled.  I did write 
 a system property switch to allow for the original implementation to be used 
 as well:
 -Dorg.apache.lucene.index.TermInfosReader=default or small
 I have also written a blog post about this patch; here is the link:
 http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
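The packing idea from the description (parallel object arrays replaced by one byte[] indexed through an int[] of offsets) can be sketched as follows; the class and record encoding below are illustrative, not the patch's actual code:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Variable-length records laid out back-to-back in one byte[],
// located via an int[] of offsets: record i occupies
// data[offsets[i] .. offsets[i+1]).
public final class PackedRecords {
    private final byte[] data;
    private final int[] offsets;

    public PackedRecords(List<String> records) {
        offsets = new int[records.size() + 1];
        List<byte[]> encoded = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i < records.size(); i++) {
            byte[] b = records.get(i).getBytes(StandardCharsets.UTF_8);
            encoded.add(b);
            offsets[i] = pos;
            pos += b.length;
        }
        offsets[records.size()] = pos;
        data = new byte[pos];
        for (int i = 0; i < encoded.size(); i++) {
            System.arraycopy(encoded.get(i), 0, data, offsets[i], encoded.get(i).length);
        }
    }

    public String get(int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i],
                StandardCharsets.UTF_8);
    }
}
```

Two flat arrays replace millions of small objects, which is why the reported wins show up in heap size, segment load time, and full-GC cost rather than in raw lookup speed.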




[jira] Commented: (LUCENE-1990) Add unsigned packed int impls in oal.util

2010-01-19 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802542#action_12802542
 ] 

Toke Eskildsen commented on LUCENE-1990:


Introducing yet another level of indirection and making a 
byte/short/int/long-provider detached from the implementation of the packed 
values is tempting. I'm fairly afraid of the overhead of the extra 
method calls, but I'll try it and see what happens.

I've read your (Michael McCandless) code and I can see that the tiny interfaces 
for Reader and Writer work well for your scenario. However, as the Reader must 
have (fast) random access, wouldn't it make sense to make it possible to update 
values? That way the same code can be used to hold ords for sorting and similar 
structures.

Instead of Reader, we could use

{code}
abstract class Mutator {
  public abstract long get(int index);
  public abstract long set(int index, long value);
}
{code}

...should the index also be a long? No need to be bound by Java's 31-bit limit 
on array-length, although I might very well be over-engineering here.

The whole 32-bit vs. 64-bit backing array question does present a bit of a 
problem with persistence. We'll be in a situation where the index will be 
optimized for the architecture used for building, not the one used for 
searching. Leaving the option of a future mmap open means that it is not 
possible to do a conversion when retrieving the bits, so I have no solution 
for this (other than doing memory-only).
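A mutable packed-ints holder along the lines described above could look roughly like this; the class name and layout are hypothetical, not the eventual Lucene API:

```java
// n-bit unsigned values packed contiguously into a long[] backing array;
// supports both get and set, so the same structure can hold ords.
public final class PackedMutator {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    public PackedMutator(int valueCount, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.mask = bitsPerValue == 64 ? ~0L : (1L << bitsPerValue) - 1;
        this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
    }

    public long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int read = 64 - shift;
        if (read < bitsPerValue) { // the value spans two blocks
            value |= blocks[block + 1] << read;
        }
        return value & mask;
    }

    public void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((value & mask) << shift);
        int written = 64 - shift;
        if (written < bitsPerValue) { // spill the high bits into the next block
            long spillMask = mask >>> written;
            blocks[block + 1] = (blocks[block + 1] & ~spillMask)
                              | ((value & mask) >>> written);
        }
    }
}
```

Note the persistence concern still applies: this sketch hardwires a long[] backing, which is exactly the build-architecture vs. search-architecture trade-off discussed above.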

 Add unsigned packed int impls in oal.util
 -

 Key: LUCENE-1990
 URL: https://issues.apache.org/jira/browse/LUCENE-1990
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1990.patch, 
 LUCENE-1990_PerformanceMeasurements20100104.zip


 There are various places in Lucene that could take advantage of an
 efficient packed unsigned int/long impl.  EG the terms dict index in
 the standard codec in LUCENE-1458 could substantially reduce its RAM
 usage.  FieldCache.StringIndex could as well.  And I think load-into-RAM
 codecs like the one in TestExternalCodecs could use this too.
 I'm picturing something very basic like:
 {code}
 interface PackedUnsignedLongs  {
   long get(long index);
   void set(long index, long value);
 }
 {code}
 Plus maybe an iterator for getting and maybe also for setting.  If it
 helps, most of the usages of this inside Lucene will be write once
 so eg the set could make that an assumption/requirement.
 And a factory somewhere:
 {code}
   PackedUnsignedLongs create(int count, long maxValue);
 {code}
 I think we should simply autogen the code (we can start from the
 autogen code in LUCENE-1410), or, if there is an good existing impl
 that has a compatible license that'd be great.
 I don't have time near-term to do this... so if anyone has the itch,
 please jump!




[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

2010-01-19 Thread Vilaythong Southavilay (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802568#action_12802568
 ] 

Vilaythong Southavilay commented on LUCENE-1488:


I am developing an IR system for Lao. I've been searching for this kind of 
analyzer to use in my development to index documents containing languages 
like Lao, French and English in one single passage.

I tested it for the Lao language with Lucene 2.9 and 3.0 using my short 
passage. It worked correctly for both versions as I expected, especially for 
segmenting Lao single syllables. I also tried it with the bi-gram filter 
option for two syllables, which worked fine for simple words. The result 
contained some two-syllable words which do not make sense in the Lao language; 
I guess this is not a big issue. As Robert pointed out (in an email to me), we 
still need dictionary-based word segmentation for Lao, which can be integrated 
in ICU and used by this analyzer.

Anyway, thanks for your assistance. This work will be helpful not only for 
Lao, but for others as well, because it's good to have a common analyzer for 
Unicode characters.

I'll continue testing it and report any problems if I find any. 

 multilingual analyzer based on icu
 --

 Key: LUCENE-1488
 URL: https://issues.apache.org/jira/browse/LUCENE-1488
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, 
 LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt


 The standard analyzer in Lucene is not exactly Unicode-friendly with regard 
 to breaking text into words, especially with respect to non-alphabetic 
 scripts.  This is because it is unaware of Unicode bounds properties.
 I actually couldn't figure out how the Thai analyzer could possibly be 
 working until I looked at the jflex rules and saw that the codepoint range 
 for most of the Thai block was added to the alphanum specification. Defining 
 the exact codepoint ranges like this for every language could help with the 
 problem, but you'd basically be reimplementing the bounds properties already 
 stated in the Unicode standard. 
 In general it looks like this kind of behavior is bad in Lucene even for 
 Latin: for instance, the analyzer will break words around accent marks in 
 decomposed form. While most Latin letter + accent combinations have composed 
 forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, 
 I suppose.) 
 I've got a partially tested StandardAnalyzer that uses the ICU rule-based 
 BreakIterator instead of jflex. Using this method you can define word 
 boundaries according to the Unicode bounds properties. After getting it into 
 some good shape I'd be happy to contribute it for contrib, but I wonder if 
 there's a better solution so that out-of-box Lucene will be more friendly to 
 non-ASCII text. Unfortunately it seems jflex does not support use of these 
 properties such as [\p{Word_Break = Extend}] so this is probably the major 
 barrier.
 Thanks,
 Robert




NRT and IndexSearcher performance

2010-01-19 Thread jchang

The javadocs for IndexSearcher in Lucene 3.0.0 read: "For performance
reasons it is recommended to open only one IndexSearcher and use it for all
of your searches."

However, to use NRT, it seems I have to do this for every search, which
contradicts the advice above:
IndexSearcher myIndexSearcher = new
IndexSearcher(myIndexWriter.getReader());

Is there any way to take advantage of NRT and not run into these performance
problems under heavy load?

Is the advice from the javadoc above aimed more at
IndexSearcher(org.apache.lucene.store.Directory directory)?  Or is it also
aimed at  IndexSearcher(org.apache.lucene.index.IndexReader indexReader),
which I believe I have to use to get NRT (correct me if I am wrong)?
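One pattern that addresses this (sketched here with illustrative, non-Lucene names) is to share a single searcher across all queries and swap it only when a fresh NRT reader is available, rather than constructing a new searcher per search:

```java
import java.util.function.Supplier;

// Hypothetical sketch: readerSource stands in for something like
// writer.getReader(); R stands in for a reader/searcher pair.
public class SearcherManagerSketch<R> {
    private final Supplier<R> readerSource;
    private volatile R current;

    public SearcherManagerSketch(Supplier<R> readerSource) {
        this.readerSource = readerSource;
        this.current = readerSource.get();
    }

    // every query shares the current searcher
    public R acquire() {
        return current;
    }

    // called periodically or after commits, NOT once per query
    public synchronized void maybeReopen() {
        current = readerSource.get();
    }
}
```

The per-query cost is then just a volatile read; the expensive reader/searcher construction happens only as often as you choose to refresh. (A production version would also reference-count readers so in-flight searches can finish before an old reader is closed.)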
-- 
View this message in context: 
http://old.nabble.com/NRT-and-IndexSearcher-performance-tp27235434p27235434.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

2010-01-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802596#action_12802596
 ] 

Robert Muir commented on LUCENE-1488:
-

Thanks for sharing those results! Yes, the bigram behavior (right now enabled 
for Han, Lao, Khmer, and Myanmar) is an attempt to boost relevance in a 
consistent way, since we do not have dictionary-based word segmentation for 
those writing systems, only the ability to segment into syllables.
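The bigram idea can be illustrated with a toy sketch (a hypothetical helper, not the patch's actual TokenFilter): adjacent syllables are joined into overlapping pairs, and a lone syllable passes through unchanged:

```java
import java.util.ArrayList;
import java.util.List;

public final class SyllableBigrams {
    // input: syllables in order; output: overlapping bigram tokens
    public static List<String> bigrams(List<String> syllables) {
        List<String> out = new ArrayList<>();
        if (syllables.size() == 1) {
            out.add(syllables.get(0)); // nothing to pair with
        }
        for (int i = 0; i + 1 < syllables.size(); i++) {
            out.add(syllables.get(i) + syllables.get(i + 1));
        }
        return out;
    }
}
```

Because real word boundaries are unknown, indexing overlapping pairs guarantees that any two-syllable word is represented by at least one token, at the cost of also emitting pairs that cross word boundaries (the nonsense two-syllable terms observed above).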

In the next patch I'll make it easier to configure this behavior, and turn it 
off when you want, without writing your own analyzer.

I am glad to hear the syllable segmentation algorithm is working well! 
The credit really belongs to the PAN Localization Project; I simply implemented 
the algorithm described here: 
http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf
You can see the code in Lao.rbbi in the patch. Warning: as it mentions, I am 
pretty sure Lao numeric digits are not yet working correctly, but hopefully I 
will fix those too in the next version.


 multilingual analyzer based on icu
 --

 Key: LUCENE-1488
 URL: https://issues.apache.org/jira/browse/LUCENE-1488
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, 
 LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt


 The standard analyzer in Lucene is not exactly Unicode-friendly with regard 
 to breaking text into words, especially for non-alphabetic scripts.  This is 
 because it is unaware of the Unicode word-boundary properties.
 I actually couldn't figure out how the Thai analyzer could possibly be 
 working until I looked at the JFlex rules and saw that the codepoint range 
 for most of the Thai block was added to the alphanum specification.  Defining 
 exact codepoint ranges like this for every language could help with the 
 problem, but you'd basically be reimplementing the boundary properties 
 already stated in the Unicode standard. 
 In general this kind of behavior is bad in Lucene even for Latin text: for 
 instance, the analyzer will break words around accent marks in decomposed 
 form.  While most Latin letter + accent combinations have composed forms in 
 Unicode, some do not (this is also an issue for ASCIIFoldingFilter, I 
 suppose). 
 I've got a partially tested StandardAnalyzer that uses the ICU rule-based 
 BreakIterator instead of JFlex.  Using this method you can define word 
 boundaries according to the Unicode boundary properties.  After getting it 
 into good shape I'd be happy to contribute it to contrib, but I wonder if 
 there's a better solution so that out of the box Lucene will be friendlier to 
 non-ASCII text.  Unfortunately JFlex does not seem to support these 
 properties, such as [\p{Word_Break = Extend}], so this is probably the major 
 barrier.
 Thanks,
 Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.

2010-01-19 Thread Aaron McCurry (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron McCurry updated LUCENE-2205:
--

Attachment: TermInfosReaderIndexDefault.java
TermInfosReaderIndex.java
TermInfosReader.java

Here is the patch as it exists now.  It no longer needs any modifications to the Term.java file.

 Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and 
 the index pointer long[] and create a more memory efficient data structure.
 ---

 Key: LUCENE-2205
 URL: https://issues.apache.org/jira/browse/LUCENE-2205
 Project: Lucene - Java
  Issue Type: Improvement
 Environment: Java5
Reporter: Aaron McCurry
 Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt, 
 TermInfosReader.java, TermInfosReaderIndex.java, 
 TermInfosReaderIndexDefault.java


 Basically packing those three arrays into a byte array with an int array as 
 an index offset.  
 The performance benefits are staggering on my test index (6.2 GB, with 
 ~1,000,000 documents and ~175,000,000 terms): the memory needed to load the 
 terminfos was reduced to 17% of its original size, from 291.5 MB to 49.7 MB. 
 Random access speed improved by 1-2%, segment load time is ~40% faster, and 
 full GCs on my JVM became 7 times faster.
 I have already performed the work and am offering this code as a patch.  
 Currently all tests in the trunk pass with this new code enabled.  I also 
 wrote a system property switch that allows the original implementation to be 
 used as well:
 -Dorg.apache.lucene.index.TermInfosReader=default or small
 I have also written a blog post about this patch; here is the link:
 http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
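The packing idea — replacing many small term objects with one byte[] holding their encoded bytes back to back, plus an int[] of start offsets — can be sketched self-containedly. This is a hypothetical illustration of the memory layout (the PackedTermIndex name and its methods are mine), not the actual patch code:

```java
import java.nio.charset.StandardCharsets;

// Sketch: pack sorted terms into a single byte[] indexed by an int[] of
// offsets, so lookups need no per-term object headers or String instances.
public class PackedTermIndex {
    private final byte[] data;
    private final int[] offsets; // offsets[i] = start of term i; length n+1

    PackedTermIndex(String[] sortedTerms) {
        offsets = new int[sortedTerms.length + 1];
        byte[][] encoded = new byte[sortedTerms.length][];
        int size = 0;
        for (int i = 0; i < sortedTerms.length; i++) {
            encoded[i] = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
            offsets[i] = size;
            size += encoded[i].length;
        }
        offsets[sortedTerms.length] = size;
        data = new byte[size];
        for (int i = 0; i < sortedTerms.length; i++) {
            System.arraycopy(encoded[i], 0, data, offsets[i], encoded[i].length);
        }
    }

    // Decode term i lazily from the packed bytes.
    String term(int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i],
                StandardCharsets.UTF_8);
    }

    // Binary search over the packed, sorted terms; -1 if absent.
    int indexOf(String term) {
        int lo = 0, hi = offsets.length - 2;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = term(mid).compareTo(term);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        PackedTermIndex idx =
                new PackedTermIndex(new String[] {"apple", "banana", "cherry"});
        System.out.println(idx.indexOf("banana")); // prints 1
        System.out.println(idx.indexOf("durian")); // prints -1
    }
}
```

The memory win comes from dropping per-String object overhead and pointer arrays; the cost is decoding on access, which the reported 1-2% access-speed change suggests is cheap.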




[jira] Updated: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.

2010-01-19 Thread Aaron McCurry (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron McCurry updated LUCENE-2205:
--

Attachment: TermInfosReaderIndexSmall.java

Here's the last file.  I have also back-patched 3.0.0 and 2.9.1 and placed them 
on my blog in case you want a drop-in replacement to try out.

http://www.nearinfinity.com/blogs/aaron_mccurry/low_memory_patch_for_lucene.html


 Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and 
 the index pointer long[] and create a more memory efficient data structure.
 ---

 Key: LUCENE-2205
 URL: https://issues.apache.org/jira/browse/LUCENE-2205
 Project: Lucene - Java
  Issue Type: Improvement
 Environment: Java5
Reporter: Aaron McCurry
 Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt, 
 TermInfosReader.java, TermInfosReaderIndex.java, 
 TermInfosReaderIndexDefault.java, TermInfosReaderIndexSmall.java






[jira] Issue Comment Edited: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.

2010-01-19 Thread Aaron McCurry (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802632#action_12802632
 ] 

Aaron McCurry edited comment on LUCENE-2205 at 1/20/10 2:57 AM:


Here's the last file.  I have also back-patched 3.0.0 and 2.9.1 and placed them 
on my blog in case you want a drop-in replacement to try out.

http://www.nearinfinity.com/blogs/aaron_mccurry/low_memory_patch_for_lucene.html


  was (Author: amccurry):
Here's the last file.  I have also back patched 3.0.0 and 2.9.1 and place 
them on my blog incase you want to have a drop in replacement to try out.

http://www.nearinfinity.com/blogs/aaron_mccurry/low_memory_patch_for_lucene.html

  
 Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and 
 the index pointer long[] and create a more memory efficient data structure.
 ---

 Key: LUCENE-2205
 URL: https://issues.apache.org/jira/browse/LUCENE-2205
 Project: Lucene - Java
  Issue Type: Improvement
 Environment: Java5
Reporter: Aaron McCurry
 Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt, 
 TermInfosReader.java, TermInfosReaderIndex.java, 
 TermInfosReaderIndexDefault.java, TermInfosReaderIndexSmall.java






Re: NRT and IndexSearcher performance

2010-01-19 Thread Jason Rutherglen
J,

The javadocs are saying that there's no need to create a new
IndexSearcher for each query.

Jason

On Tue, Jan 19, 2010 at 5:04 PM, jchang jchangkihat...@gmail.com wrote:

 The javadocs for IndexSearcher in Lucene 3.0.0 read:  For performance
 reasons it is recommended to open only one IndexSearcher and use it for all
 of your searches.

 However, to use NRT, it seems I have to do this for every search, which
 contradicts the advice above:
    IndexSearcher myIndexSearcher = new
 IndexSearcher(myIndexWriter.getReader());

 Is there any way to take advantage of NRT and not run into these performance
 problems under heavy load?

 Is the advice from the javadoc above aimed more at
 IndexSearcher(org.apache.lucene.store.Directory directory)?  Or is it also
 aimed at  IndexSearcher(org.apache.lucene.index.IndexReader indexReader),
 which I believe I have to use to get NRT (correct me if I am wrong)?
 --
 View this message in context: 
 http://old.nabble.com/NRT-and-IndexSearcher-performance-tp27235434p27235434.html
 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org






Re: NRT and IndexSearcher performance

2010-01-19 Thread John Wang
I think the question here really is the cost of creating new IndexReader
instances per query.

Calling IndexWriter.getReader() for each query has been shown to be expensive
in our benchmarks and previous discussions.
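The reuse-and-refresh pattern under discussion can be sketched with the Lucene types stubbed out so the example is self-contained. The SearcherHolder and Factory names are hypothetical; in real code, isCurrent/reopen would map to IndexReader.isCurrent() and IndexReader.reopen(), and the searcher would wrap a reader from IndexWriter.getReader():

```java
// Sketch: keep one searcher and refresh it only when the index has changed,
// instead of constructing a new searcher (and reader) for every query.
public class SearcherHolder<S> {
    public interface Factory<T> {
        boolean isCurrent(T searcher); // e.g. delegates to IndexReader.isCurrent()
        T reopen(T searcher);          // e.g. wraps IndexReader.reopen()
    }

    private final Factory<S> factory;
    private S current;

    public SearcherHolder(S initial, Factory<S> factory) {
        this.current = initial;
        this.factory = factory;
    }

    // Cheap in the common case: the searcher is only rebuilt after a change.
    public synchronized S acquire() {
        if (!factory.isCurrent(current)) {
            current = factory.reopen(current);
        }
        return current;
    }
}
```

This keeps the "open only one IndexSearcher" advice intact under NRT: the per-query cost is an isCurrent check, and the reopen cost is paid once per index change rather than once per search.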

-John

On Tue, Jan 19, 2010 at 8:12 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 J,

 The javadocs are illustrating there's no need to create new
 IndexSearchers for each query.

 Jason

 On Tue, Jan 19, 2010 at 5:04 PM, jchang jchangkihat...@gmail.com wrote:
 
  The javadocs for IndexSearcher in Lucene 3.0.0 read:  For performance
  reasons it is recommended to open only one IndexSearcher and use it for
 all
  of your searches.
 
  However, to use NRT, it seems I have to do this for every search, which
  contradicts the advice above:
 IndexSearcher myIndexSearcher = new
  IndexSearcher(myIndexWriter.getReader());
 
  Is there any way to take advantage of NRT and not run into these
 performance
  problems under heavy load?
 
  Is the advice from the javadoc above aimed more at
  IndexSearcher(org.apache.lucene.store.Directory directory)?  Or is it
 also
  aimed at  IndexSearcher(org.apache.lucene.index.IndexReader indexReader),
  which I believe I have to use to get NRT (correct me if I am wrong)?
  --
  View this message in context:
 http://old.nabble.com/NRT-and-IndexSearcher-performance-tp27235434p27235434.html
  Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
 
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
