Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Toke Eskildsen
On Wed, 2020-06-24 at 13:46 -0400, Alex K wrote:
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset). 

Without a clearly defined use case, I would say that the sequential
scan approach is not the right one: As these things go, someone will
come along and ask for scaling into the billions of images. "Someone"
might be my organization BTW: We do have a web archive and finding
similar images in that would be quite useful.

> Are you using Elasticsearch or Lucene directly?

None at the moment, as the driving project is currently on hold until
fall (at the earliest), and it was paused when I was about to switch
from prototyping (https://github.com/kb-dk/fairly-similar) to real
implementation. Hopefully I can twist another project in the direction
of using the same technology. If not, I'll just have to do it on my own
time :-)

I was hoping to use it with Solr, with an expectation of introducing
the necessary lower-level mechanisms (AND & bit count of binary content)
at the Lucene level. Failing that, maybe Lucene directly. Using
Elasticsearch is a bit of a challenge, as we don't run it currently and
it would have to be added to Operations' support list.

> If you're using ES and have the time, I'd love some feedback on my
> plugin.

Sorry, not at the moment. Too many balls in the air before summer
vacation starts. I hope to find the time in August. Your post was just
too relevant to ignore.

> Also I've compiled a small literature review on some related research
> here: 
> https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit

You are clearly way ahead of us and I'll shamelessly piggyback on your
findings. I skimmed your notes and they look extremely useful.

> Fast and Exact NNS in Hamming Space on Full-Text Search Engines
> describes some clever tricks to speed up Hamming similarity.

The autoencoder-approach produces bitmaps where each bit is a distinct
signal, so I guess comparison would be equivalent to binary Hamming
distance?

> Large Scale Image Retrieval with Elasticsearch describes the idea of
> using the largest absolute magnitude values instead of the full
> vector.

That approach was very promising in our local proof of concept.

> Perhaps you've already read them but I figured I'd share.

A few of them, but not all. And your notes on the articles are great.

Thanks,
Toke Eskildsen, Royal Danish Library



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Toke Eskildsen
On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> I'm working on an Elasticsearch plugin (using Lucene internally) that
> allows users to index numerical vectors and run exact and approximate
> k-nearest-neighbors similarity queries.

Quite a coincidence. I'm looking into the same thing :-)

>   1. When indexing a vector, apply a hash function to it, producing
> a set of discrete hashes. Usually there are anywhere from 100 to 1000
> hashes.

Is it important to have "infinite" scaling with an inverted index or is
it acceptable to have a (fast) sequential scan through all documents? If
the use case is to combine the nearest neighbour search with other
filters, so that the effective search-space is relatively small, you
could go directly to computing the Euclidean distance (or whatever you
use to calculate the exact similarity score).

>   4. As the BooleanQuery produces results, maintain a fixed-size
> heap of its scores. For any score exceeding the min in the heap, load
> its vector from the binary doc values, compute the exact similarity,
> and update the heap.

I did something quite similar for a non-Lucene-based proof of concept,
except that I delayed the exact similarity calculation and over-
collected on the heap.

Fleshing that out: Instead of producing similarity hashes, I extracted
the top-X strongest signals (entries in the vector) and stored them as
indexes from the raw vector, so the top-3 signals from [10, 3, 6, 12,
1, 20] are [0, 3, 5]. The query was similar to your "match as many as
possible", just with indexes instead of hashes.

>- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> runtime)

This sounds strange. How large is your queue? Object-based priority
queues tend to become slow when they get large (100K+ values).

> Maybe I could optimize this by implementing a custom query or scorer?

My plan for a better implementation is to use an autoencoder to produce
a condensed representation of the raw vector for a document. In order
to do so, a network must be trained on (ideally) the full corpus, so it
will require a bootstrap process and will probably work poorly if
incoming vectors differ substantially in nature from the existing ones
(at least until the autoencoder is retrained and the condensed
representations are reindexed). As our domain is an always growing
image collection with fairly defines types of images (portraits, line
drawings, maps...) and since new types are introduced rarely, this is
acceptable for us.

Back to Lucene, the condensed representation is expected to be a bitmap
where the (coarse) similarity between two representations is simply the
number of set bits at the same locations: An AND and a POPCNT of the
bitmaps.
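
In code, the coarse score could be as simple as this sketch (assuming the
condensed bitmaps are stored as long[] of equal length):

/** Coarse similarity: number of positions where both bitmaps have a set bit (AND + POPCNT). */
static int coarseSimilarity(long[] a, long[] b) {
    int matching = 0;
    for (int i = 0; i < a.length; i++) {
        matching += Long.bitCount(a[i] & b[i]); // Long.bitCount maps to POPCNT on modern CPUs
    }
    return matching;
}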

This does imply a sequential pass of all potential documents, which
means that it won't scale well. On the other hand each comparison is a
fast check with very low memory overhead, so I hope it will work for
the few million images that we need it for.

- Toke Eskildsen, Royal Danish Library



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Toke Eskildsen
Dawid Weiss  wrote:
> Merging segments as large as this one requires not just CPU, but also
> serious I/O throughput efficiency. I assume you have fast NVMe drives
> on that machine, otherwise it'll be slow, no matter what. It's just a
> lot of bytes going back and forth.

We have quite a lot of experience with creating fully merged 900GB indexes. On 
our plain-SSD (Samsung 840) equipped machine this took ~8 hours with a single 
CPU-core at 100%. On our 7200 RPM spinning drive machines (same CPU class) it 
took nearly twice as long. Back of the envelope says reading & writing 900GB in 
8 hours is 2*900GB/(8*60*60s) = 64MB/s. I don't remember the interface for our 
SSD machine, but even with SATA II this is only ~1/5th of the possible fairly 
sequential IO throughput. So for us at least, NVMe drives are not needed to 
have single-threaded CPU as bottleneck.

And +1 to the issue BTW. It does not matter too much for us now, as we have 
shifted to a setup where we build more indexes in parallel, but 3 years ago our 
process was sequential so the 8 hour delay before building the next part was a 
bit of an annoyance.

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about threading in search

2018-08-17 Thread Toke Eskildsen
On Sat, 2017-09-02 at 18:33 -0700, Peilin Yang wrote:
> we're comparing two different indexes on the same collection - one
> with lots of different segments (default settings), and one with a
> force merged into one segment. It seems that search is sometimes
> faster with multiple segments.

If you are using Lucene 7+ and if some of the fields you are requesting
as part of your search result are stored as DocValues, you might have
encountered a performance regression with the streaming API:
https://issues.apache.org/jira/browse/LUCENE-8374

One peculiar effect of this issue is that fewer, larger segments get
slower DocValues retrieval than many smaller segments do. So a
force merge to 1 segment can result in worse performance.

- Toke Eskildsen, the Royal Danish Library, Denmark


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Why Two Levels of Indirection in BytesRefHash class ?

2016-12-11 Thread Toke Eskildsen
Adrien Grand  wrote:
> That would work if you are only interested in using BytesRefHash as a hash
> set for byte[]. However these incremental ids are useful if you want to
> associate data with each byte[]: you can create parallel arrays and use the
> ids returned by the BytesRefHash as indices in these arrays.

That could be solved by prepending the stored BytesRef with the counter value, 
then using a fixed +4 delta to the offset to get the BytesRef. Same space 
requirements as now, but with one less level of indirection meaning less 
CPU-cache invalidation.

However, this removes the nice property of providing insertion-order 
iterability of the DocValues in the structure, so it would be quite a change to 
current code.


One optimization, while we are on the subject, is to exploit the indirection. 
As the bytesStarts are monotonic incremental offsets in the ByteBlockPool, 
there is no need to store the length of the BytesRefs. They can be calculated 
with bytesStarts[id+1] - bytesStarts[id]. This saves 1-2 bytes per entry and 
upholds memory locality, so it should have the same performance as now (needs 
to be tested of course).

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: 500 millions document for loop.

2015-11-12 Thread Toke Eskildsen
Valentin Popov  wrote:

> We have ~10 indexes for 500M documents, each document
> has «archive date», and «to» address, one of our task is
> calculate statistics of «to» for last year. Right now we are
> using search archive_date:(current_date - 1 year) and paginate
> results for 50k records for page. Bottleneck of that approach,
> pagination take too long time and on powerful server it take 
>~20 days to execute, and it is very long.

Lucene does not like deep page requests due to the way the internal Priority 
Queue works. Solr has CursorMark, which should be fairly simple to emulate in 
your Lucene handling code:

http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
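
For reference, a minimal sketch of that emulation using IndexSearcher.searchAfter
(Lucene 5.x API assumed; what you do with each hit is left as a comment):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.*;

static void iterateAll(IndexSearcher searcher, Query query, int pageSize) throws IOException {
    ScoreDoc last = null;
    while (true) {
        // Each page continues right after the last hit of the previous page - no deep paging
        TopDocs page = (last == null)
            ? searcher.search(query, pageSize)
            : searcher.searchAfter(last, query, pageSize);
        if (page.scoreDocs.length == 0) break;
        for (ScoreDoc sd : page.scoreDocs) {
            Document doc = searcher.doc(sd.doc); // extract "to", update the statistics, ...
        }
        last = page.scoreDocs[page.scoreDocs.length - 1];
    }
}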

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: one large index vs many small indexes

2015-11-11 Thread Toke Eskildsen
Lutz Fechner  wrote:
> separated index will allow you split IO load over multiple
> physical drives as well as use different Analyzers (if your
> customers are having different content that will make sense).

Other ways to get better IO are RAID, SSD or RAM.

Multiple indexes make a lot of sense from a functionality point of view 
(logistics, ranking, individualization), but it loses on price/performance if 
most of the data are in use most of the time. It boils down to the overhead of 
running an index.

Discussing this on the abstract level is hard as there are so many variables 
influencing the decision. The quality of our guesswork is proportional to the 
amount of information you give us, Sascha. It would help if we knew more, such 
as

* How many customers?
* How many customers in a year?
* How large is the average index data size per customer?
* How many documents per customer?
* Are all customer data treated equal or are some of it specialized?
* Are the sizes fairly uniform or are there a few huge outliers?
* How often does a customer update the data?
* How often does a customer issue searches?
* How many concurrent requests will there be at peak time?
* Is it okay to have a slow first-search but faster subsequent searches?


- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 5 : are FixedBitSet and SparseFixedBitSet thread-safe?

2015-09-13 Thread Toke Eskildsen
Selva Kumar  wrote:
> Subject: Lucene 5 : are FixedBitSet and SparseFixedBitSet thread-safe? 

Short answer: No.

Longer answer: Reading values and calling methods that do not modify the 
structure is fine. Writing values is not safe, but can work for FixedBitSet, if 
you take care not to update values within the same 64-bit block from multiple 
threads at a time.

It is not too hard to make it thread-safe and efficient 
(AtomicLongArray.compareAndSet is your friend) and I made such a variant for a 
project, but the Solr code is generally very 1-request-1-thread, so there are a 
lot of places to change code in order for it to take advantage of a changed 
FixedBitSet. What is it you are trying to achieve?
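
For the record, a minimal sketch of the CAS-based idea (plain java.util.concurrent;
not the actual variant from my project):

import java.util.concurrent.atomic.AtomicLongArray;

class ConcurrentBitSet {
    private final AtomicLongArray words;

    ConcurrentBitSet(int numBits) { words = new AtomicLongArray((numBits + 63) >>> 6); }

    void set(int bit) {
        int word = bit >>> 6;
        long mask = 1L << bit; // Java long shifts only use the lower 6 bits of the shift amount
        long old;
        do {
            old = words.get(word);
            if ((old & mask) != 0) return; // already set
        } while (!words.compareAndSet(word, old, old | mask));
    }

    boolean get(int bit) { return (words.get(bit >>> 6) & (1L << bit)) != 0; }
}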


- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Toke Eskildsen
Arjen van der Meijden  wrote:
> On 9-8-2015 16:22, Toke Eskildsen wrote:
> > Maybe you could update the JavaDoc for that field to warn against using it?
> It (probably) depends on the contents of the values.

That was my impression too, but we both seem to be second-guessing Robert's 
very non-nuanced and clearly oft-repeated recommendation. I hope Robert can 
shed some light on this and tell us if he finds the JavaDocs to be in order or 
if binary DocValues should not be used at all.

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Toke Eskildsen
Robert Muir  wrote:
> I am tired of repeating this:
> Don't use BINARY docvalues
> Don't use BINARY docvalues
> Don't use BINARY docvalues

> Use types like SORTED/SORTED_SET which will compress the term
> dictionary and make use of ordinals in your application instead.

This seems contrary to
http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html

Maybe you could update the JavaDoc for that field to warn against using it?

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Toke Eskildsen
On Wed, 2015-05-20 at 09:12 +0200, András Péteri wrote:
> As Olivier wrote, multiple BytesRef instances can share the underlying byte
> array when representing slices of existing data, for performance reasons.
> 
> BytesRef#clone()'s javadoc comment says that the result will be a shallow
> clone, sharing the backing array with the original instance, and points
> to another utility method for deep cloning: BytesRef#deepCopyOf(BytesRef).

There is no hard contract for clone
(https://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#clone%28%29)
but the convention, as stated, is "the object returned by this method
should be independent of this object". That whole section of Oracle's
JavaDocs describes the exact opposite of BytesRef clone behaviour.

It really should have been the other way around, with BytesRef#clone
being deep and loyal to convention and the special method being
BytesRef#shallowCopyOf(BytesRef).


But we are where we are, so I don't find it viable to change behaviour.
More explicit documentation, as Dawid suggests, seems the best band aid.
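
For reference, a tiny demonstration of the behaviour being discussed (Lucene 4.x;
the variable names are mine):

import org.apache.lucene.util.BytesRef;

static void cloneDemo() {
    BytesRef original = new BytesRef("foo");
    BytesRef shallow = original.clone();           // shares original.bytes
    BytesRef deep = BytesRef.deepCopyOf(original); // independent byte array
    original.bytes[original.offset] = (byte) 'b';
    // shallow.utf8ToString() is now "boo" - the clone follows the change
    // deep.utf8ToString() is still "foo"
}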

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: large amount of data cause performance problem

2015-02-05 Thread Toke Eskildsen
On Thu, 2015-02-05 at 04:00 +0100, Heeheeya wrote:
> i am recently puzzled by performance problem using lucene while the
> search result set is large. do you have any advice?

Without any information, how are we to help you?


Start by reading 
https://wiki.apache.org/solr/SolrPerformanceProblems

If that does not help, give us some information to work with: How large
is your index (byte size and document count), what hardware do you have,
how large is your JVM heap, how many documents do you request at a time,
what is a typical query?

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene scalability query

2015-01-08 Thread Toke Eskildsen
On Thu, 2015-01-08 at 12:03 +0100, sreedevi s wrote:
>I am doing a scalability analysis for lucene search in my application.I
> was running my junits with different sets of data which are like
> 1K,10K,100K and 1000K.

[...]

Your table copy-paste did not work. I tried extracting the key data:

10K, attempt 1: 152ms
10K, attempt 2: 32ms
10K, attempt 3: 20ms

100K, attempt 1: 136ms
100K, attempt 2: 28ms
100K, attempt 3: 27ms

> Ideally, it search time should have been higher with 100K data.Why is it
> that I get lesser searcher time with 100K data.

Based on your reported index time, your indexes are tiny. What you are
seeing is probably just statistical flukes. Try re-running your tests a
few times and you will see the numbers change.

- Toke Eskildsen



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: A question on performance

2015-01-07 Thread Toke Eskildsen
rama44ster [rama44s...@gmail.com] wrote:

[3 MUST clauses]

> 50 = 16 ms
> 75 = 52 ms
> 90 = 121 ms
> 95 = 262 ms
> 99 = 76010 ms
> 99.9 = 76037 ms

> Is the latency expected to degrade when the number of docs is as high as
> 480M?

Try plotting response times as a function of hit count. My guess is that your 
99 and 99.9 percentiles are for really high hit counts, which will take a long 
time as they all need to be scored. Alternatively, if it is easier for you, 
just check the queries in the 99+ percentiles manually and see if they hit a 
lot of documents.

If your response times grow roughly linearly (with a bump at one point, due to the 
switch from sparse to non-sparse docsets) as a function of hit count, there is 
not much to do about it besides sharding, given the current single-threaded 
processing of Lucene queries.

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: how to "load" mmap directory into memory?

2014-12-03 Thread Toke Eskildsen
Li Li [fancye...@gmail.com] wrote:
> I am using mmap fs directory in lucene. My index is small (about 3GB
> in disk) and I have plenty of memory available. The problem is that
> when the term is first queried, it's slow. How can I "load" all
> directory into memory?

Yes. One of the smart things about MMap is that it is shared with the disk 
cache, so as long as you read the bytes somehow, for example with the Java 
program you suggest, it will be available in memory for Lucene.

On Linux, the simplest is to 'cat myindexfolder/* > /dev/null'. Do not use 'cp' 
as it is smart enough to bypass the operation if the destination is /dev/null.
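
If you would rather do the warming from inside the application, a minimal Java
sketch (error handling up to you):

import java.io.*;

static void warmIndex(File indexFolder) throws IOException {
    byte[] buffer = new byte[1 << 20];
    for (File file : indexFolder.listFiles()) {
        if (!file.isFile()) continue;
        try (InputStream in = new FileInputStream(file)) {
            // Discard the bytes; the point is just to pull them into the OS disk cache
            while (in.read(buffer) != -1) { }
        }
    }
}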

Caveat: This does not guarantee that your index stays fully cached. It can be 
evicted just like all other disk cache, if other programs perform enough disk 
operations.

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Memory consumption on lucene 2.4

2014-11-21 Thread Toke Eskildsen
Philippe Kernévez [pkerne...@octo.com] wrote:
> We use Lucene 2.4 (provided by Alfresco).

Lucene 2.4 is 6 years old. The obvious advice is to upgrade, but I guess you 
have your reasons not to.

> We looked at a memory dump with Eclipse Memory Analyser, and we were quite
> surprised to see that most of that memory is kept by enormous String[] that
> are yet mostly empty.

I am guessing you have a lot of documents in your index and that you are 
sorting on at least one String field?

http://www.lhelper.org/dev/lucene-2.4.0/docs/api/org/apache/lucene/search/Sort.html
states that sorting on String in Lucene means that all Strings for that field 
are kept in memory. There has to be one entry in the String array(s) for each 
document, even if the document does not have a value for that field.

If my guess is correct, the solution is to reduce the number of String sort 
fields, ideally to 0. Maybe you can use an integer field instead by doing some 
mapping?

> In our case we need to have some very short word indexed, so we desactivate
> 'stop words'. If we want to have the list of Term order by their index size
> what is good tool to do that (Luce?) and how ca we do such request ?

Luke has term statistics built-in. I don't remember the details, but I recall 
that it was straightforward.

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Caused by: java.lang.OutOfMemoryError: Map failed

2014-11-07 Thread Toke Eskildsen
Brian Call [brian.c...@soterawireless.com] wrote:
> Yep, you guys are correct, I’m supporting a slightly older version of our 
> product based
> on Lucene 3. 

> In my previous email I forgot to mention that I also bumped up the maximum 
> allowable
> file handles per process to 16k, which had been working well.

If you don't use compound indexes and all your indexes are handled under the 
same process constraint, then 16K seems quite low for hundreds of indexes. You 
could check by issuing a file count on your index folders.

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Caused by: java.lang.OutOfMemoryError: Map failed

2014-11-07 Thread Toke Eskildsen
Brian Call [brian.c...@soterawireless.com] wrote:

[Hundreds of indexes]

> ...
>at java.lang.Thread.run(Thread.java:724)
> Caused by: java.lang.OutOfMemoryError: Map failed
> at sun.nio.ch.FileChannelImpl.map0(Native Method)
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:846)

That error can also be thrown when the number of open files exceeds the given 
limit. "OutOfMemory" should really have been named "OutOfResources".

Check the maximum number of open files with 'ulimit -n'. Try raising it if it 
seems low.

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene Java Caching Question

2014-10-02 Thread Toke Eskildsen
parth_n [nagar...@asu.edu] wrote:
> I run my java application (set of spatial queries) to get the execution time
> and results for the queries. The application is terminated. Whenever I
> re-run the application with the same set of queries, the execution time is
> very low comparative to the first run. So I am assuming that there is some
> caching going on, but where is this stored?

It is the disk cache of your operating system. It is independent of Lucene and 
is in-memory. Most modern operating systems use all free memory for disk cache.

Lucene uses random access all the time and search speed is largely dictated by 
how fast it can do such reads. If the data are in your disk cache, they will be 
fetched _very_ fast.

> I have looked on for similar question on this forum, but it seems no one has
> come across this particular problem.

Problem? You mean for testing? Well, it is quite hard to test Lucene 
performance. Related to disk cache there are three strategies:

1) Empty the disk cache before you test (how you do that depends on your 
operating system). This makes the tests fairly repeatable, but says nothing 
about real world performance, as there is always some amount of caching going 
on when you're running for real.

2) Fill the disk cache, either by repeating your test a few times and measuring 
the last result or by reading all your index files into disk cache before you 
start (on linux, 'cat * > /dev/null' should work). Again this ensures test 
repeatability, but it is only representative of real world performance if your 
production index size is less than the amount of free memory.

3) Try to simulate a real setup, with some queries from your production system, 
before you start your test. This is tricky to get right, but the only 
somewhat-sound approximation of real world performance.

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: 回复: Speed up searching in multiple-thread?

2014-09-15 Thread Toke Eskildsen
On Mon, 2014-09-15 at 11:41 +0200, Harry Yu wrote:

> 17ms / searches is the whole process of search service, include
> accessing complete data form db, calling REST service etc.

Try looking at QTime in solr.log and compare it with your measured
response times, to see if it is Solr or your other services that are the
bottleneck.

- Toke Eskildsen, State and University Library, Denmark




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Speed up searching in multiple-thread?

2014-09-15 Thread Toke Eskildsen
On Mon, 2014-09-15 at 09:10 +0200, Harry Yu wrote:
> I'm developing poi search application using lucene 4.8 . Recently, I
> met a trouble that the performance of IndexSearcher.search is bad in
> multiple-thread environment. According the test results, I found that
> if thread number is 1,  the response time of searching is only 17ms.
> But that would be 400ms when I increased threads to 30. 

With a single thread you have 1000ms/s / 17ms ~= 59 searches/second.
For 30 threads you have 1000ms/s / 400ms * 30 = 75 searches/second

As you have a dual-core machine, the ideal outcome would be 2*59 = 118
searches/second. Accounting for fluctuations in measuring, more GC and a
bit of congestion, it does not seem unreasonable with 75.

(17ms/search seems like quite a long time for such a small index, but
that is independent of the threading issue)


What was the expected outcome of your test?

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching on Large Indexes

2014-06-27 Thread Toke Eskildsen
On Fri, 2014-06-27 at 12:33 +0200, Sandeep Khanzode wrote:
> I have an index that runs into 200-300GB. It is not frequently updated.

"not frequently" means different things for different people. Could you
give an approximate time span? If it is updated monthly, you might
consider a full optimization after update.

> What are the best strategies to query on this index?

> 1.] Should I, at index time, split the content, like a hash based
> partition, into multiple separate smaller indexes and aggregate the
> results programmatically?

Assuming you use multiple machines or independent storage for the
multiple indexes, this will bring down latency. Do this if your searches
are too slow.

>  2.] Should I replicate this index and provide some
> sort of document ID, and search on each node for a specific range of
> document IDs?

I don't really follow that idea. Are your searches only ID-based?

Anyway, replication increases throughput. Do this if your server have
trouble keeping up with the full amount of work.

>  3.] Is there any way I can split or move individual
> segments to different nodes and aggregate the results?

Copy the full index. Delete all documents in copy 1 that match one
half of your ID-hash function, and do the reverse for the other. As your
corpus is semi-randomly distributed, scores should be comparable between
the indexes so that the result sets can be easily merged.

But as Jigar says, you should consider switching to SolrCloud (or
ElasticSearch) which does all this for you.

> I am not fully aware of the large scale query strategies. Can you
> please share your findings or experiences?

Depends on what you mean by large scale. You have a running system -
what do you want? Scaling up? Lowering latency? Increasing throughput?
More complex queries?

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Toke Eskildsen
On Mon, 2014-06-23 at 13:58 +0200, Jamie wrote:
> How does one sort the results of a collector as opposed to the entire 
> result set?

With only 50K as page size, this should not be necessary. But for the
record, you do it by implementing a Collector that can potentially hold
all documents in the index (well, their docID & sort key anyway) and
feeding it to the search(Query query, Collector results) method in the
IndexSearcher. When the call has finished, run your own sort and extract
the top-X results.
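
A skeleton of such a collector (Lucene 4.x API; I assume the sort key is
available as a NumericDocValues field, here called "sortkey"):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

class DumpCollector extends Collector {
    final List<long[]> hits = new ArrayList<long[]>(); // {globalDocID, sortKey}
    private NumericDocValues keys;
    private int docBase;

    public void setScorer(Scorer scorer) { } // scores are not used
    public void setNextReader(AtomicReaderContext context) throws IOException {
        docBase = context.docBase;
        keys = context.reader().getNumericDocValues("sortkey");
    }
    public boolean acceptsDocsOutOfOrder() { return true; }
    public void collect(int doc) {
        hits.add(new long[]{docBase + doc, keys.get(doc)}); // doc is segment-relative
    }
}
// After searcher.search(query, collector), sort hits on the key and slice out the page.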

> Do I need to implement my own sort algorithm or is there a way to do 
> this with Lucene?  If so, which API functions do I need to call?

InPlaceMergeSorter is a nice one to extend. But again, with 50K result
sets, this seems like overkill.

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Toke Eskildsen
On Mon, 2014-06-23 at 13:53 +0200, Jamie wrote:
> if (startIdx==0) {
>topDocs = 
> indexSearcher.search(query,queryFilter,searchResult.getPageSize(), sort);
> } else {
>topDocs = indexSearcher.searchAfter(p.startScoreDoc, query, 
> queryFilter, searchResult.getPageSize(),sort);
> }
> The page size is set to 50,000.

Okay, that was strange. 50K is fine for a heap. How many concurrent
searches are you running?

> What are the best JVM collector settings for Lucene searching? We're 
> tried various options and they don't seem to make much difference.

I am no expert there, but I will advise you to check how much free
memory your JVM has when it is running searches. GC tweaks do not help
much if the JVM is nearly out of memory.

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Toke Eskildsen
On Mon, 2014-06-23 at 13:33 +0200, Jamie wrote:
> While running a search over several million documents, the Yourkit 
> profiler reports a deadlock on the following method. Any ideas?

> search worker <--- Frozen for at least 25m 37 sec
> org.apache.lucene.util.PriorityQueue.downHeap()

My guess is that you are requesting several million documents as your
result set, instead of just the top 10 or top 100. 

The heap implementation used by Lucene does not play well with large
result sets. Performance is bad and it allocates an excessive amount of
objects: Your machine is probably busy garbage collecting. The quick fix
is to allocate more memory for Java.

This is not a fault in the implementation as such, but rather the result
of using a heap for a large result set. If you really need a large
result set, I recommend you create your own collector that collects
everything the most compact way and perform sorting on the full
collection afterwards.

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: search performance

2014-06-03 Thread Toke Eskildsen
[...]
  30,000,000 docs in mean  58,257 ms,   514 docs/ms.
  40,000,000 docs in mean  76,763 ms,   521 docs/ms.

The speedup of the merge sorter relative to PQ increases with the size of the 
collected result. Unfortunately we're still talking minute-class with 60M 
documents. It all points to the conclusion that collecting millions of sorted 
document IDs should be avoided if at all possible. A searchAfter that takes a 
position would either need to use some clever caching or perform the giant 
sorted collection when called.

- Toke Eskildsen
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search performance

2014-06-03 Thread Toke Eskildsen
On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:
> Unfortunately, in this instance, it is a live production system, so we 
> cannot conduct experiments. The number is definitely accurate.
> 
> We have many different systems with a similar load that observe the same 
> performance issue. To my knowledge, the Lucene integration code is 
> fairly well optimized.

It is possible that the extreme slowness is a combination of factors,
but with a bit of luck it will boil down to a single thing. Standard
procedure is to disable features until it performs well, so:

- Disable running updates
- Limit page size
- Limit lookup of returned fields
- Disable highlighting
- Simpler queries
- Whatever else you might think of

At some point along the way I would expect a sharp increase in
performance.

> I've requested access to the indexes so that we can perform further testing.

Great.

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search performance

2014-06-02 Thread Toke Eskildsen
On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:

[200GB, 150M documents]

> With NRT enabled, search speed is roughly 5 minutes on average.
> The server resources are: 
> 2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not
see a hardware upgrade changing that with the fine machine you're using.

What is your search speed if you disable continuous updates?

When you restart the searcher, how long does the first search take?


- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: search time & number of segments

2014-05-20 Thread Toke Eskildsen
De Simone, Alessandro [alessandro.desim...@bvdinfo.com] wrote:
> We have stopped optimizing the index because everybody told us it was a bad 
> idea.
> It makes sense if you think about it. When you reopen the index not all 
> segments must be reopened then you have:
> (1) better reload time
> (2) keep the OS file cache at maximum

(3) Makes it a lot faster to update the index. I find this to be the main 
selling point myself.

> I have never read any warning saying that doing so will have a big impact on 
> performance.

And we're back to the puzzle of why you get so many more I/O operations with 
your 16 segments.


Do you have some typical response times from the optimized index and the 
segmented one, after some hundred or thousand queries has been processed and 
the OS cache is properly warmed?

Can you give us a representative query?

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search time & number of segments

2014-05-20 Thread Toke Eskildsen
On Tue, 2014-05-20 at 15:04 +0200, De Simone, Alessandro wrote:

Toke:
> > Using the calculator, I must admit that it is puzzling that you have
> 2432  / 143 = 17.001 times the amount of seeks with 16 segments.
> 
> Do you have any clue? Is there something I could test?

If your segmented index was markedly larger than the optimized, I would
say you had a lot of redundancy across segments, but this is not the
case.

Alas, someone with better knowledge of Lucene internals will have to
step up.

> I don’t have the budget to change the hardware and it would be
> difficult for me to justify replacing a working hardware just to handle
> the same amount of data :-(

You are changing a system from being heavily optimized towards search to
be balanced between updates and search. There seems to be an assumption
that this will be without a change to hardware requirements, which I
find to be quite optimistic.

> Anyway, I certainly would have noticed a performance hit sooner or later if I 
> had a SSD.

That is trivially true for any hardware. The question is how much scale
an upgrade will buy you. We have been using SSDs in our search servers
since late 2008.

Some observations you might find relevant:
https://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-20 Thread Toke Eskildsen
On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:

Toke:
> Is 20 second an acceptable response time for your users? 
> 
> Shruthi: Its definitely not acceptable. PFA the piece of code that we
> are using..Its taking 20seconds. That’s why I drafted this ticket to
> see where I was going wrong.

Indexing 1000 documents/sec in Lucene is quite common, so even taking
into account large documents, 20 seconds sounds like quite a bit.

> Shruthi: Well,  its two stage process: Client is looking at
> historical data based on a parameters like names, dates,MRN, fields
> etc.. SO the query actually gets the data set fulfilling the
> requirements
> 
> If client is interested in doing a text search then he would pass the
> search phrase on the result set.

So it is not possible for a client to perform a broad phrase search to
start with. And it sounds like your DB-queries are all simple matching?
No complex joins and such? If so, this calls even more for a full
Lucene-index solution, which handles all aspects of the search process.
- Toke Eskildsen, State and University Library, Denmark




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-20 Thread Toke Eskildsen
On Tue, 2014-05-20 at 10:40 +0200, Shruthi wrote:
> Just the indexing took 20 seconds L

That's more than I expected, but it leaves the same question:
Is 20 seconds an acceptable response time for your users?

I don't know your document size, but unless they are very large, the
response times from a full 10M document index will be way better than 20
seconds. Even on a low-RAM machine with spinning drives.

> We are yet to try on 64 bit server to check if that would change
> drastically.

I doubt it will.

Toke:
> RAMDirectory seems a better choice.
> 
> Shruthi : But RAM DIrectory  has bad concurrency on multithreaded
> environments.

I assumed you would be creating a dedicated index for each request,
thereby effectively having single threaded usage for each separate
index.

I just remembered that Lucene has an implementation dedicated to fast
indexing. Take a look at
http://lucene.apache.org/core/4_8_0/memory/org/apache/lucene/index/memory/MemoryIndex.html
It seems like just the thing for your use case.
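
A minimal sketch of how that could look for your flow (Lucene 4.8, lucene-memory
and lucene-queryparser modules; method and field names are mine):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

static boolean phraseMatches(String documentText, String searchPhrase) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
    MemoryIndex index = new MemoryIndex();            // one throw-away in-memory index per document
    index.addField("content", documentText, analyzer);
    Query query = new QueryParser(Version.LUCENE_48, "content", analyzer).parse(searchPhrase);
    return index.search(query) > 0.0f;                // score > 0 means the document matches
}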

> Shruthi : The same user from the same client will not be searching for
> same phrase again unless he has amnesia. This was already discussed
> with our architects.

If your architects base their decisions on observed user behaviour, then
fine. At our library, many users refine their queries, meaning that a
common pattern is 2-4 queries that are very much alike.

> Shruthi:  Actually we have a DB query that runs prior to indexing
> which fetches max. 500 docs from 10million+ docs in NASSHARE. We then
> have to apply search phrase only on the resultant set..So this way
> 
> The set is just limited to 500 -1000.

Frankly, the combination of a pre-selection with a DB query and the
addon of heavy index + search with Lucene seems like the absolute worst
of both worlds.

Does the DB-selector do anything that cannot easily be replicated in
Lucene?

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-20 Thread Toke Eskildsen
On Mon, 2014-05-19 at 12:40 +0200, Shruthi wrote:
> 1.   Client makes a request with a search phrase. Lucene
> application indexes a list of 500 documents(at max. ) and searches the
> phrase on the index constructed.

Fetching from NAS + indexing sounds like something that would take a
second or two. Have you tried this?

> We have decided to use MMapDirectory for above requirement.

As your index data are extremely transient and the datasets small,
RAMDirectory seems a better choice.

You state that you delete the index when the search has finished.
Wouldn't it be better to keep it a couple of minutes? That way further
searches from the same client would be fast.

Overall, I worry about your architecture. It scales badly with the
number of documents/client. You might not have any clients with more
than 500 documents right now, but can you be sure that this will not
change?

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search time & number of segments

2014-05-19 Thread Toke Eskildsen
On Mon, 2014-05-19 at 11:54 +0200, De Simone, Alessandro wrote:

[24GB index, 8GB disk cache, only indexed fields]

> The "IO calls" I was referring to is the number of time the
> "BufferedIndexInput.refill()" function is called. So it means that we
> have 16 times more bytes read when there are 16 segments for the exact
> same result.

Using the calculator, I must admit that it is puzzling that you have
2432  / 143 = 17.001 times the amount of seeks with 16 segments. I would
have expected that number to be smaller than 16, due to pure chance of
data being in the same blocks in some segments.

Is the total file size of the optimized index about the same as the
segmented one?

Toke:
> > I am guessing that you are using spinning drives and that there is not much 
> > RAM in the machine? 
> 
> As you can see we have a lot of RAM.

Not if you're using spinning drives and have no stored fields.
http://wiki.apache.org/solr/SolrPerformanceProblems

While I find the number of seeks to be an interesting problem, I wonder
why you don't just solve the performance problem by throwing hardware at
it? Consumer SSDs are dirt cheap nowadays and even the enterprise ones
are not that pricey. Same goes for RAM as long as we're talking about a
relative small amount such as 32GB.

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: search time & number of segments

2014-05-17 Thread Toke Eskildsen
De Simone, Alessandro [alessandro.desim...@bvdinfo.com] wrote:
> We have a performance issue ever since we stopped optimizing the index. We 
> are using Lucene 4.8 (jvm 32bits for searching, 64bits for indexing) on 
> Windows 2008R2.

How much RAM does your search machine have?

> For instance, a search with (2 termQuery + 1 spanquery) x 6 fields made 143 
> IO calls. Now with 16 segments  we have 2432 IO calls and the search time is 
> really bad.

[...]

That sounds right. Although each segment is 1/16 of the full index size, the 
number of seeks per segment is not 1/16: Larger indexes require relatively 
fewer seeks. Think binary search and log(values_in_field), although that is 
highly simplified.

> The size of the Index is ~24gb (14millions documents). No field are stored, 
> only indexed.

Normally the penalty of running un-optimized is not that great, so it sounds 
like your machine cannot provide the I/O speed it needs (as opposed to having a 
great logistics overhead from the multiple segments). I am guessing that you 
are using spinning drives and that there is not much RAM in the machine? The 
easy solution is either to throw RAM at the problem or switch to SSD.

- Toke Eskildsen
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can RAMDirectory work for gigabyte data which needs refreshing of the index all the time?

2014-05-16 Thread Toke Eskildsen
On Wed, 2014-05-07 at 15:46 +0200, Cheng wrote:
> I have an index of multiple gigabytes which serves 5-10 threads and needs
> refreshing very often. I wonder if RAMDirectory is the good candidate for
> this purpose. If not, what kind of directory is better?

RAMDirectory will probably give you poor performance:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Stick to MMapDirectory: As you are considering using a RAMDirectory,
your index must be smaller than the amount of free RAM, which means that
everything will be fully cached and fast.
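
Opening it is a one-liner, by the way (a sketch assuming Lucene 4.x and a local
index directory):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

static IndexSearcher openSearcher(File indexDir) throws IOException {
    MMapDirectory dir = new MMapDirectory(indexDir); // memory-mapped, shared with the OS disk cache
    DirectoryReader reader = DirectoryReader.open(dir);
    return new IndexSearcher(reader);                // reopen the reader when the index is refreshed
}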

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene: Index Writer to write in multiple file instead make one heavy file

2014-05-13 Thread Toke Eskildsen
Yogesh patel [yogeshpateldai...@gmail.com] wrote:
> I am using lucene 3.0.1. I am writing many documents with lucene
> Indexwriter. But Indexwriter add all documents into file which becomes more
> than 4GB in my case. so can i distribute files or partitioned ?

Normally Lucene does not produce a single large file. I guess you are 
performing an optimize. Don't do that (it is not really recommended anyway) and 
you should have multiple smaller files.

If that was not clear, then please show us the part of your code that handles 
index updates.

- Toke Eskildsen
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Compare scores from multiple indices

2014-04-17 Thread Toke Eskildsen
Sven Teichmann [s.teichm...@s4ip.de] wrote:
> If I understand Lucene's scoring correctly, I am not able to compare the
> scores returned by the same query executed on multiple indices, because
> there are factors affecting scoring which are different in each index.
> Is that right?

Yes and no. If your indices contain the same kind of documents, scores are 
roughly comparable, with the comparability growing with index size (less chance 
of outliers skewing the scores). If your indices are dedicated to different 
resources (e.g. one for physics papers, one for biology etc.), then the scores 
between them will be very poorly comparable.

> If so, what can I do to make the scores from multiple indices comparable?

Wait for https://issues.apache.org/jira/browse/SOLR-1632 or ensure that the 
content (and sizes) of your indices are homogenous.

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Stored fields and OS file caching

2014-04-05 Thread Toke Eskildsen
Vitaly Funstein [vfunst...@gmail.com] wrote:
> It's a bit of a guess on my part, but I did get better write and search
> performance with size <= 2K, as opposed to the default 16K.

For search that sounds plausible as that is very random access heavy and the 
disk cache will contain a larger amount of data actually needed with smaller 
blocks. For writes (assuming Solr writes, which are very bulk-oriented), it 
does not make sense that a smaller block size should be faster. Smaller block 
sizes mean more overhead and lead to more fragmentation, both of which are 
anti-bulk.

The 2K does not always make sense BTW: Newer hard drives use 4K as the smallest 
physical unit: http://en.wikipedia.org/wiki/Disk_sector

- Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Aw: Re: Strange performance of Lucene 4.4.0

2013-09-09 Thread Toke Eskildsen
On Sun, 2013-09-08 at 15:15 +0200, Mirko Sertic wrote:
> I have to check, but my usecase does not require sorting or even
> scoring at all. I still do not get what the difference is...

Please describe how you perform your measurements. How do you ensure
that the index is warmed equally for the two cases?

- Toke


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene handling of duplicate terms

2013-09-05 Thread Toke Eskildsen
On Thu, 2013-09-05 at 09:28 +0200, Kristofer Karlsson wrote:
> For an example, I may have a million documents with just the term "foo" in
> field A, and one particular document with the term "foo" in both field A
> and B, or have two terms "foo" in the same field.
> 
> If I search for "foo foo" I would like to filter out all the documents with
> only one matching term - is this possible?

A bit of creative querying should do it:

For the "only one foo-field"-case, you could do
  (A:foo NOT B:foo) OR (B:foo NOT A:foo)

To avoid two foo's in the same field, you could do
  NOT field:"foo foo"~1000

Combining those we get
  ((A:foo NOT B:foo) OR (B:foo NOT A:foo)) NOT A:"foo foo"~1000 NOT
B:"foo foo"~1000


Or did I misunderstand? Do you want to keep the documents that have at
least two foo's and discard the ones that only have one? That is simpler:
  (A:foo AND B:foo) OR A:"foo foo"~1000 OR B:"foo foo"~1000


This all works under the assumption that you have less than 1000 terms
in each instance of your fields. Adjust accordingly.

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: In memory index (current status in Lucene)

2013-07-02 Thread Toke Eskildsen
On Mon, 2013-07-01 at 16:07 +0200, Emmanuel Espina wrote:
> Just to add to this conversation, I found an interesting link to
> Mike's blog about memory resident indexes (using another virtual
> machine) 
> http://blog.mikemccandless.com/2012/07/lucene-index-in-ram-with-azuls-zing-jvm.html

Testing the Zing with MMapDirectory vs. RAMDirectory would be a great
addition to Mike's blog post.

I wonder if Java's ByteBuffer could be used to make a more GC-friendly
RAMDirectory?

Regards,
Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: why did I build index slower and slower ?

2013-05-13 Thread Toke Eskildsen
On Mon, 2013-05-13 at 05:05 +0200, wgggfiy wrote:
> My situation is that There are 10,000,000 documents, and I Build index every
> 5,000 documents. while *in every build*, I follow these steps:
> IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);   
> 

You skipped the part where you commit and close. If you are optimizing
the index down to a low number of segments, that would explain most of
the slowdown.

Another part is that you open the index for each batch. That takes a bit
of time. If you have frequent batch runs, you might want to switch to a
setup where the index writer is persistent.
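
A rough sketch of such a setup (Lucene 4.0 as in your snippet; directory handling
and names are assumptions):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

class BatchIndexer {
    private final IndexWriter writer;

    BatchIndexer(File indexDir, Analyzer analyzer) throws IOException {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        writer = new IndexWriter(FSDirectory.open(indexDir), iwc); // opened once, reused for all batches
    }
    void indexBatch(Iterable<Document> batch) throws IOException {
        for (Document doc : batch) writer.addDocument(doc);
        writer.commit(); // much cheaper than close() + reopen for every 5,000 documents
    }
    void shutdown() throws IOException { writer.close(); } // only when all batches are done
}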

- Toke Eskildsen, state and University Library, Denmark


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: search-time facetting in Lucene

2013-05-06 Thread Toke Eskildsen
kiwi clive [kiwi_cl...@yahoo.com]:
> Thanks very much for the reply. I see there is not a quick win here but as
> we are going through an index consolidation process, it may pay to make
> the leap to 4.3 and put in facetting while I'm in there. We will get facetting
> slowly through the back door while the consolidation runs (we have 10,000+
> shards). If it were not for the consolidation required, I thin bobo would have
> been the way forward.

I do not know much about Bobo Browse, but it seems that it works directly with 
existing indexes: 
https://linkedin.jira.com/wiki/display/BOBO/Create+a+Browse+Index

The implicit requirement is that the values for your facet fields are already 
indexed so that the analyzed content fits your faceting requirements.

- Toke Eskildsen
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how do I paginate Lucene search results deeply

2013-03-14 Thread Toke Eskildsen
On Thu, 2013-03-14 at 11:03 +0100, Toke Eskildsen wrote:
>   (timestamp_in_ms << 10) & counter++

This should be

  (timestamp_in_ms << 10) | counter++



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how do I paginate Lucene search results deeply

2013-03-14 Thread Toke Eskildsen
On Thu, 2013-03-14 at 04:11 +0100, dizh wrote:
> each document has a timestamp identify the time which it is indexed, I
> want search the documents using sort, the sort field is the timestamp,

[...]

> but when you do paging, for example in a web app , the user want to go
> to the last 4980-500, well, it is slowly...

Yes. The problem is that it performs a sliding window search with a
window size of page+topX and that does not work well with 5M entries,
especially not as it uses a heap, which works very well for small windows
but horribly for large windows.

> I have a large number of Log4J logs, and I want to index them and
> present them using web ui. 

I still don't see why you would want to page to 5M, but okay.

Instead of representing the timestamps directly, convert them to unique
longs when indexing. Guessing that you always have less than 1000 log
entries/ms, your long would be 
  (timestamp_in_ms << 10) & counter++
where the counter is set to 0 each time a different timestamp is
encountered. This also ensures that the order of your log entries is
preserved. Let's call the modified timestamps utime.

When you do a paginated search for 20 results, keep track of the last
utime. When you request the next page, add a NumericRangeFilter going
from the last utime (non-inclusive) with no upper limit and ask for the
top-20 results again.
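
A rough sketch of both parts (Lucene 4.x API assumed; the field name "utime" is
mine, doc/searcher/query/lastUtime are assumed to exist, and note the '|' from
the correction above):

// Index time: one extra numeric field per log entry
long utime = (timestampInMs << 10) | counter; // counter restarts at 0 for each new timestamp
doc.add(new LongField("utime", utime, Field.Store.NO));

// Search time: page forward from the last seen utime (exclusive), 20 hits at a time
Filter afterLast = NumericRangeFilter.newLongRange("utime", lastUtime, null, false, true);
Sort byUtime = new Sort(new SortField("utime", SortField.Type.LONG));
TopDocs nextPage = searcher.search(query, afterLast, 20, byUtime);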


NB: Please get rid of the garbage that follows each of your posts on
this mail list. The Confidentiality Notice has negative value here.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RAM or SSD...

2012-07-18 Thread Toke Eskildsen
On Wed, 2012-07-18 at 17:50 +0200, Dragon Fly wrote:
> If I want to improve performance, which of the following is better and why?
> 
> 1. Buy a machine with a lot of RAM and use a RAMDirectory for the index.

As others have pointed out, MMapDirectory should work better than
RAMDirectory. I am sure it will work fine with a relative small index
such as yours. However, it does not scale that well with index size.

> 2. Put the index on a solid state drive.

Why anyone buys computers without SSD's is a mystery to me. Use SSDs for
the small low-latency stuff and a secondary spinning drive for the large
slow stuff. Nowadays, a 30GB index (or 100GB for that matter) falls into
the small low-latency bucket. SSDs speed up almost everything, save
RAM and spare a lot of work hours optimizing I/O-speed.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Sort runs out of memory

2012-05-21 Thread Toke Eskildsen
On Thu, 2012-05-17 at 23:03 +0200, Robert Bart wrote:
> I am running Lucene 3.6 in a system that indexes about 4 billion documents
> across several indexes, and I'm hoping to get documents in order of a
> certain NumericField.

What is the maximum size on any single index, in terms of number of
documents? What is the type of the NumericField?

> I've tried using Lucene's Sort implementation, but it looks like it tries
> to do the entire sort in memory by allocating a huge array with space for
> every document in the index.

The FieldCache allocates an array of length #documents of the same
type as your NumericField. The sort itself is of the sliding window
type, meaning that it only takes up memory linear in the number of
documents wanted in the response. Do you require millions of documents
to be returned as part of a search?

Sanity check: You do specify the type when performing a sorted search,
right? If not, the values will be treated as Strings.

>  On my index, this quickly runs out of memory.

Assuming that your largest index is 1B documents and that your
NumericField is of type Integer, the FieldCache's values for the sort
should take up 1B * 4 = 4GB. Are you hoping for less?

> Are there any alternatives or better ways of getting documents in order of
> a NumericField for a very large index?

Be sure to select the type of NumericField to be as small as possible.
If you have few unique sort values (e.g. 17, 80, 2000 and 5678), you
might map them down (to 0, 1, 2 and 3 for this example) and store them
as a byte.

Currently Lucene only supports atomic types for numerics in the
FieldCache, so the smallest one is byte. It is possible to use only
ceil(log2(#unique_values)) bits/document, although that requires a bit
of custom coding.
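A rough sketch of the mapping idea (Lucene 3.x-style API; the field name
"price_ord" and the value set are made up for illustration):

  // Known sort values, mapped to ordinals 0..3 via their position
  private static final long[] KNOWN_VALUES = {17, 80, 2000, 5678}; // must be sorted

  // At index time (assumes rawValue is always one of the known values)
  int ordinal = Arrays.binarySearch(KNOWN_VALUES, rawValue);
  doc.add(new Field("price_ord", Integer.toString(ordinal),
                    Field.Store.NO, Field.Index.NOT_ANALYZED));

  // At search time: the FieldCache then only needs a byte per document
  Sort byOrdinal = new Sort(new SortField("price_ord", SortField.BYTE));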

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Extracting all documents for a given search

2011-09-19 Thread Toke Eskildsen
On Sat, 2011-09-17 at 03:57 +0200, Charlie Hubbard wrote:
>  I really just want to be called back when a new document is found by the
> searcher, and I can load the Document, find my object, and drop that to a
> file.  I thought that's essentially what a Collector is, being an interface
> that is called back whenever it encounters a Document that matches a query.

That is correct. It is a simple API so implementing your own
ZIPCollector seems to be the best solution. Partly copied from Trejkaz:

class ZIPCollector extends Collector {
  private int docBase; // offset of the current index segment
  public void setScorer(Scorer scorer) { } // scores are not needed here
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;
  }
  public boolean acceptsDocsOutOfOrder() { return true; }
  public void collect(int doc) throws IOException {
    addZIPContent(resolveDocumentContent(docBase + doc));
  }
}

ZIPCollector collector = new ZIPCollector();
getSearcher().search(query, filter, collector);



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question on the increase in the index space for larger indexes

2011-09-07 Thread Toke Eskildsen
On Tue, 2011-09-06 at 17:32 +0200, Saurabh Gokhale wrote:
> Then I saw index size started exponentially increasing and by the end of 1
> year worth of data processing, I was expecting the index to be 60 to 70 GB
> but the size grew to more than 120GB.
> 
> 1. Is it an expected behavior?

No, quite the opposite in fact. Recurring terms will only be stored once
(for each segment) so normal behavior is that the index gets smaller,
relative to the number of documents. Worst case is that it grows linearly
with the number of documents. Of course that only holds if your documents
are similar, which seems not to be the case for your corpus.

There might be another explanation though: If you measure index size by
summing file sizes in the index folder while the index is being built,
you might have done it during a merge: When the index writer collapses
segments, temporary storage space is used. If you want to be sure about
the size, you need to stop indexing while you measure.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Memory issues

2011-09-05 Thread Toke Eskildsen
On Sat, 2011-09-03 at 20:09 +0200, Michael Bell wrote:
> To be exact, there are about 300 million documents. This is running on a 64 
> bit JVM/64 bit OS with 24 GB(!) RAM allocated.

How much memory is allocated to the JVM?

> Now, their searches are working fine IF you do not SORT the results. If you 
> do SORT, you get stuff like
> 
> 2011-08-30 13:01:31,489 [TP-Processor8] ERROR 
> com.gwava.utils.ServerErrorHandlerStrategy - reportError: nastybadthing :: 
> com.gwava.indexing.lucene.internal.LuceneSearchController.performSearchOperation:229
>  :: EXCEPTION : java.lang.OutOfMemoryError: Requested array size exceeds VM 
> limit java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>  at 
> org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:624)
[...]

> Looking at the sort class, the api docs appear to say it would create an 
> element of 1.2 billion items (4*300m). 

The StringIndexCache in Lucene 3 keeps two arrays in memory: int[#docs]
and String[#docs+1]. With 300M documents that is 1.2 billion bytes for
the int-array, which should not be a problem for the machine.

Unfortunately the String-array is a big problem. Keeping in mind that a
String in Java takes up approximately 50 + 2 * length bytes and setting
the average length of the terms to 10 chars, the array takes up a
maximum of 300M * (50 + 2 * 10) byte = 21,000 MByte or about 20 GByte.

In reality it is not that bad as duplicates only count once, but the
problem should be obvious.

> Is this correct? Is the issue going beyond signed int32 limits of an array ( 
> 2 billion items) or is it really a memory issue? How best to diagnose?

Open your index with Luke and count the number of unique terms for your
sort field. Using the formula above, you'll get an estimate of the
memory required for sorting on String in Lucene 3.

The int32 limit is only for the number of unique terms and there is a
maximum of one term/document when sorting. With 300M documents there's a
lot of room before that will be a problem.

If your field is numeric, changing the sort type should solve your
problem. If you really are comparing Strings, it is not so easy.

Lucene 4 is unfortunately not ready for production, but it has huge
improvements with regard to memory usage on sorting.

If you are feeling adventurous, you can take a look at
https://issues.apache.org/jira/browse/LUCENE-2369
which drastically reduces the memory needed for sorting. An experiment
with 200M unique terms required 1.7 GByte with the trade-off that it
took 8 minutes to open the index. One of the earlier patches works
against Lucene 3, while the later ones are Lucene 4 only.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SSD Experience (on developer machine)

2011-08-26 Thread Toke Eskildsen
On Wed, 2011-08-24 at 11:46 +0200, David Nemeskey wrote:
> Theoretically, in the case described above, it would be possible to move 
> 'static' data (data of cells that have not been written to for a long time) 
> to 
> the 5GB in question and use the 'fresher' cells as free space; this could be 
> done in a round-robin fashion.

A fine idea. Of course it is not guaranteed that data remains static,
but the probability is high.

> Do SSDs (or some one them) implement a similar 
> functionality? Or alternatively, are there tools that do this?

I am sorry, but I have no idea if this is the case.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SSD Experience (on developer machine)

2011-08-26 Thread Toke Eskildsen
On Wed, 2011-08-24 at 13:42 +0200, Federico Fissore wrote:
> I add a question. Toke you said that "the current state of wear can be 
> queried". How?

With a S.M.A.R.T.-tool, preferably up-to-date to get it to display the
vendor-specific properties in an easy to understand manner.

On my Ubuntu-box with a 160GB Intel X25-M G2, 
sudo smartctl -A /dev/sdb1
gives me (abbreviated by me)
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE
UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours  [...]   9372
 12 Power_Cycle_Count   [...] 50
225 Host_Writes_Count   [...] 176888
233 Media_Wearout_Indicator [...]  0
and 
http://forums.storagereview.com/index.php/topic/28649-intel-x25-interpreting-smart-values/
tells me that I'll have to divide the raw value from #225 by 29.8,
resulting in 5.8TB written in total (about 15GB/day).

Some vendors provide specific software that makes it much easier.
Intel makes the Intel SSD Toolbox, which unfortunately is Windows only.

> AFAIK, cells target for a write are chosen just randomly between the 
> free ones, ignoring other factors

That would be a very bad wear-leveling strategy. Keeping a counter for
each cell and selecting the free cell with the lowest count is trivial.
However, given the bumpy road to great SSDs, I am sure that some vendors
have done it this way.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
On Tue, 2011-08-23 at 17:56 +0200, Federico Fissore wrote:
> Great reply, thank you. Will re-read it and re-evaluate my position

Thanks for having an open mind.

Toke:
> > Let's say you have a drive with just 5GB left. Let's say that the cells
> > can handle 10,000 writes. Doing constant rewrites of the 5GB gives you
> > 10,000 * 5GB = 50TB before the drive gives up.
> [...]
>
> 50TB before _every single cell in the drive_ gives up. You will change 
> the drive much sooner, probably at the first two occasions of corrupted 
> data.

50TB for the 5GB of cells. The rest of the cells will be fine since
they've only ever been written to once in this artificial
destroy-the-drive test case. Of course that still leaves the drive
unusable. But oh well, I am firmly in the "this is so hot that I will
accept an astonishing amount of insane"-camp, so I tend to focus a lot on
the positive things about SSDs.

As for corrupted data then at least some SSDs (Intel comes to mind) will
just stop accepting new data if they ever reach the point of being worn
out. They can still be read. The really great thing, seen from a server
perspective, is that the current state of wear can be queried: If it is
used for some special setup which requires an insane amount of small
writes, then it is a great help for the admin to be warned about impending
death.

Of course nothing guarantees that everything works as advertised. This
is where some real statistics on models and recalls would come in handy.

> fede

Heh. I'm sorry, but in Danish "fede" means "fatty". On the other hand, I
also know what "Toke" means in English.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
On Tue, 2011-08-23 at 17:20 +0200, Paul Libbrecht wrote:
> Funnily, I had such an experience: an SSD on the laptop of the brand SanDisk, 
> guaranteed for 80 TB of writes.
> Well, I had it twice changed under guarantee. Then the shop provided me an 
> OCZ.
> Maybe that lasts longer... I'm still in guarantee.

Do you know if it was wear that broke the drive or some other defect?
There seems to be a strong tendency to attribute failing SSDs to wear,
while there can be a number of reasons. I am not claiming that SSDs are
unfailable, but I do claim that the fear of wearing them out is
unfounded.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
On Tue, 2011-08-23 at 16:10 +0200, Federico Fissore wrote:

[Toke: Re-writes is not a problem now]
> Maybe this still is a point, thinking at how easy is today to fill your 
> local storage: for example, a "common" user will store video files.

It is only a problem if the SSD is filled to the brim (and doesn't have
hidden cells to counter the problem). If you fill it to the brim, you will
have problems working actively with the device - temporary files,
logging and whatnot tend to require that a non-trivial amount of
storage is free. If you are not working actively with the device, the
wear on the cells is not a problem. This brings us back to my initial
point: Yes, you can construct cases where there will be problems. But
they tend to be artificial:

Let's say you have a drive with just 5GB left. Let's say that the cells
can handle 10,000 writes. Doing constant rewrites of the 5GB gives you
10,000 * 5GB = 50TB before the drive gives up. I asked my drive about my
daily write average some time ago. It was 13GB/day. With that scenario,
the drive would live 10+ years.

Admittedly this is just back-of-the-envelope and it ignores a lot of
factors, but it does provide an idea of the amount of punishment they
can take.

> I don't because that one and other articles have scared me (and here 
> definitely fear = lack of information)

I suggest AnandTech. They provide some excellent articles where they do
in-depth analysis and cut through many of the misconceptions as well as
hyperbole that has surrounded SSDs.

> How long past that point do you think we are? Can you give some minimum 
> "model" age? Say, OCZ Vertex since 2 and Intel since 320 ?

I consider the Intel X25 kind of a turning point in the history of SSDs.
That drive provided most of the features that modern SSDs have. Later
drives added better bulk speed, better maximum latency and better
TRIMming. Nice things but not as game-changing as the introduction of a
(relatively) cheap, reliable, wear-leveling drive with high performance
for both reads & writes.

[Toke: Use the SSD for tmp files and swap]

> ok for the swap speed, but in using the ssd with swap and temp files 
> enabled, you are saying the opposite of articles around such as
> http://www.howtogeek.com/62761/how-to-tweak-your-ssd-in-ubuntu-for-better-performance/

The OCZ Onyx that they test is a pretty old drive, but ignoring that,
they do make statements such as "You can help increase the life of your
SSD by reducing how much the OS write to disk" which is technically
correct, but of no real value as I argued above.

They disable atime which I find is a fine idea, but since the OCZ sucks
at random writes (relative to other SSDs) I guess they gain a fair
deal of performance there. 

They put tmp in RAM without any explanation (although it fits well with
the atime-thing), but it does not matter since none of their tests use
tmp. For that matter, none of their tests use swap either. They might
claim that their article is based on testing but that is only true for a
subset of their tweaks. The wear due to tmp/noatime is just a claim they
make, without any explanation or calculations.

> btw, if less free disk space = more destructive scenario, then the 
> bigger the safer, and here the price/size ratio suggest a conservative 
> use of SSD. Mine is 120GB and is 60% filled and I'd like not to go 
> beyond that point, to avoid surprises

With 10,000 writes per cell you've got around 720 TB of writes. That is
200GB/day for the next 10 years. I would suggest checking with a S.M.A.R.T.
tool to see if it provides you with write statistics. I would be surprised if
they were that high.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
On Tue, 2011-08-23 at 14:07 +0200, Marvin Humphrey wrote:
> I'm a little confused.  What do you mean by a "full to-hardware flush"
> and how is that different from the sync()/fsync() calls that Lucene
> makes by default on each IndexWriter commit()?  

A standard flush from the operating system flushes all OS write caches
to storage. That does not guarantee that the storage hardware has
flushed to persistent storage - it can still be in RAM on the device.
If the operating system flushes 1000 times/second, the SSD might still
only perform a single write to a solid state cell.

Now, running an important database with a lot of small updates, the "not
sure if it is really written"-approach might not be what the admin
wants. Disabling the on-SSD write cache ensures that all writes are
flushed. Since SSDs writes in blocks and not individual bits, this means
that a block will be written for each write from the OS. Couple that
with thousands of writes/second and the expected life of the SSD drops
drastically.

(one obvious solution to me is to buy an SSD with a battery that
guarantees that the SSD-cache can be flushed in case of power failure,
but then again I am not a database admin so there might be problems with
this approach)

All of this is not a problem when building Lucene indexes. It is just a
standard flush and even if the SSD disk cache is disabled, we're talking
about "real" updates with substantial data, not just single bits
flipping, so the wear on the cells will be in the same ballpark as
standard SSD use.

NB: Sorry about the private email at first. I pressed the wrong button.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SSD Experience (on developer machine)

2011-08-23 Thread Toke Eskildsen
On Tue, 2011-08-23 at 11:52 +0200, Federico Fissore wrote:
> we are probably running out of topic here, but for the record, there is 
> also someone lamenting about ssd

I find all of this highly on-topic. SSD reliability is an important
issue. We use customer-grade SSDs (Intel 510 were the latest ones
bought) in our servers as we see no point in enterprise-grade
reliability when we are mirroring machines.

> http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html
> 
> the underlying point is correct: SSD offer much less re-writes of the 
> same "sector" than disk based

Please, can't we kill this misconception once and for all?

Yes, the first generation of SSDs had bad wear-leveling and there has
been some exceptionally bad eggs along the way, but we're long past that
point now. All brand name SSDs use wear leveling and unless you set up
pathological destruction cases (fill the drive to 99% and keep
re-writing the last 1% ) the drive will be obsolete before it wears out.

What kills modern SSDs is either non-rewrite-related errors or server
use that requires a full to-hardware flush after every small change.
Even the author of the article you link to does not blame the failures
on re-writes.

Regarding that, it would be nice to have an analysis of SSD failure rates
that isn't just anecdotal. I'm certainly interested.

> so, as far as developer machines are involved, you should go for OSes 
> that use the disk efficiently [...]

Efficiently as in speed, yes. Efficiently as in minimizing writes, no. On the
contrary, disk swapping is much faster on SSDs along with temporary
files and all the other secondary writes that are done throughout the
day. Hit them hard. They're designed for it.

And backup? Why yes, we all do that anyway. Right?


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SSD Experience

2011-08-23 Thread Toke Eskildsen
On Tue, 2011-08-23 at 10:23 +0200, Dawid Weiss wrote:
> This one is humorous (watch for foul language though). It does get to 
> the point, however, and Bergman is a clever guy:
>
http://www.livestream.com/oreillyconfs/video?clipId=pla_3beec3a2-54f5-4a19-8aaf-35a839b6ecaa

We installed SSDs in all developer machines in 2009 (Intel X25) and
haven't looked back. They are now default for new laptops, most desktops
and all search server machines (okay, except for a rusty old test
machine, but that is mostly used for index building).

I think one of the important points that Bergman states is that SSDs are
cheaper than harddrives, if we're looking at performance instead of disk
space. The gut reaction is "but they are so expensive" but when we're
comparing to the alternative of RAIDing and bulking up on RAM, this
often turns out to be wrong. For developer machines, the advantage of a
more responsive machine is obvious.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SSD Experience

2011-08-23 Thread Toke Eskildsen
On Mon, 2011-08-22 at 18:49 +0200, Rich Cariens wrote:
> Does anyone have any experiences or stories they can share about how SSDs
> impacted search performance for better or worse?

Our measurements are getting old, but since spinning disks haven't
improved and SSDs have improved substantially since then, the conclusion
should be about the same: Buy SSDs. They offer performance rivaling RAM
with the obvious advantage of being persistent in case of power loss.

Details at http://wiki.statsbiblioteket.dk/summa/Hardware


Just be sure that IO really is your bottleneck before you improve on
that part.





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: deleting 8,000,000 indexes takes forever!!!! any solution to this...

2011-07-06 Thread Toke Eskildsen
On Tue, 2011-07-05 at 17:50 +0200, Hiller, Dean x66079 wrote:
> We are using a sort of nosql environment and deleting 200 gig on one machine 
> from the database is fast, but then we go and delete 5 gigs of indexes that 
> were created and it takes forever

8 million indexes is at a minimum 16 (24?) million files. If you are
using a conventional harddisk for that, then yes, it takes forever.
SSD is the answer to that problem, but then again, SSD is the answer to
most IO-performance problems.

Just a quick sanity check: I hope you are not storing the individual
index folders under the same root folder? If you have
indexes/index001/
indexes/index002/
indexes/index003/
...
indexes/index800/
in the same folder, you are asking for trouble since most file systems
don't perform well with folders with millions of entries. If that is the
case, split them into sub folders for every X orders of magnitude, such as
indexes/000/000/index001/
indexes/000/000/index002/
indexes/000/000/index003/
...
indexes/000/500/index051/
...
indexes/008/000/index800/


Having that many tiny indexes sets off an alarm bell for me. That's
quite a special use of Lucene you've got going.

> Is there any option in lucene to make it so it uses LARGER files and less 
> count of files so it is easier to maintain and wipe out an index much faster?

Use compound files, optimize to single segment. If I understand
correctly, your indexes are tiny, so this should not give any noticeable
performance hit.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: distributing the indexing process

2011-06-30 Thread Toke Eskildsen
On Thu, 2011-06-30 at 11:45 +0200, Guru Chandar wrote:
>  Thanks for the response. The documents are all distinct. My (limited)
>  understanding on partitioning the indexes will lead to results being
>  different from the case where you have all in one partition, due to
>  Lucene currently not supporting distributed idf. Is this correct?

Yes, that is a prime blocker for proper distributed search with Lucene.

>  Is there a way to make it work seamlessly?

There's some work being done with Solr, but it is not stable:
https://issues.apache.org/jira/browse/SOLR-1632

We're experimenting with distributed idf by assigning different weights
to the queries sent to different searchers, based on term statistics
from the different indexes. However, it is quite a hack and one we've
done because one of the indexes is external and out of our control.


3 years ago (or was it 4?) we also used distributed indexing with
searching being done on a single merged index. It worked surprisingly
well, but it was replaced by indexing on a single machine, when we
finally got around to doing a proper profiling of our indexing process
and removed some serious bottlenecks.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: field sorted searches with unbounded hit count

2011-06-23 Thread Toke Eskildsen
On Thu, 2011-06-23 at 22:41 +0200, Tim Eck wrote:
>  I don't want to accuse anyone of bad code but always preallocating a 
>  potentially large array in org.apache.lucene.util.PriorityQueue seems
>  non-ideal for the search I want to run.

The current implementation of IndexSearcher uses threaded search where
each slice collects docID's independently, then adds them to a shared
PriorityQueue one at a time. With this architecture, making the
PriorityQueue size-optimized would either require multiple resizings
(more GC activity, slightly more processing) or that all search-threads
finish before constructing the queue (longer response time).

The current implementation works really well when requesting small
result sets. It is not so fine for larger sets (partly because of memory
allocation, partly because the standard heap-based priority queue has
horrible locality, making it perform rather badly when it cannot be
contained in the cache) and - as you have observed - really badly for the
full document set. Finding a better general solution that covers all
three cases is a real challenge, a very interesting one I might add.
Of course one can always special case, but using a Collector seems like
the way to go there.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Boosting a document at query time, based on a field value/range

2011-06-15 Thread Toke Eskildsen
On Wed, 2011-06-15 at 11:22 +0200, Sowmya V.B. wrote:
> [...] "OR **field**:[20 TO 30]^10"
> 
> Well, my question is partly answered with this clarification. But, I am
> still wondering how to do that programmatically.
> the (20-30) range is not a fixed range. Its chosen by the user. It can as
> well be (12-34) too. I am not able to figure out if there is any function in
> the searcher classs, which will enable me give these specifications
> ...something like... a setboost(), which exists during index time.

The boost is something you do with your query, before it is issued to
the searcher.

If you use the query parser, you can provide the additional query
parameters by concatenating them to the standard user query:
String fullQuery = userQuery
  + " OR myrange:[" + from + " TO " + to + "]^" + myBoost;

If you build your Query by code, you can use ConstantScoreRangeQuery or
RangeQuery for the range part, where you can call setBoost(float).
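A hedged sketch of the programmatic version (Lucene 3.x-style API, assuming
the range field is indexed as a NumericField; "myrange" and the boost value
are illustrative). Both clauses are SHOULD to mirror the textual "OR ... ^10"
form:

  Query range = NumericRangeQuery.newIntRange("myrange", from, to, true, true);
  range.setBoost(10f); // the ^10 part

  BooleanQuery combined = new BooleanQuery();
  combined.add(userQuery, BooleanClause.Occur.SHOULD);
  combined.add(range, BooleanClause.Occur.SHOULD);
  TopDocs hits = searcher.search(combined, 10);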

- Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index size and performance degradation

2011-06-14 Thread Toke Eskildsen
On Sun, 2011-06-12 at 10:10 +0200, Itamar Syn-Hershko wrote:
> The whole point of my question was to find out if and how to make 
> balancing on the SAME machine. Apparently thats not going to help and at 
> a certain point we will just have to prompt the user to buy more hardware...

It really depends on your scenario. If you have few concurrent requests
and are looking to minimize latency, sharding might help; assuming you
have fast IO and multiple cores. You basically want to saturate all
available resources for all requests.

On the other hand, if throughput is the issue, sharding on a single
machine is counter-productive due to increased duplication and merging.

> Out of curiosity, isn't there anything that we can do to avoid that? for 
> instance using memory-mapped files for the indexes? anything that would 
> help us overcome OS limitations of that sort...

One standard piece of advice for speeding up searches is using SSDs. Our
(admittedly old) experiments put SSD performance near RAM. With the
prices we have now, SSD's seems like an obvious choice for most setups.

We tried a few performance tests at different index sizes and for us,
index size vs. performance looked like a power law: heavy performance
degradation in the beginning, less later. It makes sense when we look at
caching and it means that if you do not require stellar performance, you
can have very large indexes on few machines (cue Hathi Trust).

- Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Boosting a document at query time, based on a field value/range

2011-06-10 Thread Toke Eskildsen
On Fri, 2011-06-10 at 10:38 +0200, Sowmya V.B. wrote:
> I am looking for a possibility of boosting a given document at query-time,
> based on the values of a particular field : instead of plainly sorting the
> normal lucene results based on this field.

I think you misunderstand Eric's answer, as his suggestion does exactly
what you ask for. Have you tried the "OR *field*:[20 TO 30]^10" method?


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RAMDirectory doesn't win over FSDirectory all the time, why?

2011-06-07 Thread Toke Eskildsen
On Mon, 2011-06-06 at 15:29 +0200, zhoucheng2008 wrote:
> I read the lucene in action book and just tested the
> FSversusRAMDirectoryTest.java with the following uncommented:
> [...]Here is the output:
> 
> RAMDirectory Time: 805 ms
> 
> FSDirectory Time : 728 ms

This is the code, right?
http://java.codefetch.com/example/in/LuceneInAction/src/lia/indexing/FSversusRAMDirectoryTest.java

The test is problematic as the two tests run sequentially in the same JVM.

If you change 
  long ramTiming = timeIndexWriter(ramDir);
  long fsTiming = timeIndexWriter(fsDir);
to
  long fsTiming = timeIndexWriter(fsDir);
  long ramTiming = timeIndexWriter(ramDir); 
my guess is that RAMDirectory will be faster. For a better
comparison, perform each test in separate runs (make a test
class just for RAMDirectory and one just for FSDirectory,
then run them one at a time, each in its own JVM).

One big problem when comparing RAMDirectory to file-access
is caching. What you measure with a test might not be what
you see in production, as the production index might be
large compared to RAM available for file caching.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Federated relevance ranking

2011-06-06 Thread Toke Eskildsen
On Thu, 2011-06-02 at 21:51 +0200, Clint Gilbert wrote:
> We're also considering a home-grown scheme involving normalizing the
> denominators of all the index components in all our indices, based on
> the sums of counts obtained from all the indices.  This feels like
> re-inventing the wheel, and it's not clear to me yet that the low-level
> manipulation of indices that we'd need to do is even possible.

We're currently experimenting with this approach, albeit only for two
searchers. Since we have very little control over the secondary searcher,
just a basic search-API, we're really hacking and performing a query
rewrite based on term statistics. This only works for basic term queries
(no wildcards, ranges etc.), but fortunately our search logs show that
they are by far the most common.

The math is not too bad: Extract occurrence counts for the terms, sum
them, calculate the difference when sending a request to a specific
searcher and set a term boost in the textual query, so that the standard
ranking formula in Lucene will yield the desired score.
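A very rough sketch of the idea (not our actual code; it ignores queryNorm
and coord, and the local/remote statistics are assumed to be fetched
separately):

  DefaultSimilarity sim = new DefaultSimilarity();
  int globalDocs = localDocs + remoteDocs; // summed corpus sizes
  int globalDf = localDf + remoteDf;       // summed docFreq for the term
  float globalIdf = sim.idf(globalDf, globalDocs);
  float remoteIdf = sim.idf(remoteDf, remoteDocs);
  // idf enters Lucene's practical scoring squared, hence the squared ratio
  float ratio = globalIdf / remoteIdf;
  String rewritten = field + ":" + term + "^" + (ratio * ratio);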


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing speed on NTFS

2011-05-31 Thread Toke Eskildsen
On Tue, 2011-05-31 at 08:52 +0200, Maciej Klimczuk wrote:
> I did some testing with the 3.1.0 demo on Windows and encountered some strange
> behaviour. I tried to index ~60,000 small text documents using the demo.
> - First trial took about 18 minutes.
> - Second and third trial took about 2 minutes.

First trial sounds strange, even if the documents are single files and
on a traditional harddisk. The 500 documents/second in the subsequent
trials sounds okay for small documents.

> [...] I repeated this with 30MB, 60MB and 100MB, but all the time I
> aborted the process and removed the index, it was recreated to the
> previous size in a matter of tens of seconds (less than a minute),
> and after that it was growing slowly.

It seems like you have extremely slow read access from your storage and
a small enough data set so that the generated index is still in the
write buffer.

Are you perhaps using Windows XP? It drops back to PIO-mode under some
circumstances and it really hurts performance. You can read about it at
http://winhlp.com/node/10

> If there is a document or site explaining this or it was asked before,  
> please forgive me; just searching about Lucene indexing performance on  
> NTFS doesn't help me much...

There should not be any problems like the one you describe with NTFS.
I have used Windows XP with NTFS myself for a year or two and did not
encounter anything like it.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Sharding Techniques

2011-05-13 Thread Toke Eskildsen
On Fri, 2011-05-13 at 12:11 +0200, Samarendra Pratap wrote:
> Comparison between - single index Vs 21 indexes
> Total Size - 18 GB
> Queries run - 500
> % improvement - roughly 18%

I was expecting a lot more. Could you test whether this is an IO-issue
by selecting a slow query and performing the exact same search with that
query several times? The OS cache should kick in quickly so that the
storage isn't touched at all on the later searches.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Sharding Techniques

2011-05-10 Thread Toke Eskildsen
On Mon, 2011-05-09 at 13:56 +0200, Samarendra Pratap wrote:
>  We have an index directory of 30 GB which is divided into 3 subdirectories
> (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> (idx1-1, idx1-2, ..., idx2-1, ..., idx3-1, ..., idx3-21).

So each part is about ½ GB in size? That gives you a serious logistic
overhead. You state later that you only update the index once a day, so
it would seem that you have no need for the fast update times that such
small indexes give you. My guess is that you will get faster search
times by using a single index.


Down to basics, Lucene searches work by locating terms and resolving
documents from them. For standard term queries, a term is located by a
process akin to binary search. That means that it uses log(n) seeks to
get the term. Let's say you have 10M terms in your corpus. If you stored
that in a single field in a single index with a single segment, it would
take log(10M) ~= 24 seeks to locate a term. This is of course very
simplified.

When you have 63 indexes, log(n) works against you. Even with the
unrealistic assumption that the 10M terms are evenly distributed and
without duplicates, the number of seeks for a search that hits all parts
will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
begun to estimate the merging part.

Due to caching, a seek is not equal to the storage being hit, but the
probability for a storage hit rises with the number of seeks and the
inevitable term duplicates when splitting the index.

> We have almost 40 fields in each index (is it a bad to have so many
> fields?). most of them are id based fields.

Nah, our index is about 40GB with 100+ fields and 8M documents. We use a
single index, optimized to 5 segments. Response times for raw searches
are a few ms, while response times for the full package (heavy faceting)
are generally below 300ms. Our queries are mostly simple boolean queries
across 13 fields.

> Keeping parts of indexes on different servers search on all of them and then
> merging the results - what could be the best approach?

Locate your bottleneck. Some well-placed log statements or a quick peek
with visualvm (comes with the Oracle JVM) should help a lot.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene search result produced wrong result (due to java Collation)?

2011-02-28 Thread Toke Eskildsen
On Mon, 2011-02-28 at 22:44 +0100, Zhang, Lisheng wrote:
> Very sorry I made a typo, what I meant to say is that lucene sort produced 
> wrong
> result in English names (String ASC):
> 
> liu yu
> l yy

The standard Java Collator ignores whitespace. It can be hacked, but you
will have to write your own implementation to get Lucene to sort in the
desired way. FieldComparatorSource is a good place to start.

A code snippet demonstrating the Collator-hack:

public void testJavaStandardCollator() throws Exception {
    java.text.Collator javaC =
            java.text.Collator.getInstance(new Locale("EN"));
    assertTrue("Spaces should be ignored per default",
               javaC.compare("liu yu", "l yy") < 0);

    java.text.RuleBasedCollator adjustedC = new java.text.RuleBasedCollator(
            ((java.text.RuleBasedCollator) javaC).getRules().
                    replace("<'\u005f'", "<' '<'\u005f'"));
    assertTrue("Spaces should be significant inside strings after adjust",
               adjustedC.compare("liu yu", "l yy") > 0);
}


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Scale out design patterns

2011-02-03 Thread Toke Eskildsen
On Fri, 2011-02-04 at 05:54 +0100, Ganesh wrote:
> 2. Consider a scenario I am sharding based on the User, I am having single 
> search server and It is handling 1000 members. Now as the memory consumption 
> is high,  I have added one more search server. New users could access the 
> second server but what about the old users, their data will be still added to 
> the server1. How to address this issue. Is rebuilding the index the only way.

You can move old users by reindexing their data on the new server and
deleting them from the old one. That is only a partial modification, not a
full rebuild.

If you are about to move a whole lot of users, you can copy the old
index, delete all documents from the copy, except the ones that are to
be moved, then merge the pruned index with the new one.
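A hedged sketch of that copy-prune-merge approach (Lucene 3.x-style API;
the "user" field and variable names are illustrative):

  IndexWriter pruner = new IndexWriter(copyOfServer1Index, analyzer,
      false, IndexWriter.MaxFieldLength.UNLIMITED);
  for (String userId : usersStayingOnServer1) {
    pruner.deleteDocuments(new Term("user", userId)); // keep only the movers
  }
  pruner.close();

  IndexWriter target = new IndexWriter(server2Index, analyzer,
      false, IndexWriter.MaxFieldLength.UNLIMITED);
  target.addIndexes(IndexReader.open(copyOfServer1Index)); // merge the pruned copy
  target.close();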


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Re: Scale up design

2010-12-16 Thread Toke Eskildsen
On Thu, 2010-12-16 at 06:59 +0100, Ganesh wrote:
> 250 GB of data, 40 GB of Index Size, 60 million records is
> working fine with 1 GB RAM. We are storing minmal amount
> of data in index. We are doing sorting on Date. Even in
> single system, the database are shard.

Looking back in the list, I see that you're sharding on weeks with 50+
weeks in the index.

> build hosted solution. This stats will
> increase by minimum 10 times in 2 - 3 years. I plan to use
> 64 Bit, with 8 - 10 GB RAM allocated to JVM.

When making a conservative estimate and multiplying by 10, you must
remember to do the same for the system memory available for disk cache.

If your shards are searched sequentially, you could measure the response
time for a single shard (after warm up and with different queries), then
create a test-shard by merging 10 shards and measure response-time for
that. Subtracting the two numbers (to remove the overhead of the
front-end layer) and multiplying by 50 should give you a rough
estimate for the performance of an upscaled setup.

Another measurement suggestion: Divide the current performance of the
full setup by the performance of a single shard, then multiply the
performance of a single shard created by merging 10 shards by that number.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Scale up design

2010-12-15 Thread Toke Eskildsen
On Wed, 2010-12-15 at 09:42 +0100, Ganesh wrote:
> What is the advantage of going for 64 Bit.

Larger maximum heap, more memory in the machine.

> People claim performance and usage of more RAM.

Yes, pointers normally take up 64bit on a 64bit machine. Depending on
the application, the overhead can be anything from practically
non-existent to close to 100%. You can set an option for the JVM to try
and use smaller pointers on 64bit machines. This limits the maximum
memory allocation in the JVM to 32GB, which seems like a fair compromise
at this point in time.
http://wikis.sun.com/display/HotSpotInternals/CompressedOops
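For HotSpot that option is -XX:+UseCompressedOops (IBM's JVM has a similar
-Xcompressedrefs switch, AFAIK). Something like (heap size and jar name are
just placeholders):

  java -Xmx24g -XX:+UseCompressedOops -jar your-search-app.jar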

> In 32 Bit OS, JVM handles 1 to 1.5 GB of RAM then in case
> of 64 Bit, Single JVM cannot use more than 1.5 GB RAM?

Say what? When running on a 64bit OS, the JVM heap limit is normally the
system's per-process memory limit. For Linux this is generally well
above any real world hardware. For Windows it seems like you need to
enable something:
http://msdn.microsoft.com/en-us/library/aa366778%28v=vs.85%29.aspx
(note: I have no experience with 64bit Windows)

> Please help me with some more ideas. We need to design whether
> to scale out or scale up.

Maybe you could describe your vision in more detail? What scale are you
looking at? How large is your index in GB, how many documents, how fast
do you need the searcher to respond, are you doing any sorting or
faceting (and do you facet on a few unique values or things like title
or author)?

It makes little sense to try and get a single machine to handle billions
of documents with large faceting, but it seems silly to distribute 10GB
of index with 1 million documents. That is a general rule of thumb; as always,
your mileage may vary.

For the record, our current index is 40GB/9 million records. We're doing
sorting on title and faceting on 15 fields, out of which 2 have 4-6
million unique values. This runs on a single machine (okay, 2, but they
are mirrored) with 6GB of RAM and it works fine with sub-second response
times (normally <300ms AFAIR). Our experimental setup can get by with
1.2GB and would thus not require 64bit.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: a proof that every word is indexing properly

2010-12-02 Thread Toke Eskildsen
On Thu, 2010-12-02 at 03:54 +0100, David Linde wrote:
> Has anyone figured out a way to logically prove that lucene indexes every
> word properly?

The "Precision and recall in lucene"-thread seems relevant here.

> Our company has done alot of research into lucene, all of our IT department
> is really impressed and excited about lucene *except* one of the older
> search/indexing experts.
> Who doesn't want to move to a new search engine, is there anyway to
> logically prove, that lucene indexes every word properly?

That is a straw man argument. As the precision-thread shows, it is
extremely hard to define what "properly" means in relation to a
non-trivial retrieval system like Lucene.

If your grouch is an old-school database man, he might equate "properly"
with "Every word that exists in the source should be indexed so that a
search for that word will return all documents that contain it and no
other documents (phew)". As David implies, this is a bad test: It
satisfies the guy but does not a proper search system make.

But I'm just guessing here and it sounds like you're doing the same,
asking for ideas of proving Lucene functionality. Maybe you could turn
it around? Ask the guy what it would take for him to accept Lucene or
any other option for that matter. When you have that, you can discuss
whether his requirements are valid or not.

> One idea we considered is attempting to rebuild the source from the index,
> but it seems like doing that would take a huge effort.

It is also not possible in general. Writing specific code, you could
just cheat massively and store everything, giving you instant and 100%
correct rebuild.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene Software/Hardware Setup Question

2010-10-27 Thread Toke Eskildsen
On Tue, 2010-10-26 at 23:17 +0200, Kovnatsky, Eugene wrote:
> Thanks Toke. Very descriptive. A few more questions about your SSD
> drive(s)
>  - what is its current size 

4 * 64GB Samsung MCCOE64G5MPP-0VA00 drives. They were pretty cool two
years ago and still work very well for search-servers (random writes are
not good, but we don't need that for searching):
http://www.tomshardware.com/reviews/flash-ssd-hard-drive,2000-19.html

>  - do you project any growth in your index size

Yes. Hopefully an internal project for maintaining digital objects will
change gears during this year. This will result in objects with a lot of
meta data and some fulltexts. Depending on the economy, the number of
objects will range from 100,000+ to a few million. I would guesstimate
that this would mean a doubling of the index size, due to the richness
of the new objects.

Further out, the projections are unreliable. As a technician I hope for
a serious jump in size within a year or two, but I have hoped for that
the last two years. Politics does not move as fast as technology.

>  - if yes then how do you plan to correlate that with your hardware
> needs

A doubling of the index size makes the existing 256GB/machine a tight
fit. I seem to remember that there are two free slots in our servers, so
adding 2 new consumer-class SSDs is the obvious upgrade. We're switching
to a more memory- and CPU-efficient way of handling sorting and
faceting, so we should not need to boost CPU and RAM.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Next Word - Any Suggestions?

2010-10-26 Thread Toke Eskildsen
On Tue, 2010-10-26 at 14:27 +0200, Lucene wrote:
> In simple words, I need facet on the next word given a target word.

How large do you expect the documents to be and how many hits do you
need to process? If both are low, this seems to be a fairly
straightforward iteration with a HashMap to collect the counts.

If you're scaling up, you could make a field with all bi-grams and do
prefix faceting on that. 

> Long-term, do the opposite - frequency of the distinct terms before the word
> "fox":

By remembering the previous term, the iteration method should be about
the same as above. For the faceting method, just reverse the order in
the bi-grams.
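A tiny sketch of the iteration approach for a single document's text
(plain whitespace tokenization assumed; in practice you would reuse the
analyzer that built the index):

  Map<String, Integer> counts = new HashMap<String, Integer>();
  String[] tokens = text.toLowerCase().split("\\s+");
  for (int i = 0; i < tokens.length - 1; i++) {
    if ("fox".equals(tokens[i])) {          // the target word
      Integer old = counts.get(tokens[i + 1]);
      counts.put(tokens[i + 1], old == null ? 1 : old + 1);
    }
  }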

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Software/Hardware Setup Question

2010-10-26 Thread Toke Eskildsen
On Tue, 2010-10-26 at 02:16 +0200, Kovnatsky, Eugene wrote:
> I am trying to get some information on what enterprise hardware folks
> use out there. We are using Lucene extensively. Our total catalogs size
> is roughly 50GB between roughly 8 various catalogs, 2 of which take up
> 60-70% of this size.

That sounds a lot like our setup at the State and University Library,
Denmark. We have about 9M records with an index size of 59GB, with 4.5M
OAI-PMH harvested records and 2.5M bibliographic records from our
Aleph-system. The rest of the records are divided among 16 different
sources.

> So my question is - if any of you guys have similar catalog sizes then
> what kind of software/hardware do you have running, i.e. what app
> servers, how many, what hardware are these app servers running on?

We use a home brewed setup called Summa (open source) to handle the
workflow and the searching. It uses plain Lucene with a few custom
analyzers and some sorting, faceting, suggest and DidYouMean code.
One index holds all the material. Currently the index is updated on one
server and synced to two search-machines, but we're in the middle of
moving the index updating to the servers to get faster updates.

The hardware is 2 mirrored servers for fail-safe. They are running some
Linux variant and have 2.5GHz quad-core Xeon CPUs with 6MB of level 2
cache and 16GB of RAM. We are not using virtualization for this. The
machines use traditional harddisks for data storage and fairly old
enterprise-class SSD's for the index. To be honest, they are currently
overkill - without faceting the throughput is 50-100 searches/second,
including the overhead of using web-service calls. Faceting slows this
somewhat, but as our traffic is something like 5-10 searches/second at
prime time (guesstimating a lot here, as it has been a year or two
since I looked at the statistics), most of the time is spent idle.

Before that we used dual-core Xeons, again with 16GB of RAM and SSD's.
They also performed just fine with our workload and were only replaced
due to a general reorganization of the servers. Before that, we used
some older 3.1GHz single-core Xeon machines with only 1MB of level
2 cache, 32GB of slow RAM and traditional harddisks. My old 1.8GHz
single-core laptop was about as fast for indexing & searching, and they
stand testament that a lot of RAM and GHz does not help much when the
memory system is lacking.

We did a lot of testing some time ago and found that our searches were
mostly CPU-bound when using SSDs. We've talked with our hardware guys
about building new servers in anticipation of more data and the current
vision is relatively modest machines with quad-core i7, 16GB of RAM and
consumer-grade SSDs (Intel or SandForce). As we have mirrored servers
and since no one dies if they can't find a book at our library, using
enterprise-SSDs is just a waste of money.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to index large number of files?

2010-10-21 Thread Toke Eskildsen
On Thu, 2010-10-21 at 05:01 +0200, Sahin Buyrukbilen wrote:
> Unfortunately both methods didnt go through. I am getting memory error even
> at reading the directory contents.

Then your problem is probably not Lucene related, but the sheer number
of files returned by listFiles.

A Java File contains the full path name for the file. Let's say that
this is 50 characters, which translates to about (50 * 2 + 45) ~ 150
bytes for the Java String. Add an int (4 bytes) plus bookkeeping and
we're up to about 200 bytes/File.

4.5 million Files thus takes up about 1 GB. Not enough to explain the
OOM, but if the full path name of your files is 150 characters, the list
takes up 2 GB.

> Now, I am thinking this: What if I split 4.5million files into 100.000 (or
> less depending on java error) files directories, index each of them
> separately and merge those indexes(if possible).

You don't need to create separate indexes and merge them. Just split
your 4.5 million files into folders of more manageable sizes and perform
a recursive descent. Something like

public static void addFolder(IndexWriter writer, File folder) throws IOException {
  File[] files = folder.listFiles();
  if (files == null) { // not a directory or not readable
    return;
  }
  for (File file : files) {
    if (file.isDirectory()) {
      addFolder(writer, file);
    } else {
      // Create a Document from the file and add it with writer.addDocument(doc)
    }
  }
}

- Toke


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to find performance bottleneck

2010-10-06 Thread Toke Eskildsen
On Wed, 2010-10-06 at 12:22 +0200, Sergey wrote:
> When running application on Windows XP 32 bit machine the search time is 0.5 
> second. JVM is IBM Java 5 for 32 bit.
> But when running the same application on much more powerfull Windows Server 
> 2007 64 bit machine the search time is 3 seconds. JVM is IBM Java 5 for 64 
> bit.

If your memory allocation (-Xmx) is the same, the 64 bit machine will
have less available heap due to the 64 bit pointers. This could result
in excessive garbage collection. Try increasing the memory allocation
for the 64 bit machine by 50% or more.

Besides that, there should be no significant difference and you're left
with profiling. I recommend Process Explorer
http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
for general system and single process load inspection. I/O is often the
culprit, so check read and write calls.

Switching to the Java part, try using visualvm
https://visualvm.dev.java.net/
with the Visual GC plugin to see where the time is spent.

- Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Bettering search performance

2010-08-27 Thread Toke Eskildsen
On Fri, 2010-08-27 at 05:34 +0200, Shelly_Singh wrote:
> I have a lucene index of 100 million documents. [...]  total index size is 
> 7GB.

[...]

> I get a response time of over 2 seconds.

How many documents match such a query and how many of those documents do
you process (i.e. extract a term for)?


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Sorting a Lucene index

2010-08-25 Thread Toke Eskildsen
On Wed, 2010-08-25 at 07:16 +0200, Shelly_Singh wrote:
> I have 1 bln documents to sort. So, that would mean ( 8 bln bytes == 8GB RAM) 
> bytes. 
> All I have is 8 GB on my machine, so I do not think approach would work.

This implies that your numeric value can be more than 2 billion. Are you
sure that is true?


First suggestion (simple): Ensure that your sort field is stored and
sort by requesting the value for each document in the search result.
This works okay when the number of hits is small.
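A minimal sketch of the simple approach (Lucene 3.x-style API; the stored
field name "price" and the hit limit are illustrative, and it only makes
sense when the number of hits is small):

  TopDocs hits = searcher.search(query, maxExpectedHits);
  ScoreDoc[] docs = hits.scoreDocs;
  final Map<Integer, Long> values = new HashMap<Integer, Long>();
  for (ScoreDoc sd : docs) {
    values.put(sd.doc, Long.parseLong(searcher.doc(sd.doc).get("price")));
  }
  Arrays.sort(docs, new Comparator<ScoreDoc>() {
    public int compare(ScoreDoc a, ScoreDoc b) {
      return values.get(a.doc).compareTo(values.get(b.doc));
    }
  });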

Second suggestion (complex): Make an int-array with the sort-order of
your documents. This takes 4GB and needs to be calculated fully before
use, which will take time. After that sorted searches will be very fast
and handle a large number of hits well.

You can let your indexer maintain the sort-array so that the existing
order can be re-used when adding documents. Whether modifying an
existing order-array is cheaper than a full re-sort or not depends on
your batch size.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: slow search threads during a disk copy

2010-08-24 Thread Toke Eskildsen
On Mon, 2010-08-23 at 11:43 +0200, gag...@graffiti.net wrote:
> Intererstingly, the copy is quite fast (around 30s) when there are no 
> searches in progress.

I agree with Anshum: This looks very much like IO contention.

However, it might not just be a case of seek-time trouble: We've had
similar problems on our Linux server when copying large files from
harddisk A to A while performing searches on SSD B.

Our low-level guys tell me that this was due to the write-buffer being
filled and that reads were blocked while it was flushed (AFAIR). We've
experimented by mounting with sync instead of async which gave us much
better response times during copying at the cost of a substantially
slower copy. dirsync should also be worth looking into.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: scalability limit in terms of numbers of large documents

2010-08-16 Thread Toke Eskildsen
On Sat, 2010-08-14 at 03:24 +0200, andynuss wrote:
> Lets say that I am indexing large book documents broken into chapters.  A
> typical book that you buy at amazon.  What would be the approximate limit to
> the number of books that can be indexed slowly and searched quickly.  The
> search unit would be a chapter, so assume that a book is divided into 15-50
> chapters.  Any ideas?

Hathi Trust has an excellent blog where they write about indexing 
5 million+ scanned books.  http://www.hathitrust.org/blogs
They focus on OCR'ed books where dirty data is a big problem, but most
of their thoughts and solutions can be used for clean data too.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Best practices for searcher memory usage?

2010-07-16 Thread Toke Eskildsen
On Thu, 2010-07-15 at 20:53 +0200, Christopher Condit wrote:

[Toke: 140GB single segment is huge]

> Sorry - I wasn't clear here. The total index size ends up being 140GB
> but to try to help improve performance we build 50 separate indexes 
> (which end up being a bit under 3gb each) and then open them with a 
> parallel multisearcher.

Ah! That is a whole other matter, then. Now I understand why you go for
single segment indexes.

[Toke (assuming a single index): Why not optimize to 10 segments?]

> Is preferred(in terms of performance) to the above approach (splitting
> into multiple indexes)?

It's been 2 or 3 years since I experimented with the MultiSearcher, so
this is mostly guesswork on my part. Searching on a single index with
multiple segments and multiple indexes of single segments has the same
penalties: The weighting of the query requires a merge of query term
statistics from the parts. In principle it should be the same but as
always the devil is in the details.


50 parts do sound like a lot though. Even without range searches or
similar query-exploding searches, that means an awful lot of seeks. The
logarithmic nature of term lookups works against you here.

A rough estimate: A simple boolean query with 5 field/terms is weighted
by each searcher. Each index has 50K terms (conservative guess) so for
each condition, each searcher performs ~log2(50K) = 16 lookups. With 50
indexes that's 50 * 5 * 16 = 4000 lookups.

The 4K lookups do of course not all result in remote NFS requests, but
with 10-12GB of RAM on the search machine taken already, I would guess
that there is not much left for caching of the 140GB of index data?

Is it possible for you to measure the number of read requests that your
NFS server receives for a standard search? Another thing to try would be
to measure the same slow query 5 times after each other, thereby
ensuring that everything is fully cached. This should indicate if the
remote I/O is the main bottleneck or not.

The other extreme, a single fully optimized index, would (pathological
worst case compared to the rough estimate above) require 1 * 5 *
log2(50*50K) ~= 110 lookups for the terms.

I would have guessed that the 50 indexes are partly responsible for your
speed problems, but it sounds like you started out with a lower number
and later increased it?

> Not yet! I've added some benchmarking code to keep track of all 
> performance as I add these changes. Do you happen to know if the 
> Lucene benchmark package is still in use / a good thing to toy around with?

Sorry, no. The only performance testing we've done extensively is for
searches and for that we used our standard setup with logged queries in
order to emulate the production setting.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Best practices for searcher memory usage?

2010-07-15 Thread Toke Eskildsen
On Wed, 2010-07-14 at 20:28 +0200, Christopher Condit wrote:

[Toke: No frequent updates]

> Correct - in fact there are no updates and no deletions. We index 
> everything offline when necessary and just swap the new index in...

So everything is rebuilt from scratch each time? Or do you mean that
you're only adding new documents, not changing old ones?

Either way, optimizing to a single 140GB segment is heavy. Ignoring the
relatively light processing of the data, the I/O for merging still requires,
at the very minimum, reading and writing the 140GB. Even if you can read and
write 100MB/sec it still takes an hour. This is of course not that
relevant if you're fine with a nightly batch job.

> By more segments do you mean not call optimize() at index time?

Either that or calling it with maxNumSegments 10, where 10 is just a
wild guess. Your mileage will vary:
http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/index/IndexWriter.html#optimize%28int%29

[Toke: 10GB is a lot for such an index. What about sorting & faceting?]

> No faceting and no sorting (other than score) for this index...

[Toke: What queries?]

> Typically just TermQueries or BooleanQueries: (Chip OR Nacho OR Foo) 
> AND (Salsa OR Sauce) AND (This OR That)
> The latter is most typical.
> 
> With a single keyword it will execute in < 1 second. In a case where 
> there are 10 clauses it becomes much slower (which I understand, 
> just looking for ways to speed it up)...

As Erick Erickson recently wrote: "Since it doesn't make sense to me,
that must mean I don't understand the problem very thoroughly".

Your queries seem simple enough and I would expect response times well
under a second with a warmed index and conventional local harddrives.
Together with the unexpectedly high memory requirement, my guess is that
there's something going on with your terms. If you try opening the index
with Luke, it'll tell you the number of terms. If that is very high for
the fields you search on, this would explain the memory usage.

You can also take a look at the rank for the most common terms. If it is
very high this would explain the long execution times for compound
queries that use one or more of these terms. A stopword filter would
help in this case if such a filter is acceptable for you.
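
If Luke is not an option, a quick and dirty check with the TermEnum API
could look something like this (not tested; "content" is a placeholder
for one of your search fields):

  TermEnum termEnum = reader.terms(new Term("content", ""));
  long termCount = 0;
  int maxDocFreq = 0;
  do {
    Term term = termEnum.term();
    if (term == null || !"content".equals(term.field())) {
      break;
    }
    termCount++;
    maxDocFreq = Math.max(maxDocFreq, termEnum.docFreq());
  } while (termEnum.next());
  termEnum.close();
  System.out.println("content: " + termCount + " terms, highest docFreq "
                     + maxDocFreq);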

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Best practices for searcher memory usage?

2010-07-14 Thread Toke Eskildsen
On Tue, 2010-07-13 at 23:49 +0200, Christopher Condit wrote:
> * 20 million documents [...]
> * 140GB total index size
> * Optimized into a single segment

I take it that you do not have frequent updates? Have you tried to see
if you can get by with more segments without significant slowdown?

> The application will run with 10G of -Xmx but any less and it bails out. 
> It seems happier if we feed it 12GB. The searches are starting to bog 
> down a bit (5-10 seconds for some queries)...

10G sounds like a lot for that index. Two common memory-eaters are
sorting by field value and faceting. Could you describe what you're
doing in that regard?

Similarly, the 5-10 seconds for some queries seems very slow. Could you
give some examples of the queries that cause problems, together with
some examples of fast queries and how long they take to execute?


The standard silver bullet for an easy performance boost is to buy a couple
of consumer grade SSDs and put them on the local machine. If you're
gearing up to use more machines you might want to try this first.

Regards,
Toke


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: is this the right way to go?

2010-06-15 Thread Toke Eskildsen
On Thu, 2010-06-10 at 04:03 +0200, fujian wrote:
> Another thing is about unique. I thought it was unique "field value". If it
> means unique term, for English even loading all around 300,000 terms it
> won't take much memory, right? (Suppose the average length of term is 10,
> the total memory usage is 10*300,000=3MB)

It is only the unique field values, but remember that there is also an
array of length #docs with pointers to the strings that takes up 4 or 8
bytes/pointer, depending on 32bit/64bit JVM. Furthermore, the current
Lucene uses Strings, which take up a lot more than just #chars bytes:
300,000 Strings of average length 10 chars take up about 18MB.
http://www.javamex.com/tutorials/memory/string_memory_usage.shtml
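
The back-of-the-envelope version: roughly 40 bytes of String + char[]
overhead plus 2 bytes/char * 10 chars ~= 60 bytes per String, and
300,000 * 60 bytes ~= 18MB.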


I'm quietly hacking on a solution for this, but the current code is
still at the proof-of-concept stage and way too flaky to use for
production: https://issues.apache.org/jira/browse/LUCENE-2369


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Any Solid State Drive performance comparisons?

2010-06-13 Thread Toke Eskildsen
Rob Bygrave [robin.bygr...@gmail.com] wrote:
> Has anyone done a performance comparison for an index on a Solid State Drive
> (vs any other hard drive ... SATA/SCSI)?

We did a fair amount of testing two years ago and put some graphs at 
http://wiki.statsbiblioteket.dk/summa/Hardware The short version is that 
they'll probably boost your raw search speed a lot. However, often all the web 
service wrapping and similar is the main performance bottleneck.

Things have changed since then: CPUs, RAM and SSDs have gotten faster while hard 
disks have stayed the same, so the performance gap between SSDs and hard disks 
has widened.
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Right memory for search application

2010-04-29 Thread Toke Eskildsen
On Wed, 2010-04-28 at 16:57 +0200, Erick Erickson wrote:
> And you can extend this ad nauseum. For instance, you could use 6
> fields, yy, mm, dd, HH, MM, SS and have a very small number of
> unique values in each using really tiny amounts of memory to sort down
> to the second in this case.

A reference to a String takes up 4 or 8 bytes, depending on JVM and
setup. When doing String-based sorting in Lucene, there will be a
reference for each document for each sort-field. With 18M documents,
this will be at least 18M * 4 bytes * 6 = 432 MB. Plus the negligible
amount of String objects. It is less than the previously estimated
1.2GB, but still worth noting.

- Toke


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Right memory for search application

2010-04-27 Thread Toke Eskildsen
Samarendra Pratap [samarz...@gmail.com] wrote:
> 1. Our default option is sort by score, however almost 8% of searches use
> sorting on a field (mmddHHMMSS). This field is indexed as string (not as
> NumericField or DateField).

Guessing that the timestamp is practically unique for each document, sorting by 
String takes up a bit more than
18M * (40 bytes + 2 * "mmddHHMMSS".length() bytes) ~= 1.2 GB of RAM as the 
Strings are cached. Coupled with the normal overhead of just opening an index 
of your size (500MB by your measurements?), I would have guessed that 3600MB 
would definitely be enough to open the index and do sorted searches.

I realize that fiddling with production servers is dangerous, but connecting 
with JConsole and forcing a garbage collection might be acceptable? That should 
enable you to determine whether you're leaking memory or if it's just the JVM 
being greedy. I'd guess you're leaking though, as HotSpot does not normally
allocate up to the limit if it does not need to.

Anyway, changing to one of the optimized fields for sorting dates should shave 
1 GB off the memory requirement, so I'll recommend doing that no matter what 
the main cause of your memory problems is.
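
For reference, with Lucene 2.9+ that could be as simple as the following
sketch (not tested; secondsSinceEpoch being whatever numeric
representation you choose):

  // At index time: store the timestamp as a numeric value. If the field
  // is only used for sorting, a precisionStep of Integer.MAX_VALUE
  // avoids the extra trie terms used for range queries.
  doc.add(new NumericField("datetime", Integer.MAX_VALUE,
                           Field.Store.NO, true)
          .setLongValue(secondsSinceEpoch));

  // At search time: the FieldCache then holds one long per document
  // instead of one String per document.
  Sort sort = new Sort(new SortField("datetime", SortField.LONG));
  TopDocs topDocs = searcher.search(query, null, 20, sort);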

Regards,
Toke Eskildsen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Old Lucene src archive corrupt?

2010-03-02 Thread Toke Eskildsen
From: An Hong [an.h...@i365.com]
> I'm trying to download some old Lucene source, e.g., 
> http://archive.apache.org/dist/lucene/java/lucene-2.9.0-src.zip
I get "unexpected of Archive" report from WinRar, which never has had problem 
with zip files.  It's also the same problem with lucene-2.9.1-src.zip and 
perhaps with other versions as well.

I just downloaded the ZIP-file from the link above. No problems with Ubuntu's 
default unzip, WinRar 3.92 or Total Commander (the last two running under Wine).

Can you check if the md5-sum is correct after download? If not, try downloading 
again and see if that changes the md5-sum.
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: java.io.IOException: read past EOF since migration to 2.9.1

2010-02-17 Thread Toke Eskildsen
On Wed, 2010-02-17 at 15:18 +0100, Michael van Rooyen wrote:
> I recently upgraded from version 2.3.2 to 2.9.1. [...]
> Since going live a few days ago, however, we've twice had read past EOF 
> exceptions.

The first thing to do is check the Java version. If you're using Sun JRE
1.6.0, you might have encountered a nasty bug in the JVM:
http://issues.apache.org/jira/browse/LUCENE-1282


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: External sort

2009-12-18 Thread Toke Eskildsen
On Fri, 2009-12-18 at 12:47 +0100, Ganesh wrote:
> I am using Integer for datetime. As the data grows, I am hitting the 
> upper limit.

Could you give us some numbers? Document count, index size in GB, amount
of RAM available for Lucene?

> One option most of them in the group discussed about using EHcache. 
> 
> Let consider the below data get indexed. unique_id is the id 
> generated for every record. unique_id,  field1,  field2,  date_time
> 
> In Ehcache, Consider I am storing
> unique_id, date_time
> 
> How could i merge the results from Lucene and Ehcache? Do I need to 
> fetch all the search results and compare it against the EHcache
> results and decide (using FieldComparatorSource).

As you are storing the date_time in the index, you don't win anything by
caching the values externally: Reading the unique_id needed for lookup
in the Ehcache takes just as long as reading the date_time directly.

> (future thought / research) One more thought, Is there any way to 
> write the index in sorted order, May be while merging. Assign docid
> by sorting the selected field.

You cannot control the docID that way, but Lucene keeps documents in
index order, so you could do this by sorting your data before index
build.

You're touching on a recurring theme here, as coordination with external
data-handling could be done very efficiently if Lucene provided a
persistent id as a first-class attribute for documents. The problem is
that it would require a lot of changes to the API and that it would mean
an additional non-optional overhead for all Lucene users. I haven't kept
track of enough threads on the developer-list, so a better solution to
the problem might have been found.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: External sort

2009-12-17 Thread Toke Eskildsen
(reposted to the list as I originally sent my reply to only Marvin by mistake)

From: Marvin Humphrey [mar...@rectangular.com]
> We use a hash set to perform uniquing, but when there are a lot of unique
> values, the memory requirements of the SortWriter go very high.
>
> Assume that we're sorting string data rather than date time values (so the
> memory occupied by all those unique values is significant).  What algorithms
> would you consider for performing the uniquing?

We do have a structure that is comparable to the above: An ordered list of
Strings, where updates to a Lucene index require updates to the list. We do
this by having a long[] with collator-sorted offsets into a file with all the
Strings in insertion order.

When a new String is considered, existence is checked in O(log(n)) by
binary search. If it is new, it is appended to the file and the offset inserted
into the offset-array (arrayCopy). That is only usable for a relatively small
number of inserts, due to the arrayCopy. SSDs are very helpful for
String lookup performance here.

For building the initial list, we use the terms for a given field, so we know
that they are unique: We append them all to a file and keep track of the offsets,
then sort the offset-list based on what they're pointing at. By using caching
and mergesort (heapsort _kills_ performance in this scenario, as there is no
locality), it performs fairly well.
(I'm lying a little about our setup, for sake of brevity)
Memory usage excl. caching is 8 bytes * #Strings + housekeeping.

By adding another level of indirection and storing the offsets as a file and
sorting a list of pointers to the offsets (argh), memory requirements can be
dropped to 4 bytes * # Strings. That doubles the number of seeks, so I
would only recommend it with SSDs and in tight spots.

We do have some code for duplicate reduction (long story): When the list
is sorted, step through the offsets and compare the Strings for the entries
at position x and x+1. If the Strings are equal, set offset[x+1] = offset[x].
When the iteration has finished, the offsets only point to unique Strings
and the Strings-file contains some non-referenced entries, that can be
cleaned up by writing a new Strings-file.
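
As a small sketch (readString(offset) being a placeholder for a lookup
in the Strings-file):

  // offsets is sorted by the Strings the entries point at
  for (int x = 0 ; x < offsets.length - 1 ; x++) {
    if (readString(offsets[x]).equals(readString(offsets[x + 1]))) {
      offsets[x + 1] = offsets[x]; // both entries now point to the same String
    }
  }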


Somewhat related, I'm experimenting with a sliding-window approach to
sorting Lucene terms, which might be usable in your scenario. It is a fairly
clean trade-off between memory and processing time. It is outlined in the
thread "heap memory issues when sorting by a string field" and it should
be fairly straight-forward to adapt the algorithm to remove duplicates.
The main requirement is that it is possible to iterate over the Strings
multiple times.


In general, I find that the memory-killer in these cases tends to be all the
wrapping and not the data themselves. A String has 38+ bytes of
overhead and if you're mainly using ASCII, Java's 2-byte char sure
means a lot of always-0-bits in memory. I don't know if it is feasible, but
making something like
class LittleString {
  // needs java.nio.charset.Charset (Java 6+) and java.util.Arrays
  private static final Charset UTF8 = Charset.forName("UTF-8");
  private final byte[] utf8;
  LittleString(String s) {
    utf8 = s.getBytes(UTF8);
  }
  public int hashCode() {
    return Arrays.hashCode(utf8); // derived directly from the raw bytes
  }
  public String toString() {
    return new String(utf8, UTF8);
  }
  public boolean equals(Object o) {
    if (!(o instanceof LittleString)) return false;
    return Arrays.equals(utf8, ((LittleString)o).utf8);
  }
}
and storing that in your HashSet instead of the Strings directly ought
to cut a fair amount of your memory usage. I don't know how much
it would cost in performance though and the HashSet structure itself
isn't free.

Regards,
Toke Eskildsen
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: External sort

2009-12-17 Thread Toke Eskildsen
Sigh... Forest for the trees, Toke.

The date time represented as an integer _is_ the order and at the same time a 
global external representation. No need to jump through all the hoops I made. I 
had my mind full of a more general solution to the memory problem with sort.


Instead of the order-array being a long[] with order and datetime, it should be 
just an int[] with datetime. The FieldCacheImpl does this for INT-sorts, so
there's no need for extra code if you just store the datetime as an integer 
(something like Integer.toString(datetimeAsInt) for the field-value) and use 
SortField(fieldname, SortField.INT) to sort with.
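
A sketch of that (not tested, assuming the datetime fits in a signed
int):

  // At index time
  doc.add(new Field("datetime", Integer.toString(datetimeAsInt),
                    Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));

  // At search time: the FieldCache keeps one int per document
  Sort sort = new Sort(new SortField("datetime", SortField.INT));
  TopDocs topDocs = searcher.search(query, null, 20, sort);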

If you cannot store the datetime as an integer (e.g. if you don't control the 
indexing), you can use the FieldCacheImpl with a custom int-parser that 
translates your datetime representation to int.

The internal representation takes 4 bytes/document. If you need to go lower 
than that, I'll say you have a very uncommon setup.

It can be done by making custom code and storing the order-array on disk, but 
access-speed would suffer tremendously for searches with millions of hits. An 
alternative would be to reduce the granularity of the datetime and use 
SortField.SHORT or SortField.BYTE. A third alternative would be to count the
number of unique datetime-values and make a compressed representation, but that 
would make the creation of the order-array more complex.
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: External sort

2009-12-17 Thread Toke Eskildsen
On Thu, 2009-12-17 at 15:34 +0100, Ganesh wrote:
> Thanks Toke. I worried to use  long[] inverted = new long[reader.maxDoc]; 
> as the memory consumption will be high for millions of document.

Well, how many documents do you have? 10 million? That's just 160MB in
overhead, of which 80MB are temporary on the first search.

> Any idea of building external sort cache?  

You could dump the order-array on disk (huge performance hit on
conventional harddisks), but it's hard to avoid the temporary
inverse-array upon first search. Of course, you could generate it on
index build and thus have a memory-hit of virtually 0.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: External sort

2009-12-17 Thread Toke Eskildsen
On Thu, 2009-12-17 at 07:48 +0100, Ganesh wrote:
> I am using v2.9.1 I am having multiple shards and i need to do only 
> date time sorting. Sorting consumes 50% of RAM.

I'm guessing that your date-times are representable in 32 bits (signed
seconds since epoch or such - that'll work until 2038)? If so, it should
be possible to do very efficient sorting, both memory- and
performance-wise.

Make your own sorter by implementing SortComparatorSource.

The SortComparatorSource returns a ScoreDocComparator which contains an
array of longs, in which the first 32 bits designates the order of the
document at the given index (docID) and the last 32 bits holds the date
time.

The ScoreDocComparator's methods are trivial:
  public int compare(ScoreDoc i, ScoreDoc j) {
    // The order (rank) is kept in the upper 32 bits
    return (int)((order[i.doc] >>> 32) - (order[j.doc] >>> 32));
    // Or is it the other way? I always mix them up
  }
  public Comparable sortValue(ScoreDoc i) {
    // The date time is kept in the lower 32 bits
    return Integer.valueOf((int)(order[i.doc] & 0xFFFFFFFFL));
  }
  public int sortType() {
    return SortField.CUSTOM;
  }

Now, for generating the order-array, we do something like this in the
SortComparatorSource:

  TermDocs termDocs = reader.termDocs();
  // inverted[docID] == datetime | docID
  long[] inverted = new long[reader.maxDoc()];
  TermEnum termEnum = reader.terms(new Term(fieldname, ""));

  do {
    Term term = termEnum.term();
    if (term == null || !fieldname.equals(term.field())) {
      break;
    }
    // stringDateTimeToInt is your own conversion from field text to seconds
    long dateTime = (long)stringDateTimeToInt(term.text()) << 32;
    termDocs.seek(termEnum);
    while (termDocs.next()) {
      inverted[termDocs.doc()] = dateTime | termDocs.doc();
    }
  } while (termEnum.next());
  termDocs.close();
  termEnum.close();

  // inverted[order] == datetime | docID
  Arrays.sort(inverted); // works for date time 1970+

  // order[docID] == order | datetime
  long[] order = new long[inverted.length];
  for (int o = 0 ; o < inverted.length ; o++) {
    int docID = (int)(inverted[o] & 0xFFFFFFFFL);
    order[docID] = ((long)o << 32) | (inverted[o] >>> 32);
  }

It would be nice to avoid the extra array needed for reordering, but I'm
fresh out of ideas. Still, the memory-overhead is just
  8 bytes (long) * 2 (arrays) * maxDoc
and performance should be high as Arrays.sort(long[]) is fast and
everything runs without taxing the garbage collector.
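
Hooking it into a search would be something like the following, assuming
the implementing class is called DateTimeComparatorSource (the
SortComparatorSource-based SortField constructor is deprecated in 2.9
in favour of FieldComparatorSource, but the principle is the same):

  Sort sort = new Sort(
      new SortField("datetime", new DateTimeComparatorSource()));
  TopDocs topDocs = searcher.search(query, null, 20, sort);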


Caveat lector: I haven't implemented the stuff above, so it's just an
idea written in not-so-pseudo code.

Regards,
Toke Eskildsen



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: heap memory issues when sorting by a string field

2009-12-14 Thread Toke Eskildsen
On Fri, 2009-12-11 at 14:53 +0100, Michael McCandless wrote:
> How long does Lucene take to build the ords for the toplevel reader?
> 
> You should be able to just time FieldCache.getStringIndex(topLevelReader).
>
> I think your 8.5 seconds for first Lucene search was with the
> StringIndex computed per segment?

Cold disk-cache (directly after reboot):
[2009-12-14 14:44:10,914] Requesting StringIndex for field sort_title
[2009-12-14 14:44:20,326] Got StringIndex of length 2916008 in 9
seconds, 412 ms

Warm disk-cache (3 minutes after first test):
[2009-12-14 14:44:10,914] Requesting StringIndex for field sort_title
[2009-12-14 14:44:20,326] Got StringIndex of length 2916008 in 8
seconds, 414 ms

The response time for the first sorted search was about 8.5 seconds, but
that was after 6 non-sorted searches without the use of explicit field
cache, so some amount of warm-up was performed.

Caveat: I must stress that this is very much ad hoc testing.


- FieldCache test code

// Meant for testing
private FieldCache.StringIndex getStringIndex(
    IndexReader reader, String field) {
  log.info("Requesting StringIndex for field " + field);
  Profiler profiler = new Profiler();
  FieldCache.StringIndex stringIndex;
  try {
    stringIndex = FieldCache.DEFAULT.getStringIndex(reader, field);
  } catch (IOException e) {
    log.error("Could not retrieve StringIndex", e);
    return null;
  }
  log.info("Got StringIndex of length " + stringIndex.order.length
           + " in " + profiler.getSpendTime());
  return stringIndex;
}

- Lucene 2.4 index

ls -l index/sb/20091201-115941/lucene/

-rw-rw-r-- 1 summatst summatst 12840211452 Dec  2 11:21 _0.cfx
-rw-rw-r-- 1 summatst summatst   361027455 Dec  2 11:19 _32.cfs
-rw-rw-r-- 1 summatst summatst   373374178 Dec  2 11:19 _65.cfs
-rw-rw-r-- 1 summatst summatst   438076782 Dec  2 11:21 _98.cfs
-rw-rw-r-- 1 summatst summatst   463141239 Dec  2 11:19 _cb.cfs
-rw-rw-r-- 1 summatst summatst  1862427706 Dec  2 11:19 _rm.cfs
-rw-rw-r-- 1 summatst summatst 203 Dec  2 11:21 segments_3
-rw-rw-r-- 1 summatst summatst  20 Dec  2 11:18 segments.gen

-

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


