Re: Simultaneous Indexing and searching

2020-09-09 Thread Christoph Kaser

Hi Richard,

it seems like Lucene index replication could help you here: you could 
create the index on the backend server and replicate it to the frontend 
servers.


http://shaierera.blogspot.com/2013/05/the-replicator.html

http://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html
This way, the frontend servers can issue queries against their copy of 
the index, and the backend server can perform updates that are then 
replicated to the frontend server indexes.
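
For reference, the moving parts of Lucene's replicator module fit together roughly like this (a sketch only; the analyzer, directories, SearcherManager and callback are placeholders, and you should check the org.apache.lucene.replicator javadocs of your version):

// Backend (publisher): the writer must use a SnapshotDeletionPolicy
IndexWriterConfig conf = new IndexWriterConfig(analyzer);
conf.setIndexDeletionPolicy(new SnapshotDeletionPolicy(conf.getIndexDeletionPolicy()));
IndexWriter writer = new IndexWriter(backendIndexDir, conf);
Replicator replicator = new LocalReplicator(); // expose it over HTTP via the http subpackage
// after each batch of updates:
writer.commit();
replicator.publish(new IndexRevision(writer));

// Frontend (consumer): pulls published revisions into its local index copy
Replicator source = ...; // e.g. an HttpReplicator pointing at the backend server
Callable<Boolean> reopenCallback = () -> { searcherManager.maybeRefresh(); return true; };
ReplicationClient client = new ReplicationClient(
    source,
    new IndexReplicationHandler(frontendIndexDir, reopenCallback),
    new PerSessionDirectoryFactory(tempWorkDir));
client.startUpdateThread(60000, "replication"); // or call client.updateNow() on demand

Each frontend then searches its own local copy through a normal SearcherManager.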


Best regards
Christoph

On 01.09.2020 08:28, Richard So wrote:

Hi there,

I am a beginner at using Lucene, especially in the area of indexing and searching 
simultaneously.

Our environment has several webservers for the search front-end that submit 
search requests, and a backend server that does the full-text indexing; the 
index files are stored on an NFS volume, so both the indexing and the searches 
point to this same NFS volume. Indexing may happen whenever new documents 
come in or get updated.

Our project requires that both indexing and searching can happen at the 
same time (or that blocking be as short as possible, e.g. under a second).

We have searched the Internet and found references like these:
http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html
http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html

but it seems those only apply to indexing and searching on the same server 
(correct me if I am wrong).

Could somebody tell me how to implement such a system, e.g. which Lucene 
classes to use and their caveats, how to set it up, etc.?

Regards
Richard




searchAfter is missing results when custom noncontinuous slices are used

2017-05-24 Thread Christoph Kaser

Hello everybody,

I have observed an unexpected behavior in Lucene, and I am unsure 
whether this is a bug, or a missing warning in the documentation:


I am using the IndexSearcher with an ExecutorService in order to take 
advantage of multiple CPU cores during searches. I want to limit the 
number of cores a single search can occupy, so I have overridden the 
IndexSearcher method

protected LeafSlice[] slices(List<LeafReaderContext> leaves)
to return a fixed number of slices (e.g. 4).

I tried to create slices that are about the same size by looping over 
the leaves (ordered by size descending) and adding the current leaf to 
the slice with the smallest number of documents.
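
In code, the override looks roughly like this (a reconstructed sketch of what I described, using the 6.x API where LeafSlice takes LeafReaderContext varargs):

@Override
protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
    final int numSlices = 4; // fixed upper bound on cores per search
    List<List<LeafReaderContext>> groups = new ArrayList<>();
    for (int i = 0; i < numSlices; i++) groups.add(new ArrayList<>());
    int[] docCounts = new int[numSlices];
    // largest leaves first, each assigned to the currently smallest slice
    List<LeafReaderContext> sorted = new ArrayList<>(leaves);
    sorted.sort((a, b) -> b.reader().maxDoc() - a.reader().maxDoc());
    for (LeafReaderContext ctx : sorted) {
        int smallest = 0;
        for (int i = 1; i < numSlices; i++)
            if (docCounts[i] < docCounts[smallest]) smallest = i;
        groups.get(smallest).add(ctx);
        docCounts[smallest] += ctx.reader().maxDoc();
    }
    LeafSlice[] slices = new LeafSlice[numSlices];
    for (int i = 0; i < numSlices; i++)
        slices[i] = new LeafSlice(groups.get(i).toArray(new LeafReaderContext[0]));
    return slices;
}

Note that the resulting slices are noncontinuous with respect to the leaves list, which turned out to matter, as described below.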


This worked well, until I stumbled upon a query for which searchAfter 
seemed to skip hits, so that the total number of hits obtained by 
multiple calls to searchAfter was lower than TopDocs.totalHits.


The issue seems to be how searchAfter works vs how TopDocs.merge works:

searchAfter skips every document with a higher score than the "after" 
document. In case of equal scores, it uses the document id and skips 
every document with a <= document id (see PagingFieldCollector).


TopDocs.merge uses the score to determine which hits should be part of 
the merged TopDocs. In case of equal scores, it uses the shard index 
(this corresponds to the slices the IndexSearcher uses) to break ties 
(see ScoreMergeSortQueue.lessThan)


So if the shards are noncontinuous (as they are in my case), searchAfter 
uses a different way of sorting the documents than TopDocs.merge, and 
therefore hits are skipped.


Here are my questions:

* Are slices meant to be continuous "sublists" of the passed 
leaves-list? Or is my way of slicing meant to be supported?
* If my way of slicing is not supported, could you either add a warning 
to the javadocs of the slices method or maybe even add a check for a 
legal return value of slices()?

* Should I create a jira issue for this?

Sorry for the wall of text, I hope I explained the problem in an 
understandable way!


Thank you and best regards
Christoph




Re: Lucene commit

2016-08-22 Thread Christoph Kaser

Hello Paul,

this is already possible using 
DirectoryReader.openIfChanged(indexReader, indexWriter). This will give 
you an IndexReader that already "sees" all changes made by the writer 
(up to that point), even though the changes were not yet committed:

https://lucene.apache.org/core/6_1_0/core/org/apache/lucene/index/DirectoryReader.html#openIfChanged-org.apache.lucene.index.DirectoryReader-org.apache.lucene.index.IndexWriter-
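
A minimal usage sketch (ref counting and error handling omitted):

// initial NRT reader that sees the writer's uncommitted changes:
DirectoryReader reader = DirectoryReader.open(indexWriter);
// ... later, after more updates through the same writer:
DirectoryReader newReader = DirectoryReader.openIfChanged(reader, indexWriter);
if (newReader != null) { // null means nothing has changed
    reader.close();
    reader = newReader;
}
IndexSearcher searcher = new IndexSearcher(reader);

In practice, a SearcherManager constructed with the IndexWriter does this bookkeeping for you.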

Regards,
Christoph

On 22.08.2016 at 08:31, Paul Masurel wrote:

Hi,

If I understand correctly, Lucene indexing threads are working on their own
individual segment.
When a thread has enough documents in its segment, it flushes it to disk
and starts a new one.
But segments are only searchable when they are committed.

Now my question is, wouldn't it be nice to be able to set up Lucene so that
segments are made searchable as soon as they are flushed?

Commit would still play the role of "checkpoint" in a hardware-failure
scenario.
This is different from the old "autocommit" feature in that sense.

Of course, this "searchable yet not committed flushed segment" leads to the
following weird behavior:
- documents can become searchable and, in case of failure, become not
searchable
(and then eventually searchable again if the client does its job properly
and reindexes rolled-back documents).
- one document can become searchable after another one even though it was
added before.

The benefit would be to reduce the average latency for a document
to become searchable, without hurting throughput by calling commit() too
frequently.

Regards,

Paul






Re: Newbie Questions

2016-08-10 Thread Christoph Kaser
There is no way to "update" a document in Lucene; you always have to 
remove the existing document and add the updated version with ALL its 
fields. The updateDocument method of IndexWriter exists only for 
convenience (and to ensure the operation is atomic), but internally it 
does just that: remove the old document and add a new one.
In general, this works best if you get all fields of the document from 
your external datasource instead of the index, because depending on your 
field configuration, some information might be missing (especially 
for fields that are not STORED or DocValues).
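
As a sketch (the "id" key term and the other field names are made up for illustration):

// re-read ALL fields from the external datasource ...
Document doc = new Document();
doc.add(new StringField("id", id, Field.Store.YES));
doc.add(new TextField("title", titleFromDatabase, Field.Store.YES));
doc.add(new TextField("body", bodyFromDatabase, Field.Store.NO));
// ... then delete-and-add atomically, keyed on a unique term:
writer.updateDocument(new Term("id", id), doc);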


See also the Lucene FAQ:
https://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_update_a_document_or_a_set_of_documents_that_are_already_indexed.3F

Regards

On 09.08.2016 at 20:51, lukes wrote:

Thanks for the reply.

   Is there a way to partially update the document? I know there's an API
updateDocument on IndexWriter, but that seems to create a new document with
just the field I am specifying. What I want is to delete some fields from an
existing (indexed) document, and then add some new fields (which could or
could not be the same). Alternatively, I tried to search for the document,
then call removeFields and finally updateDocument, but now any search after
the above process is not able to find that document (I created a new
IndexReader). Am I missing anything?

Regards.









Re: Replacement for Filter-as-abstract-class in Lucene 5.4?

2016-01-15 Thread Christoph Kaser
Isn't that what ConstantScoreQuery does? The only difference is that it 
returns 1.0f as score instead of 0.0f.


Regards
Christoph

On 15.01.2016 at 09:27, Uwe Schindler wrote:

I had the same problem while migrating old code. Filter is very convenient to 
use, so why is it deprecated? I agree we should convert all internal filters to 
use this, but people from the outside who just quickly want to create a 
Filter based on simple stuff like bitsets should get an easy API without 
multi-pass stuff or the need to implement a full scorer returning 0 as score.

So I tend toward renaming the current Filter class to BaseNonScoringQuery, and 
in 5.x keeping Filter deprecated (maybe simply as an empty abstract subclass of this one).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-Original Message-
From: Trejkaz [mailto:trej...@trypticon.org]
Sent: Friday, January 15, 2016 1:52 AM
To: Lucene Users Mailing List <java-user@lucene.apache.org>
Subject: Fwd: Replacement for Filter-as-abstract-class in Lucene 5.4?

Hi all.

Filter is now deprecated, which I already knew was in the pipeline.

The docs say:

"Use Query objects instead: when queries are wrapped in a
 ConstantScoreQuery or in a BooleanClause.Occur.FILTER clause,
 they automatically disable the score computation so the Filter
 class does not provide benefits compared to queries anymore."

That's fair enough and an easy change to do on the caller side.
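
On 5.4 that caller-side change looks roughly like this (sketch):

// score only the main query; the former Filter becomes a non-scoring clause:
BooleanQuery query = new BooleanQuery.Builder()
    .add(mainQuery, BooleanClause.Occur.MUST)
    .add(filterQuery, BooleanClause.Occur.FILTER)
    .build();
// or, to match while ignoring scores entirely:
Query noScores = new ConstantScoreQuery(filterQuery);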

The other thing we are using Filter for is the other thing it mentions
in the Javadoc:

"Convenient base class for building queries that only perform
 matching, but no scoring. The scorer produced by such queries
 always returns 0 as score."

What is the new convenient way to implement your own queries that
don't do scoring?

TX








Re: Replacement for Filter-as-abstract-class in Lucene 5.4?

2016-01-15 Thread Christoph Kaser

Nevermind, I missed the part about it being a base class for own queries.

Sorry for the confusion!

On 15.01.2016 at 09:49, Christoph Kaser wrote:
Isn't that what ConstantScoreQuery does? The only difference is that 
it returns 1.0f as score instead of 0.0f.


Regards
Christoph

On 15.01.2016 at 09:27, Uwe Schindler wrote:
I had the same problem while migrating old code. Filter is very 
convenient to use, so why is it deprecated? I agree we should convert 
all internal filters to use this, but people from the outside who 
just quickly want to create a Filter based on simple stuff like 
bitsets should get an easy API without multi-pass stuff or the need 
to implement a full scorer returning 0 as score.


So I tend toward renaming the current Filter class to BaseNonScoringQuery, 
and in 5.x keeping Filter deprecated (maybe simply as an empty abstract 
subclass of this one).


Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-Original Message-
From: Trejkaz [mailto:trej...@trypticon.org]
Sent: Friday, January 15, 2016 1:52 AM
To: Lucene Users Mailing List <java-user@lucene.apache.org>
Subject: Fwd: Replacement for Filter-as-abstract-class in Lucene 5.4?

Hi all.

Filter is now deprecated, which I already knew was in the pipeline.

The docs say:

"Use Query objects instead: when queries are wrapped in a
 ConstantScoreQuery or in a BooleanClause.Occur.FILTER clause,
 they automatically disable the score computation so the Filter
 class does not provide benefits compared to queries anymore."

That's fair enough and an easy change to do on the caller side.

The other thing we are using Filter for is the other thing it mentions
in the Javadoc:

"Convenient base class for building queries that only perform
 matching, but no scoring. The scorer produced by such queries
 always returns 0 as score."

What is the new convenient way to implement your own queries that
don't do scoring?

TX








Re: IndexWriter.addIndexes with LeafReader parameter

2016-01-13 Thread Christoph Kaser
You could try using org.apache.lucene.index.SlowCodecReaderWrapper to wrap 
your index reader:
SlowCodecReaderWrapper.wrap(indexReader) returns a CodecReader from an 
index reader.
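
Roughly (a sketch; MyFilteringLeafReader stands in for your own filtering reader):

LeafReader filtered = new MyFilteringLeafReader(originalLeafReader);
CodecReader codecReader = SlowCodecReaderWrapper.wrap(filtered);
archiveWriter.addIndexes(codecReader); // addIndexes(CodecReader...) in 5.x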


Regards
Christoph

On 13.01.2016 at 09:09, Manner Róbert wrote:

Unfortunately I cannot use that, because I do not want to copy all the
indexes. Our use case is "archiving" of indexes: we would like to copy out
to a separate index (and remove) part of the indexes, for example those that
are more than a month old. We achieved this by writing a Reader which does
the filtering, and using the writer to write them out.

Robert

On Tue, Jan 12, 2016 at 8:00 PM, Dawid Weiss  wrote:


You can addIndexes(Directory... dirs) -- then you don't have to deal
with CodecReader?

Dawid

On Tue, Jan 12, 2016 at 4:43 PM, Manner Róbert  wrote:

Hi,

we have used lucene 4.7.0 before, we are on the way to upgrade to 5.4.0.

The problem I have is that writer.addIndexes now needs a CodecReader and
does not accept the basic LeafReader that we have.

Is there any efficient way to work around that? How would you do it?
Query the documents and addDocument one by one? Or can I somehow wrap my
LeafReader in a CodecReader? What is the reason to require CodecReader,
and what is it for? Its documentation seems to be missing, and I could not
find anything on the net either.

Thanks for any pointers in advance,

Robi




Re: How to merge several Taxonomy indexes

2015-04-02 Thread Christoph Kaser

Hi Gimantha,

why do you use a RAMDirectory? If your merged index fits into RAM completely, 
an MMapDirectory should offer almost the same performance. And if it does not, 
MMapDirectory is definitely the better choice.
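
I.e., something like this (sketch; 4.x MMapDirectory takes a File, 5.x a java.nio.file.Path, and the path here is just an example):

// memory-mapped destination: the O/S cache keeps the hot parts in RAM,
// but the JVM heap cannot run out of memory because of the index size
Directory destDir = new MMapDirectory(new File("/path/to/merged-index"));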

Regards
Christoph

On 02.04.2015 at 12:38, Gimantha Bandara wrote:

Hi All,

I have successfully set up merged indices, and drilldown and the usual search
operations work perfectly.
But I have a side question: if I select RAMDirectory as the destination
index for merging, the JVM can probably go out of memory if the merged
indices are too big. Is there a way I can handle this issue?

On Tue, Mar 24, 2015 at 12:18 PM, Gimantha Bandara giman...@wso2.com
wrote:


Hi Christoph,

My mistake. :) It does exactly what I need; I figured it out later.
Thanks a lot!

On Tue, Mar 24, 2015 at 3:14 AM, Gimantha Bandara giman...@wso2.com
wrote:


Hi Christoph,

I think TaxonomyMergeUtils is for merging a taxonomy directory and an index
together (correct me if I am wrong). Can it be used to merge several
taxonomy directories together and create one taxonomy index?

On Mon, Mar 23, 2015 at 9:19 PM, Christoph Kaser lucene_l...@iconparc.de wrote:
Hi Gimantha,

have a look at the class org.apache.lucene.facet.taxonomy.TaxonomyMergeUtils,
which does exactly what you need.

Best regards,
Christoph

On 23.03.2015 at 15:44, Gimantha Bandara wrote:


Hi all,

Can anyone point me to how to merge several taxonomy indexes? My
requirement is as follows: I have several taxonomy indexes and normal
document indexes. I want to merge the taxonomy indexes together and the
other document indexes together, and perform searches on them. One part I
have figured out, and it is easy: to merge document indexes, all I have to
do is create a MultiReader and pass it to IndexSearcher. But I am stuck at
merging the taxonomy indexes. Is there a way to merge taxonomy indexes?









--
Gimantha Bandara
Software Engineer
WSO2. Inc : http://wso2.com
Mobile : +94714961919














Re: How to merge several Taxonomy indexes

2015-03-23 Thread Christoph Kaser

Hi Gimantha,

have a look at the class 
org.apache.lucene.facet.taxonomy.TaxonomyMergeUtils, which does exactly 
what you need.
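
From memory, usage is roughly as follows (a sketch only; check the exact signature against your version's javadocs, and call it once per source index/taxonomy pair):

TaxonomyMergeUtils.merge(srcIndexDir, srcTaxoDir,
    new DirectoryTaxonomyWriter.MemoryOrdinalMap(), // maps source ordinals to merged ones
    destIndexWriter,  // IndexWriter on the destination search index
    destTaxoWriter,   // DirectoryTaxonomyWriter on the destination taxonomy
    new FacetsConfig());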


Best regards,
Christoph

On 23.03.2015 at 15:44, Gimantha Bandara wrote:

Hi all,

Can anyone point me to how to merge several taxonomy indexes? My requirement
is as follows: I have several taxonomy indexes and normal document
indexes. I want to merge the taxonomy indexes together and the other document
indexes together, and perform searches on them. One part I have figured out,
and it is easy: to merge document indexes, all I have to do is create a
MultiReader and pass it to IndexSearcher. But I am stuck at merging the
taxonomy indexes. Is there a way to merge taxonomy indexes?










Re: Can't get case insensitive keyword analyzer to work

2014-08-12 Thread Christoph Kaser

Hello Milind,

if you don't set the field to be tokenized, no analyzer will be used and 
the field's contents will be indexed as-is, i.e. case-sensitively.
It's the analyzer's job to tokenize the input, so if you use an analyzer 
that does not separate the input into several tokens (like the 
KeywordAnalyzer), your input will remain untokenized.
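
In other words, the field has to be tokenized for the analyzer to run at all; sketched in the 4.6-style API used below:

FieldType theFieldType = new FieldType();
theFieldType.setStored(true);
theFieldType.setIndexed(true);
theFieldType.setTokenized(true); // lets KeywordTokenizer + LowerCaseFilter run;
                                 // the single "token" is still the whole input, lowercased
theDocument.add(new Field("sn", "SN345-B21", theFieldType));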


Regards
Christoph

On 12.08.2014 at 03:38, Milind wrote:

I found the problem, but it makes no sense to me.

If I set the field type to be tokenized, it works. But if I set it to not
be tokenized, the search fails. I.e., I have to pass true to the method:
 theFieldType.setTokenized(storeTokenized);

I want the field to be stored as un-tokenized.  But it seems that I don't
need to do that.  The LowerCaseKeywordAnalyzer works if the field is
tokenized, but not if it's un-tokenized!

How can that be?


On Mon, Aug 11, 2014 at 1:49 PM, Milind mili...@gmail.com wrote:


It does look like the lowercase is working.

The following code

 Document theDoc = theIndexReader.document(0);
 System.out.println(theDoc.get("sn"));
 IndexableField theField = theDoc.getField("sn");
 TokenStream theTokenStream = theField.tokenStream(theAnalyzer);
 System.out.println(theTokenStream);

produces the following output
 SN345-B21
 LowerCaseFilter@5f70bea5 term=sn345-b21,bytes=[73 6e 33 34 35 2d 62
32 31],startOffset=0,endOffset=9

But the search does not work.  Anything obvious popping out for anyone?


On Sat, Aug 9, 2014 at 4:39 PM, Milind mili...@gmail.com wrote:


I looked at a couple of examples on how to get a keyword analyzer to be
case-insensitive, but I think I missed something, since it's not working for
me.

In the code below, I'm indexing text in upper case and searching in lower
case, but I get back no hits. Do I need to do something more while
indexing?

 private static class LowerCaseKeywordAnalyzer extends Analyzer
 {
     @Override
     protected TokenStreamComponents createComponents(String theFieldName, Reader theReader)
     {
         KeywordTokenizer theTokenizer = new KeywordTokenizer(theReader);
         TokenStreamComponents theTokenStreamComponents =
             new TokenStreamComponents(
                 theTokenizer,
                 new LowerCaseFilter(Version.LUCENE_46, theTokenizer));
         return theTokenStreamComponents;
     }
 }

 private static void addDocment(IndexWriter theWriter,
                                String theFieldName,
                                String theValue,
                                boolean storeTokenized)
     throws Exception
 {
     Document theDocument = new Document();
     FieldType theFieldType = new FieldType();
     theFieldType.setStored(true);
     theFieldType.setIndexed(true);
     theFieldType.setTokenized(storeTokenized);
     theDocument.add(new Field(theFieldName, theValue, theFieldType));
     theWriter.addDocument(theDocument);
 }


 static void testLowerCaseKeywordAnalyzer()
     throws Exception
 {
     Version theVersion = Version.LUCENE_46;
     Directory theIndex = new RAMDirectory();

     Analyzer theAnalyzer = new LowerCaseKeywordAnalyzer();

     IndexWriterConfig theConfig = new IndexWriterConfig(theVersion, theAnalyzer);
     IndexWriter theWriter = new IndexWriter(theIndex, theConfig);
     addDocment(theWriter, "sn", "SN345-B21", false);
     addDocment(theWriter, "sn", "SN445-B21", false);
     theWriter.close();

     QueryParser theParser = new QueryParser(theVersion, "sn", theAnalyzer);
     Query theQuery = theParser.parse("sn:sn345-b21");
     IndexReader theIndexReader = DirectoryReader.open(theIndex);
     IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
     TopScoreDocCollector theCollector = TopScoreDocCollector.create(10, true);
     theSearcher.search(theQuery, theCollector);
     ScoreDoc[] theHits = theCollector.topDocs().scoreDocs;
     System.out.println("Number of results found: " + theHits.length);
 }

--
Regards
Milind


--
Regards
Milind










Re: search performance

2014-06-03 Thread Christoph Kaser
Can you take thread stacktraces (repeatedly) during those 5 minute 
searches? That might give you (or someone on the mailing list) a clue 
where all that time is spent.
You could try using jstack for that: 
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html
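
If attaching jstack to the production JVM is awkward, a rough in-process equivalent (sketch) is:

// dump all thread stacks, e.g. from a watchdog thread while a search is slow:
for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
    System.err.println(e.getKey());
    for (StackTraceElement frame : e.getValue())
        System.err.println("\tat " + frame);
}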


Regards
Christoph

On 03.06.2014 at 08:17, Jamie wrote:

Toke

Thanks for the comment.

Unfortunately, in this instance, it is a live production system, so we 
cannot conduct experiments. The number is definitely accurate.


We have many different systems with a similar load that observe the 
same performance issue. To my knowledge, the Lucene integration code 
is fairly well optimized.


I've requested access to the indexes so that we can perform further 
testing.


Regards

Jamie

On 2014/06/03, 8:09 AM, Toke Eskildsen wrote:

On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:

[200GB, 150M documents]


With NRT enabled, search speed is roughly 5 minutes on average.
The server resources are:
2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not
see a hardware upgrade changing that with the fine machine you're using.

What is your search speed if you disable continuous updates?

When you restart the searcher, how long does the first search take?


- Toke Eskildsen, State and University Library, Denmark






Re: IndexReplication Client and IndexWriter

2014-04-16 Thread Christoph Kaser

Hi Shai,

no problem, thank you for letting me know!

Have a nice vacation!

Christoph

On 16.04.2014 at 08:49, Shai Erera wrote:

Hi Christoph,

Apologies for the delayed response, I'm on a holiday vacation. I will take
a look at your issues as soon as I can.

Shai


On Fri, Apr 11, 2014 at 12:02 PM, Christoph Kaser lucene_l...@iconparc.de wrote:


Hello Shai and Mike,

thank you for your answers!

I created LUCENE-5597 for this feature. Unfortunately, I am not sure I
will be able to provide patches: I don't need this feature at the moment
(my interest was more academic) and unfortunately don't have the time to
work on this.

Additionally, I created LUCENE-5599, which provides a patch to fix a small
performance issue I had with the replicator when replicating large indexes.

Regards,
Christoph Kaser



On 08.04.2014 at 12:45, Michael McCandless wrote:


You might be able to use a class on the NRT replication branch
(LUCENE-5438), InfosRefCounts (weird name), whose purpose is to do
what IndexFileDeleter does for IndexWriter, i.e. keep track of which
files are still referenced, delete them when they are done, etc. This
could be used on the client side to hold a lease for another client.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 8, 2014 at 6:26 AM, Shai Erera ser...@gmail.com wrote:


IndexRevision uses the IndexWriter for deleting unused files when the
revision is released, as well as to obtain the SnapshotDeletionPolicy.

I think that you will need to implement two things on the client side:

* Revision, which doesn't use IndexWriter.
* Replicator which keeps track of how many refs a file has (basically what
IndexFileDeleter does)

Then you could set up any node in the middle to be both a client and a
server. It would be interesting to explore that. Would you like to open an
issue? And maybe even try to come up with a patch?

Shai


On Tue, Apr 8, 2014 at 1:05 PM, Michael McCandless luc...@mikemccandless.com wrote:

It's not safe also opening an IndexWriter on the client side.

But I agree, supporting tree topology would make sense; it seems like
we just need a way for the ReplicationClient to also be a Replicator.
It seems like it should be possible, since it's clearly aware of the
SessionToken it's pulled from the original Replicator.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 8, 2014 at 3:42 AM, Christoph Kaser lucene_l...@iconparc.de wrote:

Hi all,

I am trying out the (highly useful) index replicator module (with the
HttpReplicator) and have stumbled upon a question:
It seems the IndexReplicationHandler is working directly on the index
directory, without using an IndexWriter. Could there be a problem if I open
an IndexWriter on the client side?
Usually, this should not be needed, as only the master should be changed.
However, if I want to implement a tree topology, I need an IndexWriter on a
non-leaf client, because the IndexRevision that I need to publish needs one.

Regards,
Christoph









Re: IndexReplication Client and IndexWriter

2014-04-11 Thread Christoph Kaser

Hello Shai and Mike,

thank you for your answers!

I created LUCENE-5597 for this feature. Unfortunately, I am not sure I 
will be able to provide patches: I don't need this feature at the moment 
(my interest was more academic) and unfortunately don't have the time to 
work on this.


Additionally, I created LUCENE-5599, which provides a patch to fix a 
small performance issue I had with the replicator when replicating large 
indexes.


Regards,
Christoph Kaser



On 08.04.2014 at 12:45, Michael McCandless wrote:

You might be able to use a class on the NRT replication branch
(LUCENE-5438), InfosRefCounts (weird name), whose purpose is to do
what IndexFileDeleter does for IndexWriter, i.e. keep track of which
files are still referenced, delete them when they are done, etc. This
could be used on the client side to hold a lease for another client.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 8, 2014 at 6:26 AM, Shai Erera ser...@gmail.com wrote:

IndexRevision uses the IndexWriter for deleting unused files when the
revision is released, as well as to obtain the SnapshotDeletionPolicy.

I think that you will need to implement two things on the client side:

* Revision, which doesn't use IndexWriter.
* Replicator which keeps track of how many refs a file has (basically what
IndexFileDeleter does)

Then you could set up any node in the middle to be both a client and a
server. It would be interesting to explore that. Would you like to open an
issue? And maybe even try to come up with a patch?

Shai


On Tue, Apr 8, 2014 at 1:05 PM, Michael McCandless luc...@mikemccandless.com wrote:


It's not safe also opening an IndexWriter on the client side.

But I agree, supporting tree topology would make sense; it seems like
we just need a way for the ReplicationClient to also be a Replicator.
It seems like it should be possible, since it's clearly aware of the
SessionToken it's pulled from the original Replicator.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 8, 2014 at 3:42 AM, Christoph Kaser lucene_l...@iconparc.de wrote:

Hi all,

I am trying out the (highly useful) index replicator module (with the
HttpReplicator) and have stumbled upon a question:
It seems the IndexReplicationHandler is working directly on the index
directory, without using an IndexWriter. Could there be a problem if I open
an IndexWriter on the client side?
Usually, this should not be needed, as only the master should be changed.
However, if I want to implement a tree topology, I need an IndexWriter on a
non-leaf client, because the IndexRevision that I need to publish needs one.

Regards,
Christoph








 






IndexReplication Client and IndexWriter

2014-04-08 Thread Christoph Kaser

Hi all,

I am trying out the (highly useful) index replicator module (with the 
HttpReplicator) and have stumbled upon a question:
It seems, the IndexReplicationHandler is working directly on the index 
directory, without using an indexwriter. Could there be a problem if I 
open an IndexWriter on the client side?
Usually, this should not be needed, as only the master should be 
changed, however if I want to implement a tree topology, I need an 
IndexWriter on a non-leaf client, because the IndexRevision that I need 
to publish needs one.


Regards,
Christoph







BooleanFilter vs BooleanQuery performance

2013-12-16 Thread Christoph Kaser

Hi all,

from my tests on an index with 22 million entries, it seems that in many 
cases a BooleanFilter is a lot slower than an equivalent BooleanQuery.
Is this the expected behaviour? I would have expected a Filter to be at 
least as fast as a query, since it basically does the same thing, but 
without scoring.


Is there a better alternative to using a BooleanFilter?

Regards
Christoph







Re: BooleanFilter vs BooleanQuery performance

2013-12-16 Thread Christoph Kaser

Hi,

thank you for your explanation!
I used a wrapped BooleanQuery instead; this turned out to be a lot faster.

Christoph

On 16.12.2013 at 15:19, Uwe Schindler wrote:

Hi,

The problem with BooleanFilter is its implementation:
It creates BitSets and ANDs/ORs them together. The BitSets are created because 
you can cache them for later use (the main use-case for filters).
In contrast, a query intersects the DocIdSetIterators directly. The good thing 
about this: if you have queries that match only a few documents, the other 
queries can then advance and skip the doc IDs that do not match. BooleanFilter 
has to get all matching docIds from all filters.

If you want it fast, use BooleanQuery and wrap it with ConstantScoreQuery. Then 
there is also no scoring done (in most cases, older BooleanQuery sometimes 
still calculated the score).
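
I.e., roughly (sketch):

BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("color", "red")), BooleanClause.Occur.MUST);
bq.add(new TermQuery(new Term("shape", "round")), BooleanClause.Occur.MUST);
// wrap so no scores are computed; every hit gets the same constant score:
Query filterLike = new ConstantScoreQuery(bq);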

In general, BooleanFilter and ChainedFilter are in my opinion legacy code from 
older days and should no longer be used (unless you cache Filters and want to 
cache the BooleanFilter, too). This is why they are not part of Lucene's core 
classes.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-Original Message-
From: Christoph Kaser [mailto:lucene_l...@iconparc.de]
Sent: Monday, December 16, 2013 2:30 PM
To: java-user@lucene.apache.org
Subject: BooleanFilter vs BooleanQuery performance

Hi all,

from my tests on an index with 22 million entries, it seems that in many cases
a BooleanFilter is a lot slower than an equivalent BooleanQuery.
Is this the expected behaviour? I would have expected a Filter to be at least
as fast as a query, since it basically does the same thing, but without scoring.

Is there a better alternative to using a BooleanFilter?

Regards
Christoph











Using DocValues with CollationKeyAnalyzer

2012-11-06 Thread Christoph Kaser

Hi all,

for best performance, I use a SortedBytesDocValuesField to sort results. 
I would like to use an ICUCollationKeyAnalyzer for this field, so sorting 
occurs in a natural order.
However, it seems as if the SortedBytesDocValuesField does not use an 
analyzer, but expects a BytesRef which is stored as-is.

Is this correct?

So far, this is the best I could come up with:

com.ibm.icu.text.Collator collator =
    com.ibm.icu.text.Collator.getInstance(new ULocale(column.locale));
collator.setStrength(Collator.SECONDARY);
RawCollationKey key = collator.getRawCollationKey(field_value, null);
BytesRef bytes = new BytesRef(key.bytes, 0, key.size);
SortedBytesDocValuesField sortfield =
    new SortedBytesDocValuesField("sort_field", bytes);


So I don't use the analyzer, but instead simulate its behaviour.
Is there another way, or is SortedBytesDocValuesField meant to be used 
like that?


Best Regards,
Christoph Kaser







Re: ToParentBlockJoinQuery - Faceting on Parent and Child Documents

2012-08-03 Thread Christoph Kaser

Hi Jayendra,

we use faceting and block join queries on Lucene 3.6 like this:

- Create the FacetsCollector
- For faceting on parent documents, use ToParentBlockJoinQuery; for 
faceting on children, ToChildBlockJoinQuery (if needed, add additional 
query clauses using a BooleanQuery)

- Use searcher.search(query, null, facetCollector)

This seems to work fine.

Best regards,
Christoph Kaser

On 03.08.2012 at 13:50, Martijn v Groningen wrote:

Hi Jayendra,

This isn't supported yet. You could implement this by creating a
custom Lucene collector.
This collector could count the unique hits inside a block of docs per
unique facet field value. The
unique facet values could be retrieved from Lucene's FieldCache or doc
values (if you can use Lucene 4.0
in your project).

In general I think this would be a cool addition!

Martijn

On 25 July 2012 13:37, Jayendra Patil jayendra.patil@gmail.com wrote:

Thanks Mike for the wonderful work on ToParentBlockJoinQuery.

We had a use case for Relational data search and are working with
ToParentBlockJoinQuery which works perfectly as mentioned @
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html

However, I couldn't find any examples on the net or even in the JUnit
test cases of using faceting on the parent or the child results.

Is it supported yet? Can you provide us with any examples?










Re: Nested indexing doubt.

2012-06-08 Thread Christoph Kaser

Hi Ananth,

You have to add the child documents before the parent document, 
otherwise the block join query won't work.
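
I.e. (sketch):

ArrayList<Document> docs = new ArrayList<Document>();
docs.add(childDoc1);  // children first ...
docs.add(childDoc2);
docs.add(parentDoc);  // ... parent document LAST in the block
w.addDocuments(docs); // indexed as one contiguous block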


Regards,
Christoph

On 08.06.2012 at 10:18, Ananth V wrote:

Hey guys,
I'm trying to index nested documents in Lucene 3.6. I have
the parent document with 'type' and 'typename' fields and the children
with 'value' and 'author' fields. The snippet below is what I've written
to index them as a block. Is this correct? Is there any working piece of
code I can use? Googling wasn't helpful.

private static void addDoc(IndexWriter w, String type, String typename,
        String value, String author, String value2, String author2) throws IOException {
    ArrayList<Document> docs = new ArrayList<Document>();
    Document doc = new Document();
    doc.add(new Field("type", type, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("typename", typename, Field.Store.YES, Field.Index.ANALYZED));
    docs.add(doc);
    Document doc1 = new Document();
    doc1.add(new Field("value", value, Field.Store.YES, Field.Index.ANALYZED));
    doc1.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
    docs.add(doc1);
    Document doc2 = new Document();
    doc2.add(new Field("value", value2, Field.Store.YES, Field.Index.ANALYZED));
    doc2.add(new Field("author", author2, Field.Store.YES, Field.Index.ANALYZED));
    docs.add(doc2);
    w.addDocuments(docs);
}

Thanks,
Ananth.






Re: ToParentBlockJoinQuery$BlockJoinWeight cannot explain match on parent document

2012-05-29 Thread Christoph Kaser

Hi Martijn,

thank you for your response.

I created the following issue:
https://issues.apache.org/jira/browse/LUCENE-4082

Christoph

Am 25.05.2012 16:25, schrieb Martijn v Groningen:

Hi Christoph,

You can open an issue for this. I think we can use the child score as
an explanation of why a parent doc is scored the way it is.

Martijn

On 25 May 2012 13:20, Christoph Kaser lucene_l...@iconparc.de wrote:

Hello all,

I am trying to calculate score explanations for a query that contains a
ToParentBlockJoinQuery and get the following exception:

java.lang.UnsupportedOperationException:
org.apache.lucene.search.join.ToParentBlockJoinQuery$BlockJoinWeight cannot
explain match on parent document
at
org.apache.lucene.search.join.ToParentBlockJoinQuery$BlockJoinWeight.explain(ToParentBlockJoinQuery.java:222)
at
org.apache.lucene.search.BooleanQuery$BooleanWeight.explain(BooleanQuery.java:236)

I can understand that the ToParentBlockJoinQuery cannot explain the scores
of this document's children, but would it be possible not to throw an
exception but to simply output the score this document got from its
children? This would allow me to analyze the score obtained from other parts
of the complete query, and if needed, I could still get an explanation on
the childquery itself with a specific child id.

Should I open an issue for this, or is it impossible to output any kind of
explanation (even a dummy explanation) in BlockJoinWeight?

Regards,
Christoph









Re: ToParentBlockJoinQuery and grand-children

2012-05-25 Thread Christoph Kaser

Hi Mike,

unfortunately, you were right about only getting the last child's 
grandchildren. Furthermore, the groups have wrong groupValues: they are 
the document ID of the previous parent, not the child. I have opened an 
issue:


https://issues.apache.org/jira/browse/LUCENE-4076

I also created the issue for the missing access to the computed scores:

https://issues.apache.org/jira/browse/LUCENE-4077

Regards,
Christoph



On 24.05.2012 at 18:32, Michael McCandless wrote:

On Thu, May 24, 2012 at 11:48 AM, Christoph Kaser lucene_l...@iconparc.de wrote:


thank you for your response. Unfortunately, I won't be able to try this
today, but I should be able to try it in the next few days. If I find the
bug you described, I will open an issue.

Thanks!


On a somewhat related note, is there a way to get the scores for the parent
documents from the ToParentBlockJoinCollector? I can tell the collector to
track the scores and the max score, but I did not find a way to retrieve
either the parent scores or the max score (of the parent documents).

Hmmm... so ToParentBlockJoinCollector does track the maxScore and
score of each parent hit, but it looks like it never makes those
available in the TopGroups it returns!  Silly.  Can you open a
separate issue for that?  Thanks.

Mike McCandless

http://blog.mikemccandless.com









Re: ToParentBlockJoinQuery and grand-children

2012-05-24 Thread Christoph Kaser

Hello Mike,

thank you for your response. Unfortunately, I won't be able to try this 
today, but I should be able to try it in the next few days. If I find 
the bug you described, I will open an issue.


On a somewhat related note, is there a way to get the scores for the 
parent documents from the ToParentBlockJoinCollector? I can tell the 
collector to track the scores and the max score, but I did not find a 
way to retrieve either the parent scores or the max score (of the 
parent documents).


Christoph Kaser

On 23.05.2012 at 20:10, Michael McCandless wrote:

You do have to call getTopGroups for each grandchild query, and the
order should match the TopGroups you got for the children.

However, looking at the code, I suspect there's a bug... by the
time the collector collects the parent hit, some of the grandchildren
will have been discarded. I suspect you'll only get back
grandchildren for the last child docID under each parent docID's
group. Are you seeing that?

Tricky... can you open an issue?

Mike McCandless

http://blog.mikemccandless.com

On Wed, May 23, 2012 at 12:22 PM, Christoph Kaser lucene_l...@iconparc.de wrote:

Hello,

I would like to use the ToParentBlockJoinQuery and its collector to query a
document with children and grandchildren, but I can't figure out how to get
the document IDs that represent grandchildren.

I know how to build the query and get the parent and child documents:


/* Example code start */
Query grandChildQuery = new TermQuery(new Term("color", "red"));
Filter childFilter = new CachingWrapperFilter(new RawTermFilter(new
Term("type", "child")), DeletesMode.IGNORE);
ToParentBlockJoinQuery grandchildJoinQuery = new
ToParentBlockJoinQuery(grandChildQuery, childFilter, ScoreMode.Max);

BooleanQuery childQuery = new BooleanQuery();
childQuery.add(grandchildJoinQuery, Occur.MUST);
childQuery.add(new TermQuery(new Term("shape", "round")), Occur.MUST);

Filter parentFilter = new CachingWrapperFilter(new RawTermFilter(new
Term("type", "parent")), DeletesMode.IGNORE);
ToParentBlockJoinQuery childJoinQuery = new
ToParentBlockJoinQuery(childQuery, parentFilter, ScoreMode.Max);

parentQuery = new BooleanQuery();
parentQuery.add(childJoinQuery, Occur.MUST);
parentQuery.add(new TermQuery(new Term("name", "test")), Occur.MUST);

ToParentBlockJoinCollector parentCollector = new
ToParentBlockJoinCollector(Sort.RELEVANCE, 30, true, true);
searcher.search(parentQuery, null, parentCollector);
TopGroups<Integer> topGroups = parentCollector.getTopGroups(childJoinQuery,
null, 0, 20, 0, false);

/* Example code end */

Now topGroups contains the parent document IDs and the child document IDs.
But how can I get the grandchild document IDs for a given child document ID?
Do I have to call

TopGroups<Integer> childTopGroups =
parentCollector.getTopGroups(grandchildJoinQuery, null, 0, 20, 0, false);

and match the document IDs by hand? If so, is there a guarantee that they
will be in the same order as I get them in the topGroups, or will I have to
iterate over all childTopGroups until I find the right document ID?

Does anyone have example code for nested joins?

Thanks in advance,
Christoph














ToParentBlockJoinQuery and grand-children

2012-05-23 Thread Christoph Kaser

Hello,

I would like to use the ToParentBlockJoinQuery and its collector to 
query a document with children and grandchildren, but I can't figure 
out how to get the document IDs that represent grandchildren.


I know how to build the query and get the parent and child documents:


/* Example code start */
Query grandChildQuery = new TermQuery(new Term("color", "red"));
Filter childFilter = new CachingWrapperFilter(new RawTermFilter(new 
Term("type", "child")), DeletesMode.IGNORE);
ToParentBlockJoinQuery grandchildJoinQuery = new 
ToParentBlockJoinQuery(grandChildQuery, childFilter, ScoreMode.Max);

BooleanQuery childQuery = new BooleanQuery();
childQuery.add(grandchildJoinQuery, Occur.MUST);
childQuery.add(new TermQuery(new Term("shape", "round")), Occur.MUST);

Filter parentFilter = new CachingWrapperFilter(new RawTermFilter(new 
Term("type", "parent")), DeletesMode.IGNORE);
ToParentBlockJoinQuery childJoinQuery = new 
ToParentBlockJoinQuery(childQuery, parentFilter, ScoreMode.Max);

parentQuery = new BooleanQuery();
parentQuery.add(childJoinQuery, Occur.MUST);
parentQuery.add(new TermQuery(new Term("name", "test")), Occur.MUST);

ToParentBlockJoinCollector parentCollector = new 
ToParentBlockJoinCollector(Sort.RELEVANCE, 30, true, true);
searcher.search(parentQuery, null, parentCollector);
TopGroups<Integer> topGroups = 
parentCollector.getTopGroups(childJoinQuery, null, 0, 20, 0, false);

/* Example code end */

Now topGroups contains the parent document IDs and the child document 
IDs. But how can I get the grandchild document IDs for a given child 
document ID? Do I have to call

TopGroups<Integer> childTopGroups = 
parentCollector.getTopGroups(grandchildJoinQuery, null, 0, 20, 0, false);

and match the document IDs by hand? If so, is there a guarantee that 
they will be in the same order as I get them in the topGroups, or will I 
have to iterate over all childTopGroups until I find the right document ID?


Does anyone have example code for nested joins?

Thanks in advance,
Christoph








Re: Memory question

2012-05-16 Thread Christoph Kaser
Another option to consider is to *decrease* the JVM maximum heap size. 
This in effect leaves more memory for the memory-mapped pages and 
decreases the GC effort, which might increase system performance and 
stability.


Regards,
Christoph

On 15.05.2012 at 21:38, Chris Bamford wrote:

Thanks Uwe.

What I'd like to understand is the implications of this on a server which opens 
a large number of indexes over a long period. Will this non-heap memory 
continue to grow? Will GC be effective at spotting it and releasing it via 
references in the heap?

I had an instance yesterday where a server swapped itself to a standstill and 
had to be restarted. The load average was through the roof and I am trying to 
understand why. One of my recent changes is updating from 2.3 to 3.6, so 
naturally I am keen to know the impact of the mmap stuff which is now standard 
under the covers.

My server caches IndexSearchers and then closes them based on how full the heap 
is getting. My worry is that if the bulk of the memory is being allocated 
outside the JVM, how can I make sensible decisions?

Thanks for any pointers / info.

Chris



-Original Message-
From: u...@thetaphi.de
To: java-user@lucene.apache.org
Sent: Tue, 15 May 2012 18:10
Subject: RE: Memory question



It mmaps the files into virtual memory if it runs on a 64-bit JVM. Because
of that you see the mmapped CFS files. This is outside the Java heap and is all
*virtual*; no RAM is explicitly occupied except the O/S cache.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


-Original Message-
From: Chris Bamford [mailto:chris.bamf...@talktalk.net]
Sent: Tuesday, May 15, 2012 4:47 PM
To: java-user@lucene.apache.org
Subject: Memory question

Hi

Can anyone tell me what happens to the memory when Lucene opens an index?
Is it loaded into the JVM's heap or is it mapped into virtual memory outside of it?
I am running on Linux and if I use pmap on the PID of my JVM, I can see lots of
entries for index cfs files.

Does this mean that indexes are mapped into non-heap memory?  If so, how
can I monitor the space my process is using if I cache open IndexSearchers?

The details are:

Sun 64-bit JVM on Linux.
Lucene 3.6 running in 2.3 compatibility mode (as we are in the process of
a migration to 3.6)

Thanks,

- Chris




Parallel searching use ExecutorService and Collectors

2012-05-08 Thread Christoph Kaser

Hi all,

I want to speed up my searches by using multiple CPU cores for one 
search. I saw that there is a possibility to use multithreaded search by 
passing an ExecutorService to the IndexSearcher:


idxSearcher = new IndexSearcher(reader, 
Executors.newCachedThreadPool());


I call my search using:

idxSearcher.search(query, filter, collector);

However, I found that this has no effect on search speed at all. After 
some digging, I found out that the multithreading apparently does not 
work when calling the search with a custom collector implementation; 
the source code even says:

 // TODO: should we make this
// threaded...?  the Collector could be sync'd?
// always use single thread:

Does anyone know whether there is work currently going on to address 
this issue? If not, is there a workaround that allows me to use multiple 
threads for searching and still use my own collectors?


Best regards,
Christoph Kaser




Re: Document-Ids and Merges

2012-04-05 Thread Christoph Kaser

Thank you both Mike and Shai for your answers.

If anyone has a similar problem:
I ended up using a column that provides my own document IDs, whose 
values I got using the FieldCache.
I then precalculate the indirection per IndexReader and store it in a 
WeakHashMap<IndexReader, float[]> to save the extra lookup.
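
Roughly like this (sketch; "myId" and scoreForId stand in for the app-level ID field and the external score lookup):

private final Map<IndexReader, float[]> perReader = new WeakHashMap<IndexReader, float[]>();

private synchronized float[] valuesFor(IndexReader reader) throws IOException {
    float[] vals = perReader.get(reader);
    if (vals == null) {
        int[] ids = FieldCache.DEFAULT.getInts(reader, "myId"); // app-level ID field
        vals = new float[reader.maxDoc()];
        for (int doc = 0; doc < vals.length; doc++)
            vals[doc] = scoreForId(ids[doc]); // score maintained outside the index
        perReader.put(reader, vals);
    }
    return vals;
}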


Christoph Kaser

On 28.03.2012 at 19:40, Michael McCandless wrote:

On Wed, Mar 28, 2012 at 3:37 AM, Christoph Kaser christoph.ka...@iconparc.de wrote:

Thank you for your answer!

That's too bad. I thought of using my own ID-field, but I wanted to save the
additional indirection (from docId to my ID to my value).
Do document IDs remain constant for one IndexReader as long as it isn't
reopened? If so, I could precalculate the indirection.

Yes, the entire view of the index presented by a single IndexReader is
unchanging (not just docIDs: everything).

On reopen, a new IndexReader is returned, so the old IndexReader is
still unchanged.

So, if you can hold your arrays per-segment, and init them per-segment
(such as FieldCache, or DocValues (only in 4.0) as Shai described)
then you can safely use the docID to index those arrays just within
the context of that segment.

Mike McCandless

http://blog.mikemccandless.com




Re: Document-Ids and Merges

2012-03-28 Thread Christoph Kaser

Hi Shai,

That sounds interesting. However, I am unsure how I can do this. Is 
there a way to store values with a segment? How can I get the segment 
from a document ID?

Here is how my ValueSource looks at the moment:

public class MyScoreValues extends ValueSource {
    float[] values = ...; // float array with reader.maxDoc() entries

    public DocValues getValues(IndexReader reader) throws IOException {
        return new DocValues() {
            public float floatVal(int doc) {
                if (doc < values.length)
                    return values[doc];
                return 1.0f;
            }
        };
    }
}

How would I need to change it to make the arrays segment-based?

Best regards,
Christoph



On 27.03.2012 at 21:16, Shai Erera wrote:

Or ... move to use a per-segment array. Then you don't need to rely on doc
IDs changing. You will need to build the array from the documents that are
in that segment only.

It's like FieldCache in a way. The array is relevant as long as the segment
exists (i.e. not merged away).

Hope this helps.

Shai
On Mar 27, 2012 9:29 AM, Christoph Kaser lucene_l...@iconparc.de wrote:


Hi all,

I have a search application with 16 million documents that uses custom
scores per document using a ValueSource. These values are updated a lot
(and sometimes all at once), so I can't really write them into the index
for performance reasons. Instead, I simply have a huge array of float
values in memory and use the document ID as index in the array.
This works great as long as the index is not changed, but as soon as I
have a few new documents and deletions, index segments are merged (I
suppose) and the document IDs of existing documents change. Is there any
way to be informed when document IDs of existing documents change? If so,
is there a way to calculate the new document ID from the old one, so I can
convert my array to the new document IDs?

Any help would be greatly appreciated!

Best regards,
Christoph

java-user-help@lucene.apache.**orgjava-user-h...@lucene.apache.org












Re: Document-Ids and Merges

2012-03-28 Thread Christoph Kaser

Thank you for your answer!

That's too bad. I thought of using my own ID-field, but I wanted to save 
the additional indirection (from docId to my ID to my value).
Do document IDs remain constant for one IndexReader as long as it isn't 
reopened? If so, I could precalculate the indirection.


Best regards,
Christoph

Am 27.03.2012 18:15, schrieb Michael McCandless:

In general how Lucene assigns docIDs is a volatile implementation
detail: it's free to change from release to release.

Eg, the default merge policy (TieredMergePolicy) merges out-of-order
segments.  Another eg: at one point, IndexSearcher re-ordered the
segments on init.  Another: because ConcurrentMergeScheduler runs
different merges in different threads, they can finish in different
orders and thus alter how subsequent merges are selected.

Really it's best if you assign your own (app-level) ID field and use
that, if you need a stable ID.
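
For instance, with a stable ID stored in an indexed field, the indirection
can be rebuilt per reader through FieldCache - a sketch (the field name
"myId" and the surrounding class are illustrative; it assumes the ID was
indexed as a plain integer string that the default parser understands):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class StableIdScores {

    private final Map<Integer, Float> scoresById; // stable app-level ID -> current score

    public StableIdScores(Map<Integer, Float> scoresById) {
        this.scoresById = scoresById;
    }

    // Resolve a (possibly reassigned) docID to its score via the stable ID.
    public float score(IndexReader reader, int doc) throws IOException {
        // FieldCache builds and caches the int[] once per reader.
        int[] appIds = FieldCache.DEFAULT.getInts(reader, "myId");
        Float score = scoresById.get(appIds[doc]);
        return score != null ? score : 1.0f;
    }
}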

Mike McCandless

http://blog.mikemccandless.com

On Tue, Mar 27, 2012 at 3:29 AM, Christoph Kaser
lucene_l...@iconparc.de  wrote:

Hi all,

I have a search application with 16 million documents that uses custom
scores per document using a ValueSource. These values are updated a lot (and
sometimes all at once), so I can't really write them into the index for
performance reasons. Instead, I simply have a huge array of float values in
memory and use the document ID as index in the array.
This works great as long as the index is not changed, but as soon as I have
a few new documents and deletions, index segments are merged (I suppose) and
the document IDs of existing documents change. Is there any way to be
informed when document IDs of existing documents change? If so, is there a
way to calculate the new document ID from the old one, so I can convert my
array to the new document IDs?

Any help would be greatly appreciated!

Best regards,
Christoph

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





--
Dipl.-Inf. Christoph Kaser

IconParc GmbH
Sophienstrasse 1
80333 München

www.iconparc.de

Tel +49 -89- 15 90 06 - 21
Fax +49 -89- 15 90 06 - 49

Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB
121830, Amtsgericht München




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Document-Ids and Merges

2012-03-27 Thread Christoph Kaser

Hi all,

I have a search application with 16 million documents that uses custom 
scores per document using a ValueSource. These values are updated a lot 
(and sometimes all at once), so I can't really write them into the index 
for performance reasons. Instead, I simply have a huge array of float 
values in memory and use the document ID as index in the array.
This works great as long as the index is not changed, but as soon as I 
have a few new documents and deletions, index segments are merged (I 
suppose) and the document IDs of existing documents change. Is there any 
way to be informed when document IDs of existing documents change? If 
so, is there a way to calculate the new document ID from the old one, so 
I can convert my array to the new document IDs?


Any help would be greatly appreciated!

Best regards,
Christoph

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Indexing product keys with and without spaces in them

2012-01-03 Thread Christoph Kaser

Hello,

we use lucene as a search engine in an online shop. The products in this 
shop often contain product keys like CRXUSB2.0-16GB.
We would like our customers to be able to find products by entering 
their key. The problem is that product keys sometimes contain spaces or 
dashes and customers sometimes don't enter these whitespaces correctly. 
On the other hand, some customers enter whitespaces where there are 
none. Is there an analyzer or some other method that allows us to find 
the product if the user enters things like:

- CRX USB2.0 16GB
- CRXUSB2.016GB
- CRX USB-2.0 16GB
...

The problem is that the product keys don't all have a common format and 
are contained in the normal text, so we don't have an easy way to treat 
them differently from the rest of the text.


Any help would be great!

Best regards,
Christoph


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing product keys with and without spaces in them

2012-01-03 Thread Christoph Kaser

Hi Aditya,

Thank you for your suggestion!
Unfortunately, this is not possible, as there is no common format for 
all product keys. The products are not ours nor are they all from the 
same manufacturer, so we don't have any influence on what the product 
keys look like.


Regards,
Christoph

On 03.01.2012 14:10, findbestopensource wrote:

Hi Christoph

My opinion is, you should not normalize or do any modification to the
product keys. This should be unique. Should be used as it is. Instead of
spaces you should have only used '-', but since the product is already out in
the market, it cannot help.

In your UI, you could provide multiple text boxes where the user will fill in
the respective chars. You could add a space or '-' before passing the key to
Lucene.

Regards
Aditya
www.findbestopensource.com - Finds best open source across all platforms.


On Tue, Jan 3, 2012 at 2:14 PM, Christoph Kaser <lucene_l...@iconparc.de> wrote:


Hello,

we use lucene as search engine in an online shop. The products in this
shop often contain product keys like CRXUSB2.0-16GB.
We would like our customers to be able to find products by entering their
key. The problem is that product keys sometimes contain spaces or dashes
and customers sometimes don't enter these whitespaces correctly. On the
other hand, some customers enter whitespaces where there are none. Is there
an analyzer or some other method that allows us to find the product if the
user enters things like:
- CRX USB2.0 16GB
- CRXUSB2.016GB
- CRX USB-2.0 16GB
...

The problem is that the product keys don't all have a common format and
are contained in the normal text, so we don't have an easy way to treat
them different to the rest of the text.

Any help would be great!

Best regards,
Christoph


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





--
Dipl.-Inf. Christoph Kaser

IconParc GmbH
Sophienstrasse 1
80333 München

www.iconparc.de

Tel +49 -89- 15 90 06 - 21
Fax +49 -89- 15 90 06 - 49

Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB
121830, Amtsgericht München




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing product keys with and without spaces in them

2012-01-03 Thread Christoph Kaser

Hi Ian,

thank you for your reply.

Unfortunately this will be hard, as we have no way of knowing at which 
position the user might enter spaces, so we cannot expand the product 
keys at indexing time.


The other way round (triplets without spaces or hyphens) might work, 
however we have no real way of knowing whether product keys really are 
triplets, so we also have to make doublets and quadruplets (and maybe 
even quintuplets). So for every two, three, four, five consecutive 
tokens in our index, we would have to include the concatenated version. 
If we treat the user input the same way, we should be able to find type 
identifiers regardless of their spelling.
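
One way to sketch this is ShingleFilter from the analyzers module with an
empty token separator, so every run of up to five consecutive tokens is
also emitted as one concatenated token (the analyzer below is illustrative
only, Lucene 3.x API assumed):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class ConcatenatingAnalyzer extends Analyzer {

    private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_34);

    public TokenStream tokenStream(String field, Reader reader) {
        // "CRX", "USB2.0", "16GB" also yields ... "CRXUSB2.016GB"
        ShingleFilter shingles =
                new ShingleFilter(delegate.tokenStream(field, reader), 5);
        shingles.setTokenSeparator("");   // join shingle parts without a space
        shingles.setOutputUnigrams(true); // keep the single tokens as well
        return shingles;
    }
}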


However, this would dramatically increase the index size and lead to 
false positives in situations where other words are concatenated and happen 
to form a different word.


Has somebody ever tried something like this? Is there a way to do this 
without increasing the index to about 15 times (1+2+3+4+5) its original 
size?


Christoph


Am 03.01.2012 11:06, schrieb Ian Lea:

When indexing you could normalise them down to a standard format
without spaces or hyphens, but searching is much harder if you really
can't identify possible product ids within user queries.  Make
triplets without spaces or hyphens?  CRX USB-2.0 16GB ==>
CRXUSB2.016GB but also some random words ==> somerandomwords.  The
latter wouldn't match, the former would if it was a valid id.

Some form of synonym analysis/injection at indexing would be better if
you could do that: CRXUSB2.016GB ==> CRX USB2.0 16GB, to be indexed
as well as the base value.

If you can't either have a dedicated product id search field or
standardise the product ids, this is going to be hard.


--
Ian,


On Tue, Jan 3, 2012 at 8:44 AM, Christoph Kaser <lucene_l...@iconparc.de> wrote:

Hello,

we use lucene as search engine in an online shop. The products in this shop
often contain product keys like CRXUSB2.0-16GB.
We would like our customers to be able to find products by entering their
key. The problem is that product keys sometimes contain spaces or dashes and
customers sometimes don't enter these whitespaces correctly. On the other
hand, some customers enter whitespaces where there are none. Is there an
analyzer or some other method that allows us to find the product if the user
enters things like:
- CRX USB2.0 16GB
- CRXUSB2.016GB
- CRX USB-2.0 16GB
...

The problem is that the product keys don't all have a common format and are
contained in the normal text, so we don't have an easy way to treat them
different to the rest of the text.

Any help would be great!

Best regards,
Christoph


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





--
Dipl.-Inf. Christoph Kaser

IconParc GmbH
Sophienstrasse 1
80333 München

www.iconparc.de

Tel +49 -89- 15 90 06 - 21
Fax +49 -89- 15 90 06 - 49

Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB
121830, Amtsgericht München




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing product keys with and without spaces in them

2012-01-03 Thread Christoph Kaser
Unfortunately, we don't have a designated field for product identifiers, 
and the product identifiers are from various manufacturers. So it is 
hard to normalize product keys, as we can't distinguish them from other 
parts of the document.
Examples are xbox 360 (which might be searched as xbox360) or ipod 
(which might be searched as i-pod).


We have about 100,000 products, so we might go with in-memory prefix matching.

I will give it a try!

Regards,
Christoph

Am 03.01.2012 15:48, schrieb Ian Lea:

My suggestion wasn't to store/index the triplets, just a normalized
version of the product key.  So if you had

id: CRXUSB2.0-16GB
desc: some 16GB USB thing

you'd index, in your searchable words field, CRXUSB2.016GB some 16GB USB thing

And then at search time you'd take CRX USB2.0-16G and normalize that
to CRXUSB2.016GB and get a hit.  If they entered CRX USB2.0 16G usb
thing you'd also make usb2.016gusb and 16gusbthing, and
CRXUSB2.016Gusbthing if you're going to quintuplets. So your queries
would be more complex, but the index wouldn't be larger.  You'd need
to make the combinations optional rather than required of course, and
a long query would generate lots of combinations, but since most won't
match the search would likely still be fast.
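
A rough sketch of that query-side expansion (the field name "words" and
the cap of five consecutive words are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class QueryExpander {

    // Adds each word, plus every concatenation of up to 5 consecutive
    // words, as an optional (SHOULD) clause. Long queries generate many
    // clauses, so mind BooleanQuery's clause limit.
    public static BooleanQuery expand(String[] words) {
        BooleanQuery query = new BooleanQuery();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < words.length && end - start < 5; end++) {
                sb.append(words[end]);
                query.add(new TermQuery(new Term("words", sb.toString().toLowerCase())),
                        BooleanClause.Occur.SHOULD);
            }
        }
        return query;
    }
}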

I'm not really saying this is a good idea, just something that might
work.  Personally I'd go and shout at whoever makes up the product ids
and replace them all with something simple and consistent.  Like
numbers.


Roughly how many products do you have?  If not a massive number you
could try some in memory prefix matching along the lines of

for each word in query
  if is product id
    OK, got product id
  if possible prefix
    add next word
    is product id?
      OK, got product id
    is still possible prefix?
      add next word
    etc.

Might even be able to do it with lucene term and prefix queries on a
normalized product id field.


--
Ian.
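
A Java sketch of that prefix-matching loop (all names illustrative; it
assumes the set of normalized product ids, lowercased with spaces and
hyphens removed, fits in memory):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProductIdMatcher {

    private final Set<String> ids;      // normalized product ids
    private final Set<String> prefixes; // every prefix of every id

    public ProductIdMatcher(Collection<String> normalizedIds) {
        ids = new HashSet<String>(normalizedIds);
        prefixes = new HashSet<String>();
        for (String id : normalizedIds)
            for (int i = 1; i <= id.length(); i++)
                prefixes.add(id.substring(0, i));
    }

    // Concatenates consecutive query words while the concatenation is
    // still a prefix of some known id; collects every exact id hit.
    public List<String> findIds(String[] words) {
        List<String> found = new ArrayList<String>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                sb.append(words[end]);
                String candidate = sb.toString();
                if (ids.contains(candidate))
                    found.add(candidate);
                if (!prefixes.contains(candidate))
                    break; // no id starts with this, stop extending
            }
        }
        return found;
    }
}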


On Tue, Jan 3, 2012 at 2:09 PM, Uwe Schindler <u...@thetaphi.de> wrote:

Hi,


Has somebody ever tried something like this? Is there a way to do this
without increasing the index to about 15 times (1+2+3+4+5) its original size?

The index will not be 15 times the size, as it is an inverted index and only
indexes the unique tokens. In most cases it will be approximately double the
size. Just try it out; it depends on your data!

Uwe


Christoph


Am 03.01.2012 11:06, schrieb Ian Lea:

When indexing you could normalise them down to a standard format
without spaces or hyphens, but searching is much harder if you really
can't identify possible product ids within user queries.  Make
triplets without spaces or hyphens?  CRX USB-2.0 16GB ==>
CRXUSB2.016GB but also some random words ==> somerandomwords.  The
latter wouldn't match, the former would if it was a valid id.

Some form of synonym analysis/injection at indexing would be better if
you could do that: CRXUSB2.016GB ==> CRX USB2.0 16GB, to be indexed
as well as the base value.

If you can't either have a dedicated product id search field or
standardise the product ids, this is going to be hard.


--
Ian,


On Tue, Jan 3, 2012 at 8:44 AM, Christoph Kaser <lucene_l...@iconparc.de>
wrote:

Hello,

we use lucene as search engine in an online shop. The products in
this shop often contain product keys like CRXUSB2.0-16GB.
We would like our customers to be able to find products by entering
their key. The problem is that product keys sometimes contain spaces
or dashes and customers sometimes don't enter these whitespaces
correctly. On the other hand, some customers enter whitespaces where
there are none. Is there an analyzer or some other method that allows
us to find the product if the user enters things like:
- CRX USB2.0 16GB
- CRXUSB2.016GB
- CRX USB-2.0 16GB
...

The problem is that the product keys don't all have a common format
and are contained in the normal text, so we don't have an easy way to
treat them different to the rest of the text.

Any help would be great!

Best regards,
Christoph


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org








-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Numeric field min max values

2011-11-08 Thread Christoph Kaser

Hi Chris,

Here is some code we use to obtain the int values from the TermEnum:

HashSet<Integer> ints = new HashSet<Integer>();
TermEnum te = reader.terms(new Term(fieldName, ""));
do {
    Term t = te.term();
    if (t == null || !t.field().equals(fieldName))
        break; // ran past the end of this field's terms
    String val = t.text();

    // See the FieldCache implementation: NumericFields add some
    // values that are only needed for range querying
    final int shift = val.charAt(0) - NumericUtils.SHIFT_START_INT;
    if (shift > 0 && shift <= 31)
        break; // lower-precision trie terms sort after the full-precision ones

    ints.add(NumericUtils.prefixCodedToInt(val));
} while (te.next());

Hope that helps,

Christoph Kaser

Am 07.11.2011 21:07, schrieb Uwe Schindler:

This is caused by lower-precision terms used by NumericField to allow fast 
NumericRangeQuery. You have to filter those values by looking at the first few 
bits, which contain the precision.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-Original Message-
From: Christian Reuschling [mailto:christian.reuschl...@gmail.com]
Sent: Monday, November 07, 2011 8:17 PM
To: java-user@lucene.apache.org
Subject: Re: Numeric field min max values

hm - I recognized that when I iterate with TermEnum and decode the values
with prefixCodedToInt(..), I get correct values, but I also get values
that are not field values of this field anywhere in the entire index.
E.g. in the number-encoded field with the timestamps I also get a '0'
as term - but all documents have a correct timestamp.
I also recognized that Luke shows the same values, even when the correct
decoder is selected. Luke also gives the opportunity to 'browse term docs', and
says that every document is a '0'-term document.

Has anyone an idea?

best

Chris

2011/11/3 Christian Reuschling <christian.reuschl...@gmail.com>:

Thank you very much! This exactly solves my problem


2011/11/3 Ian Lea <ian@gmail.com>:

I can't answer most of the questions, but oal.util.NumericUtils has
prefixCodedToInt (Long, etc) methods that will convert the encoded
value (what you are seeing, I presume) to int or long or whatever.
Maybe that will help.


--
Ian.


On Wed, Nov 2, 2011 at 7:19 PM, Christian Reuschling
christian.reuschl...@gmail.com  wrote:

Hi,

maybe it is an easy question - I searched over the lucene-user
archive, but sadly didn't found an answer :(

I currently change our field logic from string- to numeric fields.
Until now, I managed to find the min-max values of a field by
iterating over the field with a TermEnum (termEnum =
reader.terms(new Term(strFieldName, ""));).

Now, in the case of a numeric field, I get some strange field values
as $)A M` - I guess this could be a low-precision token from the
field trie?

Is there a special way to iterate over numeric field values? Or is
there a possibility to get the trie and ask him for the min-max
values? Or another (util)-class?

Thanks for all answers!

Chris


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





--
Dipl.-Inf. Christoph Kaser

IconParc GmbH
Sophienstrasse 1
80333 München

www.iconparc.de

Tel +49 -89- 15 90 06 - 21
Fax +49 -89- 15 90 06 - 49

Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB
121830, Amtsgericht München




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Merging several taxonomy indexes for faceted search

2011-10-24 Thread Christoph Kaser

Hi Shai,

thank you very much for pointing me to the TaxonomyMergeUtils class, it 
does exactly what I need.


I had only included the maven artifact for facet support and therefore 
did not see the (very helpful) examples package before.


Best regards,
Christoph

Am 19.10.2011 21:02, schrieb Shai Erera:

Hi Christoph,

You can certainly do that, and there are a bunch of APIs that will help you
do that. We have a very high-level utility called TaxonomyMergeUtils, which
offers a bunch of merge() methods, each taking more parameters. Perhaps
start with the simplest one (the one taking 4 directories) and see if that
works for you. If you discover that you need more granular control over the
merge process, let me know and I can point you at more lower-level API. In
fact, TaxonomyMergeUtils is part of the examples and you should get access
to its source code, so you can see the steps you need to follow (at the
lower level) in its more detailed merge() method.
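
For reference, calling the simplest variant might look like this (a sketch
only: the paths are illustrative and the exact parameter order is best
checked against the examples source):

import java.io.File;

import org.apache.lucene.facet.example.merge.TaxonomyMergeUtils; // examples package
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergePartIntoBigIndex {
    public static void main(String[] args) throws Exception {
        Directory srcIndexDir  = FSDirectory.open(new File("part1/index"));
        Directory srcTaxoDir   = FSDirectory.open(new File("part1/taxo"));
        Directory destIndexDir = FSDirectory.open(new File("big/index"));
        Directory destTaxoDir  = FSDirectory.open(new File("big/taxo"));
        // merge one part's index + taxonomy pair into the destination pair
        TaxonomyMergeUtils.merge(srcIndexDir, srcTaxoDir, destIndexDir, destTaxoDir);
    }
}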

Shai

On Wed, Oct 19, 2011 at 9:54 AM, Christoph Kaser <lucene_l...@iconparc.de> wrote:


Hi all,

I am planning to change my existing lucene index to use the new facets
introduced in lucene 3.4.0.

Unfortunately, I could not find an answer to my question in the
documentation:

I create a relatively large index of 8 million books by dividing it into
several smaller groups of documents, creating indices for them, and then
joining them all together to one big index using IndexWriter.addIndexes.

This allows the work to be split among several threads or even computers.

I now would like to add faceted search capabilities to my index to allow
grouping by author and publisher, but I have the following problem:
Can I merge/add/join several taxonomy indexes as created by
LuceneTaxonomyWriter, and if so, how would I do that?

Thanks in advance for any help!

Best Regards,
Christoph





--
Dipl.-Inf. Christoph Kaser

IconParc GmbH
Sophienstrasse 1
80333 München

www.iconparc.de

Tel +49 -89- 15 90 06 - 21
Fax +49 -89- 15 90 06 - 49

Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB
121830, Amtsgericht München




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Merging several taxonomy indexes for faceted search

2011-10-19 Thread Christoph Kaser

Hi all,

I am planning to change my existing lucene index to use the new facets 
introduced in lucene 3.4.0.


Unfortunately, I could not find an answer to my question in the 
documentation:


I create a relatively large index of 8 million books by dividing it into 
several smaller groups of documents, creating indices for them, and then 
joining them all together to one big index using IndexWriter.addIndexes.


This allows the work to be split among several threads or even computers.
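
A sketch of that join step (Lucene 3.x API; paths and analyzer choice are
illustrative):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class JoinPartIndexes {
    public static void main(String[] args) throws Exception {
        Directory big = FSDirectory.open(new File("big-index"));
        IndexWriter writer = new IndexWriter(big, new IndexWriterConfig(
                Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34)));
        // each part index was built independently, possibly on another machine
        writer.addIndexes(FSDirectory.open(new File("part1")),
                          FSDirectory.open(new File("part2")));
        writer.close();
    }
}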

I now would like to add faceted search capabilities to my index to allow 
grouping by author and publisher, but I have the following problem:
Can I merge/add/join several taxonomy indexes as created by 
LuceneTaxonomyWriter, and if so, how would I do that?


Thanks in advance for any help!

Best Regards,
Christoph


