Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-08 Thread Simon Willnauer
Thank you ryan for pushing on this, being persistent and getting the vote out. On Tue, Sep 8, 2020 at 5:55 PM Ryan Ernst wrote: > This vote is now closed. The results are as follows: > > Binding Results > A1: 12 (55%) > D: 6 (27%) > A2: 4 (18%) > > All Results > A1: 16 (55%) > D: 7 (

[ANNOUNCE] Apache Lucene 4.7.0 released.

2014-02-26 Thread Simon Willnauer
February 2014, Apache Lucene™ 4.7 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.7 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text

[ANNOUNCE] Apache Lucene 4.6 released

2013-11-24 Thread Simon Willnauer
November 2013, Apache Lucene™ 4.6 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.6 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text

Re: Omitting term frequencies while preserving positions

2013-08-05 Thread Simon Willnauer
the reason why you can't omit it today is that $num_position == $term_frequency ie. we need to store it anyways. Yet, I kind of agree that this is an impl detail so we could in theory return 1 as the TF from the DocsAndPosEnum but this would break our APIs as well since DocsAndPositionsEnum require

Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

2013-08-01 Thread Simon Willnauer
one thing I wonder is if you could just publish your benchmark code? simon On Thu, Aug 1, 2013 at 7:45 PM, Michael McCandless wrote: > On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng > wrote: >> >> Hi Mike, >> >> I retested and results are the same: >> >> 1/ I did not use sort (so FieldCache sh

Re: MemoryIndex in Lucene 4.x

2013-07-15 Thread Simon Willnauer
hey, can you share your benchmark and/or tell us a little more about what your data looks like and how you analyze the data. There might be analysis changes that contribute to that? simon On Sun, Jul 14, 2013 at 7:56 PM, cischmidt77 wrote: > I use Lucene/MemoryIndex for a large number of quer

Re: ERROR help me please ,org.apache.lucene.search.IndexSearcher.(Ljava/lang/String;)V

2013-05-17 Thread Simon Willnauer
Well IndexSearcher doesn't have a constructor that accepts a string, maybe you should pass in an indexreader instead? simon On Fri, May 17, 2013 at 3:11 PM, fifi wrote: > please,how I can solve this error? > > Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.lucene.search.Ind

Re: Deadlock in DocumentsWriterFlushControl

2013-05-15 Thread Simon Willnauer
This seems like a bug caused due to the fact that we moved the CFS building into DWPT. Can you open an issue for this? simon On Wed, May 15, 2013 at 5:50 PM, Sergiusz Urbaniak wrote: > Hi all, > > We have an obvious deadlock between a "MaybeRefreshIndexJob" thread > calling ReferenceManager.mayb

Re: lucene and mongodb

2013-05-15 Thread Simon Willnauer
there is also elasticsearch (elasticsearch.org) built on top of lucene that might feel more natural if you come from mongo simon On Wed, May 15, 2013 at 11:38 AM, Rider Carrion Cleger wrote: > Thanks you Hendrik, > I'm new with Apache Lucene, the problem that arises is like starting with > lucen

[ANNOUNCE] Apache Lucene 4.3 released

2013-05-06 Thread Simon Willnauer
May 2013, Apache Lucene™ 4.3 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.3 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text searc

Re: How to use TokenStream build two fields

2013-04-23 Thread Simon Willnauer
hey there, I think your english is perfectly fine! Given the info you provided it's very hard to answer your question... I can't look into org.wltea.analyzer.core.AnalyzeContext.fillBuffer(AnalyzeContext.java:124) but apparently there is a nullpointer happening here. maybe you can track that down

Re: Token Stream with Offsets (Token Sources class)

2013-04-07 Thread Simon Willnauer
hey, first, please don't crosspost! second, can you provide more infos like that part where you index the data. maybe something that is selfcontained? simon On Mon, Apr 8, 2013 at 1:16 AM, vempap wrote: > Hi, > > I've the following snippet code where I'm trying to extract weighted span > term

Re: "4.1 consuming more memory than 3.0.2 while Indexing"

2013-04-01 Thread Simon Willnauer
can you provide some information how much ram you are setting on the index writer config? also how many threads are you using for indexing? simon On Mon, Apr 1, 2013 at 2:21 PM, Arun Kumar K wrote: > Hi Adrien, > > I have seen memory usage using linux command top for RES memory & i have > used

Re: Get BitSet from Filter object in 4.1

2013-03-26 Thread Simon Willnauer
You can do Filter#getDocIdSet(reader, acceptedDocs).bits() yet, this method might return null if the filter can not be represented as bits or for other reasons like performance. simon On Tue, Mar 26, 2013 at 10:37 AM, Ramprakash Ramamoorthy wrote: > Team, > > We are migrating from 2.3

Re: Compression and Highlighter

2013-03-26 Thread Simon Willnauer
ginal document. > ____ > From: Simon Willnauer [simon.willna...@gmail.com] > Sent: Monday, March 25, 2013 4:07 AM > To: java-user@lucene.apache.org > Subject: Re: Compression and Highlighter > > On Mon, Mar 25, 2013 at 8:13 AM, Bushman, Lamont wro

Re: Assert / NPE using MultiFieldQueryParser

2013-03-25 Thread Simon Willnauer
it > describes my entire professional life > > "Bobby Tables" is another (http://xkcd.com/327/). > > There, I've done my bit to stop productivity today! > > Erick > > > On Mon, Mar 25, 2013 at 2:08 PM, Simon Willnauer > wrote: >> >> ad

Re: Assert / NPE using MultiFieldQueryParser

2013-03-25 Thread Simon Willnauer
r > the record: https://issues.apache.org/jira/browse/LUCENE-4878 > > Adam > > -Original Message- > From: Simon Willnauer [mailto:simon.willna...@gmail.com] > Sent: Sunday, March 24, 2013 9:28 AM > To: java-user@lucene.apache.org > Subject: Re: Assert / NPE using Mult

Re: Compression and Highlighter

2013-03-25 Thread Simon Willnauer
On Mon, Mar 25, 2013 at 8:13 AM, Bushman, Lamont wrote: > I have a project where I need to index documents using Lucene 4.1.0. One > of the fields for the stored Document is the actual text from the > document(.pdf, .docx, etc.) I want to be able to highlight text from the > documents in

Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Simon Willnauer
On Mon, Mar 25, 2013 at 4:16 AM, Steve Rowe wrote: > The wiki at http://wiki.apache.org/lucene-java/ has come under attack by > spammers more frequently of late, so the PMC has decided to lock it down in > an attempt to reduce the work involved in tracking and removing spam. > > From now on, onl

Re: Using AnalyzingSuggester with stopwords

2013-03-24 Thread Simon Willnauer
Alex, did you try to get it working with a single term like adding "the foobar" and then drawing suggestions for "the foo" ? simon On Sun, Mar 24, 2013 at 8:51 PM, Alexander Reelsen wrote: > Hey there, > > I am trying to get up some working example with the AnalyzingSuggester and > stopwords - l

Re: Assert / NPE using MultiFieldQueryParser

2013-03-24 Thread Simon Willnauer
Hey, this is in-fact a bug in the MultiFieldQueryParser, can you open a ticket for this please in our bugtracker? MultifieldQueryParser should override getRegexpQuery but it doesn't simon On Sun, Mar 24, 2013 at 3:57 PM, Adam Rauch wrote: > I'm using MultiFieldQueryParser to parse search queri

Re: Field.Index deprecation ?

2013-03-23 Thread Simon Willnauer
>> > -Original Message- >> > From: Michael McCandless [mailto:luc...@mikemccandless.com] >> > Sent: Friday, March 22, 2013 9:41 PM >> > To: java-user@lucene.apache.org; simon.willna...@gmail.com >> > Subject: Re: Field.Index deprecation ? >>

Re: Field.Index deprecation ?

2013-03-22 Thread Simon Willnauer
On Fri, Mar 22, 2013 at 5:28 PM, Michael McCandless wrote: > We badly need Lucene in Action 3rd edition! go mike go!!! ;) > > The easiest approach is to use one of the new XXXField classes under > oal.document, eg StringField for your example. > > If none of the existing XXXFields "fit", you can

Re: Lucene reliability as primary store

2013-03-22 Thread Simon Willnauer
On Fri, Mar 22, 2013 at 2:00 PM, Pablo Guerrero wrote: > Hi all, > > I'm evaluating using Lucene for some data that would not be stored anywhere > else, and I'm concerned about reliabilty. Having a database storing the > data in addition to Lucene would be a problem, and I want to know if Lucene >

Re: Segment file clean-up and codecs

2013-03-22 Thread Simon Willnauer
can you send this to d...@lucene.apache.org? simon On Fri, Mar 22, 2013 at 7:52 PM, Ravikumar Govindarajan wrote: > Most of us, writing custom codec use segment-name as a handle and push data > to a different storage > > Would it be possible to get a hook in the codec APIs, when obsolete segment

Re: question about document-frequency in score

2013-03-22 Thread Simon Willnauer
all statistics in lucene are per field so is document frequency simon On Fri, Mar 22, 2013 at 10:48 AM, Nicole Lacoste wrote: > Hi > > I am trying to figure out if the document-frequency of a term used in > calculating the score. Is it per field? Or is independent of the field? > > Thanks > >
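The point above can be illustrated with a small toy model. This is only a sketch in plain Java, not Lucene code: the class name `PerFieldDocFreq` and its methods are hypothetical, and tokenization is a naive whitespace split. It shows that the same text in two different fields is counted as two separate terms, each with its own document frequency.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only, not Lucene code: document frequency is kept per field,
// so "taxi" in "title" and "taxi" in "body" are counted independently.
class PerFieldDocFreq {
    // field -> token -> ids of the documents that contain the token in that field
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    void add(int docId, String field, String text) {
        Map<String, Set<Integer>> byToken =
            postings.computeIfAbsent(field, f -> new HashMap<>());
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            byToken.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
        }
    }

    // Number of documents that contain the token in the given field.
    int docFreq(String field, String token) {
        Map<String, Set<Integer>> byToken = postings.get(field);
        if (byToken == null) {
            return 0;
        }
        Set<Integer> docs = byToken.get(token);
        return docs == null ? 0 : docs.size();
    }
}
```

With two documents whose titles contain "taxi" and only one whose body does, `docFreq("title", "taxi")` and `docFreq("body", "taxi")` give different counts, which is why a score computed for one field is unaffected by how common the word is in another.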

Re: Lucene slow performance -- still broke

2013-03-20 Thread Simon Willnauer
ormance > > Please forceMerge only one time not every time (only to clean up your index)! > If you are doing a reindex already, just fix your close logic as discussed > before. > > > > Scott Smith schrieb: > >>Unfortunately, this is a production system which I can't touch

Re: Overall doc-count in TermStats, during flush...

2013-03-20 Thread Simon Willnauer
The BitSet basically counts how many documents have one or more values in this field. Some docs might not have values in this field. state.segmentInfo.getDocCount() is the # of docs in this segment but we are flushing a single field here. We pass down the cardinality here since we keep the statist
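The difference between the two counts can be sketched with `java.util.BitSet`. This is a toy illustration, not the flush code being discussed: `FieldDocCount` and `docsWithField` are hypothetical names, and the list of doc IDs stands in for whatever the indexing chain records per field.

```java
import java.util.BitSet;

// Toy model, not Lucene internals: one bit per document in the segment,
// set when the document has at least one value in the field being flushed.
class FieldDocCount {
    static int docsWithField(int segmentDocCount, int[] docIdsWithValues) {
        BitSet seen = new BitSet(segmentDocCount);
        for (int docId : docIdsWithValues) {
            seen.set(docId); // a document with several values still sets a single bit
        }
        // The cardinality is the per-field doc count; it can be lower than
        // segmentDocCount when some documents have no value in this field.
        return seen.cardinality();
    }
}
```

For a segment of five documents where documents 0, 2 and 4 carry values (document 2 carrying two of them), the per-field count is three while the segment count stays five.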

Re: Lucene slow performance

2013-03-15 Thread Simon Willnauer
On Sat, Mar 16, 2013 at 12:02 AM, Scott Smith wrote: > " Do you always close IndexWriter after adding few documents and when > closing, disable "wait for merge"? In that case, all merges are interrupted > and the merge policy never has a chance to merge at all (because you are > opening and clo

Re: Lucene slow performance

2013-03-15 Thread Simon Willnauer
Can you tell us a little more about how you use lucene, how do you index, do you use NRT or do you open an IndexReader for every request, do you maybe use a custom merge policy or something like this, any special IndexWriter settings? On Fri, Mar 15, 2013 at 11:15 PM, Scott Smith wrote: > We have a

Re: Concurrent indexing performance problem

2013-03-07 Thread Simon Willnauer
On Thu, Mar 7, 2013 at 6:44 PM, Michael McCandless wrote: > This sounds reasonable (500 M docs / 50 GB index), though you'll need > to test resulting search perf for what you want to do with it. > > To reduce merging time, maximize your IndexWriter RAM buffer > (setRAMBufferSizeMB). You could als

Re: Concurrent indexing performance problem

2013-03-07 Thread Simon Willnauer
On Thu, Mar 7, 2013 at 7:06 PM, Jan Stette wrote: > Thanks for your suggestions, Mike, I'll experiment with the RAM buffer size > and segments-per-tier settings and see what that does. > > The time spent merging seems to be so great though, that I'm wondering if > I'm actually better off doing the

Re: Field seems to have become binary field on update to Lucene 4.1

2013-02-19 Thread Simon Willnauer
phew! thanks for clarifying simon On Tue, Feb 19, 2013 at 11:19 PM, Paul Taylor wrote: > On 19/02/2013 20:56, Paul Taylor wrote: >> >> >> Strange test failure after converting code from Lucene 3.6 to Lucene 4.1 >> >> public void testIndexPuid() throws Exception { >> >> addReleaseOne(); >

Re: IndexSearcher.close() removed in 4.0

2013-02-19 Thread Simon Willnauer
, Eric Charles wrote: > Hi, > Why not having the IS#close() calling the wrapped IR#close() ? > > I would be happier having to only deal with the Searcher once created and > forget it wraps a Reader: I create a Searcher, I close it. > > Thx, Eric > > > On 18/02/201

Re: Need Help:How to Get the enumeration of Terms Ending with a given word

2013-02-18 Thread Simon Willnauer
On Thu, Feb 14, 2013 at 11:42 AM, VIGNESH S wrote: > Hi, > > I have two questions > > 1.How to Get the enumeration of Terms Ending with a given word > I saw we can get enumerations of word starting at a given word by > Indexreader.terms(term())) method unless you want to iterate all terms and che

Re: IndexSearcher.close() removed in 4.0

2013-02-18 Thread Simon Willnauer
On Mon, Feb 18, 2013 at 7:32 PM, saisantoshi wrote: > I understand from the JIRA ticket(Lucene-3640) that the IndexSearcher.close() > is no-op operation but not very clear on why it is a no-op? Could someone > shed some light on this? We were using this method in the older versions and > is it saf

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-25 Thread Simon Willnauer
On Fri, Jan 25, 2013 at 3:29 PM, saisantoshi wrote: > Thanks a lot. If we want to wrap TopScoreDocCollector into > PositiveScoresOnlyCollector. Can we do that? > I need only positive scores and I dont think topscore collector can handle > by itself right? > I guess so! But how do you get neg. sco

Re: how to add attributes to a field just like term's payload ?

2013-01-25 Thread Simon Willnauer
Directory directory = ... final SegmentInfos sis = new SegmentInfos(); sis.read(directory); Map commitUserData = sis.getUserData(); simon On Fri, Jan 25, 2013 at 2:32 AM, wgggfiy wrote: > hello, but there is no getCommitUserData in IndexReader, > how can I get the userdata ?? > thx > > > > -

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-25 Thread Simon Willnauer
hey, you don't need to set the indexreader in the constructor. An AtomicReader is passed in for each segment to Collector#setNextReader(AtomicReaderContext) If you want to use a given collector and extend it with some custom code in collect I would likely write a delegate Collector like this: pub

Re: StoredFieldsFormat / documentation

2013-01-24 Thread Simon Willnauer
Hi Bernd, On Thu, Jan 24, 2013 at 9:30 AM, Bernd Müller wrote: > Hello, > > In the lucene 4.1 release, there was introduced a compression for > stored fields as described here: > https://issues.apache.org/jira/browse/LUCENE-4226 yeah that is correct, its the new default. if you use Lucene 4.1 t

Re: Tool for Lucene storage recovery

2013-01-18 Thread Simon Willnauer
hey, do you wanna open a jira issue for this and attach your code? this might help others too and if the shit hits the fan its good to have something in the lucene jar that can bring some data back. simon On Fri, Jan 18, 2013 at 6:37 PM, Michał Brzezicki wrote: > in lucene (*.fdt). Code is avail

Re: Excessive use of IOException without proper documentation

2012-11-04 Thread Simon Willnauer
te both: >> >> 1. Oracle's JavaDoc Style Guide: http://www.oracle.com/** >> technetwork/java/javase/**documentation/index-137868.**html#throwstag<http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html#throwstag> >> 2. Joshua Bloch'

Re: Excessive use of IOException without proper documentation

2012-11-02 Thread Simon Willnauer
Hey, On Fri, Nov 2, 2012 at 2:20 PM, Michael-O <1983-01...@gmx.net> wrote: > Hi, > > why does virtually every method (exaggerating) throw an IOE? I know there > might be a failure in the underlying IO (corrupt files, passing checked exc > up, etc) but > > 1. Almost none of the has a JavaDoc on i

Re: 4.0 tokenStream or SimpleAnalyzer bug?

2012-11-02 Thread Simon Willnauer
hey scott, this is intentional see the javadoc step 2: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html simon On Fri, Nov 2, 2012 at 2:07 AM, Scott Smith wrote: > I was doing some tokenizer/filter analysis attempting to fix a bug I have in > highlighting und

Re: How to correctly use SearcherManager#close?

2012-11-01 Thread Simon Willnauer
hey michael, On Thu, Nov 1, 2012 at 11:30 PM, Michael-O <1983-01...@gmx.net> wrote: > Thanks for the quick response. Any chance this could be clearer in the > JavaDoc of this class? sure thing, do you wanna open an issue / create a patch? I am happy to commit it. simon > >> Call it when you kno

Re: Norms and Term Vectors in Lucene 4.0

2012-10-30 Thread Simon Willnauer
hey scott, On Mon, Oct 29, 2012 at 11:56 PM, Scott Smith wrote: > Converting some code to lucene 4.0, it appears that we can no longer set > whether we want to store norms or termvectors using the "sugared" Field > classes (e.g., StringField() and TextField). I gather the defaults are to > st

Re: Term Positions added to one document forward

2012-10-29 Thread Simon Willnauer
you should call currDocsAndPositions.nextPosition() before you call currDocsAndPositions.getPayload() payloads are per positions so you need to advance the pos first! simon On Mon, Oct 29, 2012 at 6:44 PM, Ivan Vasilev wrote: > Hi Guys, > > I use the following code to index documents and set Pa

Re: Scoring based on document

2012-10-23 Thread Simon Willnauer
hey there, in Lucene 4 you can override the termStatistics / CollectionStatistics used for scoring in the IndexSearcher. You can take multiple fields into account here in order to use them for scoring. Here is the javadoc link: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/IndexSe

Re: How could i take into account the other part of a field which not matches with the query

2012-10-14 Thread Simon Willnauer
hey. On Sun, Oct 14, 2012 at 1:51 PM, emmanuel Gosse wrote: >> >> Hi, > > > >> How could i take into account in a query the fact that the searched words >> could be more precise in a document field than an other. >> > > example : > 2 documents : > doc1 : title : taxi > doc2 : title : taxi driver

Re: I want to know that why transform numeric to string

2012-09-24 Thread Simon Willnauer
quick answer, Lucene only operates on strings (from a high level perspective) simon On Fri, Sep 21, 2012 at 11:54 AM, 惠达 王 wrote: > hi all: > I want to know that why transform numeric to string? > > public static int longToPrefixCoded(final long val, final int shift, > final BytesRef bytes)
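To make the "only strings" answer concrete, here is a minimal sketch of one way a long can be turned into a string whose lexicographic order matches numeric order. It is a simplified illustration and an assumption on my part, not the encoding used by the `longToPrefixCoded` method quoted above (which, as its signature shows, also takes a `shift` and writes into a `BytesRef`). The class name `SortableLong` is hypothetical.

```java
// Simplified illustration, not Lucene's prefix coding: encode a long as a
// fixed-width hex string so that comparing strings gives the same order as
// comparing the original numbers.
class SortableLong {
    static String encode(long value) {
        // Flipping the sign bit maps the signed range onto the unsigned range
        // in the same order; zero-padding keeps every string 16 chars wide.
        return String.format("%016x", value ^ Long.MIN_VALUE);
    }

    static long decode(String encoded) {
        return Long.parseUnsignedLong(encoded, 16) ^ Long.MIN_VALUE;
    }
}
```

Because every encoded value has the same width, a term dictionary sorted as plain strings also ends up sorted numerically, which is what makes range queries over such terms possible.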

Re: Directory flushing / commit / openIfChanged

2012-08-06 Thread Simon Willnauer
hey harald, On Mon, Aug 6, 2012 at 1:22 PM, Harald Kirsch wrote: > Hi, > > in my application I have to write tons of small documents to the index, but > with a twist. Many of the documents are actually aggregations of pieces of > information that appear in a data stream, usually close together, b

Re: questions about DocValues in 4.0 alpha

2012-08-06 Thread Simon Willnauer
hey, On Mon, Aug 6, 2012 at 11:34 AM, Li Li wrote: > hi everyone, > in lucene 4.0 alpha, I found the DocValues are available and gave > it a try. I am following the slides in > http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene > I have g

Re: Problem with near realtime search

2012-08-05 Thread Simon Willnauer
ms I saw. >> >> So docIds can definitively change under the hood? >> >> Harald. >> >> >> Am 03.08.2012 17:24, schrieb Simon Willnauer: >>> >>> hey harald, >>> >>> if you use a possibly different searcher (reader) than you us

Re: Problem with near realtime search

2012-08-03 Thread Simon Willnauer
hey harald, if you use a possibly different searcher (reader) than you used for the search you will run into problems with the doc IDs since they might change during the request. I suggest you use SearcherManager or NRTManager and carry on the searcher reference when you collect the stored val

Re: Analyzer on query question

2012-08-03 Thread Simon Willnauer
On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky wrote: > Hi, > > I understand that generally speaking you should use the same analyzer on > querying as was used on indexing. In my code I am using the SnowballAnalyzer > on index creation. However, on the query side I am building up a complex > Bo

Re: Memory leak?? with CloseableThreadLocal with use of Snowball Filter

2012-08-01 Thread Simon Willnauer
On Thu, Aug 2, 2012 at 7:53 AM, roz dev wrote: > Thanks Robert for these inputs. > > Since we do not really Snowball analyzer for this field, we would not use > it for now. If this still does not address our issue, we would tweak thread > pool as per eks dev suggestion - I am bit hesitant to do th

Re: Usage of NoMergePolicy and its potential implications

2012-07-25 Thread Simon Willnauer
On Mon, Jul 23, 2012 at 7:00 PM, snehal.chennuru wrote: > Thanks for the heads up Ian. I know it is highly discouraged. But, like I > said, it is a legacy application and it is very hard to go back and re-do > it. you really shouldn't do that! If you use lucene as a Primary key generator why don'

Re: FixedStraightBytesImpl - flushing

2012-07-24 Thread Simon Willnauer
hey SimonM :) On Mon, Jul 23, 2012 at 6:37 PM, Simon McDuff wrote: > > Hello, (LUCENE 4.0.0-ALPHA) > > We are using the DocValues features (very nice). cool! > > We are using FixedBytesRef. > > In that specific case, we were wondering why does it flush at the end (when > we commit) ? the reas

Re: Flushing Thread

2012-07-21 Thread Simon Willnauer
) >> >> >> On Fri, Jul 20, 2012 at 2:29 AM, Simon McDuff wrote: >> > >> > Thank you Simon Willnauer! >> > >> > With your explanation, we`ve decided to control the flushing by spawning >> > another thread. So the thread is available

Re: Flushing Thread

2012-07-20 Thread Simon Willnauer
hey simon ;) On Fri, Jul 20, 2012 at 2:29 AM, Simon McDuff wrote: > > Thank you Simon Willnauer! > > With your explanation, we`ve decided to control the flushing by spawning > another thread. So the thread is available to still ingest ! :-) (correct me > if I'm wrong)W

Re: Flushing Thread

2012-07-19 Thread Simon Willnauer
hey, On Thu, Jul 19, 2012 at 7:41 PM, Simon McDuff wrote: > > Thank you for your answer! > > I read all your blogs! It is always interesting! for details see: http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/ and http://www.searchworkings.org/blog/-/blog

Re: RAM or SSD...

2012-07-18 Thread Simon Willnauer
On Wed, Jul 18, 2012 at 9:05 PM, Tim Eck wrote: > Rum is an essential ingredient in all software systems :-) Absolutely! :) simon > > -Original Message- > From: Simon Willnauer [mailto:simon.willna...@gmail.com] > Sent: Wednesday, July 18, 2012 11:49 AM > To: java-user

Re: RAM or SSD...

2012-07-18 Thread Simon Willnauer
1. use mmap directory 2. buy rum 3. get an SSD simon :) On Wed, Jul 18, 2012 at 8:36 PM, Vitaly Funstein wrote: > You do not want to store 30 G of data in the JVM heap, no matter what > library does this. > > On Wed, Jul 18, 2012 at 10:44 AM, Paul Jakubik wrote: >> If only 30GB, go with RAM and

Re: In memory Lucene configuration

2012-07-18 Thread Simon Willnauer
ferent queries (well, some are repeated >> twice or thrice), and includes search time and doc loading (reading the two >> fields I mentioned). The queries are all straight boolean conjunctions, and >> yes, I am dropping the first few queries when calculating averages. >> >&

Re: In memory Lucene configuration

2012-07-16 Thread Simon Willnauer
t 2200 different queries (well, some are repeated >> twice or thrice), and includes search time and doc loading (reading the two >> fields I mentioned). The queries are all straight boolean conjunctions, and >> yes, I am dropping the first few queries when calculating ave

Re: In memory Lucene configuration

2012-07-15 Thread Simon Willnauer
hey there, On Sun, Jul 15, 2012 at 10:41 AM, Doron Yaacoby wrote: > Hi, I have the following situation: > > I have two pretty large indices. One consists of about 1 billion documents > (takes ~6GB on disk) and the other has about 2 billion documents (~10GB on > disk). The documents are very sho

Re: Is creating an analyzer expensive?

2012-07-12 Thread Simon Willnauer
You can safely reuse a single analyzer across threads. The Analyzer class maintains ThreadLocal storage for TokenStreams internally so you can just create the analyzer once and use it throughout your application. simon On Thu, Jul 12, 2012 at 10:13 PM, Dave Seltzer wrote: > I have one more quest
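The reuse pattern described above can be modelled in a few lines of plain Java. This is a toy sketch, not Lucene's `Analyzer` implementation: `SharedAnalyzerModel` and `Components` are hypothetical stand-ins for an analyzer and its per-thread reusable token stream.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the pattern, not Lucene's Analyzer class: a single shared
// object hands each thread its own reusable component via a ThreadLocal.
class SharedAnalyzerModel {
    // Counts how many per-thread components have been built so far.
    final AtomicInteger created = new AtomicInteger();

    // Hypothetical stand-in for a reusable, non-thread-safe TokenStream.
    static final class Components {
        final StringBuilder buffer = new StringBuilder();
    }

    private final ThreadLocal<Components> perThread = ThreadLocal.withInitial(() -> {
        created.incrementAndGet();
        return new Components();
    });

    // Returns the calling thread's component, creating it on first use only.
    Components components() {
        return perThread.get();
    }
}
```

Repeated calls from one thread return the same `Components` instance, while a second thread gets its own, so the shared object can be created once and used application-wide without locking around the mutable state.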

Re: delete by docid in lucene 4

2012-07-12 Thread Simon Willnauer
eleteDocument(int docId) in IndexWriter. >> It seems like it would be easy to add as DocumentsWriter already has a >> deletedDocID. I can file a jira and submit a patch if this is something > that you >> guys would accept. >> >> Sean >> >> On Thu, Jul 12,

Re: delete by docid in lucene 4

2012-07-12 Thread Simon Willnauer
it to another machine, which keeps the index forever. Before >> we >>> upload the index, we forceMerge(1) on it, and gather some stats about the >>> index like max,min serial id, total documents. While calculating max and >> min >>> serial id, if we see a duplicate serial

Re: BrazilianAnalyzer don't woks with any BooleanQuery

2012-07-12 Thread Simon Willnauer
can you tell us more about your index side of things? Are you using positions in the index since I see PhraseQuery in your code? Where are you passing the text you are searching for to the BrazilianAnalyzer, I don't see it in your code. You need to process your text at search time too to get results

Re: delete by docid in lucene 4

2012-07-12 Thread Simon Willnauer
On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges wrote: > Is it possible to delete by docId in lucene 4? I can delete by docid > in lucene 3 using IndexReader.deleteDocument(int docId), but that > method is gone in lucene 4, and IndexWriter only allows deleting by > Term or Query. that is correct.

Re: index searcher leading to system freeze ?

2012-07-11 Thread Simon Willnauer
are you closing your underlying IndexReaders properly? simon On Wed, Jul 11, 2012 at 5:04 AM, Yang wrote: > I'm running 8 index searchers java processes on a 8-core node. > They all read from the same lucene index on local hard drive. > > > the index contains about 20million docs, each doc is

Re: Problems with hundreds of BLOCKED threads.

2012-07-09 Thread Simon Willnauer
]} arrays. * This class is optimized for small memory-resident indexes. * It also has bad concurrency on multithreaded environments. simon On Sat, Jul 7, 2012 at 1:29 PM, Simon Willnauer wrote: > On Fri, Jul 6, 2012 at 9:28 PM, Leon Rosenberg > wrote: >> Hello, >> >> we ha

Re: Problems with hundreds of BLOCKED threads.

2012-07-09 Thread Simon Willnauer
On Fri, Jul 6, 2012 at 9:28 PM, Leon Rosenberg wrote: > Hello, > > we have a small internet shop which uses lucene for product search. > With increasing traffic we have continuos problem with literaly > hundreds of threads being BLOCKED in lucene code: > > here is an example taken with jstack on p

Re: about .frq file format in doc

2012-06-27 Thread Simon Willnauer
see definitions: http://lucene.apache.org/core/3_6_0/fileformats.html#Definitions simon On Wed, Jun 27, 2012 at 6:08 PM, Simon Willnauer wrote: > a term in this context is a (field,text) tuple - does this make sense? > simon > > On Wed, Jun 27, 2012 at 11:40 AM, wangjing wr

Re: about .frq file format in doc

2012-06-27 Thread Simon Willnauer
a term in this context is a (field,text) tuple - does this make sense? simon On Wed, Jun 27, 2012 at 11:40 AM, wangjing wrote: > http://lucene.apache.org/core/3_6_0/fileformats.html#Frequencies > > The .frq file contains the lists of documents which contain each term, > along with the frequency o

Re: what is the fdx file exactly mean

2012-06-25 Thread Simon Willnauer
see http://lucene.apache.org/core/3_6_0/fileformats.html#field_index for file format documentation. simon On Mon, Jun 25, 2012 at 5:28 AM, wangjing wrote: > .fdx file  contains, for each document, a pointer to its field data. > > BUT fdx is contains pointer to WHAT? it's a pointer of  field data

Re: lucene (search) performance tuning

2012-05-26 Thread Simon Willnauer
On Sat, May 26, 2012 at 2:59 AM, Yang wrote: > I tested with more threads / processes. indeed this is completely > cpu-bound, since running 1 thread gives the same latency as 4 threads (my > box has 4 cores) > > > given this, is there any way to simplify the scoring computation (i'm only > using l

Re: IndexReader.deleteDocument(Term) in Lucene 3.6/4.0

2012-05-25 Thread Simon Willnauer
hey, On Fri, May 25, 2012 at 2:45 PM, Nikolay Zamosenchuk wrote: > Hi everyone. We are using IndexReader.deleteDocument(Term) method to > delete documents, since it returns the number of deleted documents. > This is used to be sure that some docs were removed. We must know for > sure if documents

Re: FilterClause serializable

2012-05-21 Thread Simon Willnauer
we removed almost all serializable from lucene since it was causing many problems and wasn't complete either. users should serialize classes / logic themselves or use higher level impls that deal with that already. simon On Mon, May 21, 2012 at 1:05 PM, Lars Gjengedal wrote: > Hi > > I have not bee

Re: Lucene's internal doc ID space

2012-05-12 Thread Simon Willnauer
On Fri, May 11, 2012 at 7:56 AM, Jong Kim wrote: > When I update a document in Lucene (i.e., re-indexing), I have to delete > the existing document, and create a new one. My understanding is that this > assigns a new doc ID for the newly created document. If that is the case, > is it true that the

Re: weird multifile problems

2012-04-06 Thread Simon Willnauer
Hey, do you get multiple files per segment or multiple files per index? The compoundfile system writes a .cfs file (and a .cfe file in trunk) per segment. So if you are seeing multiple .cfs files Lucene is actually doing what you want. If there are files like .fdt/fdx or tii/tis then the segment is

Re: retrieved doc field values being cached?

2012-02-24 Thread Simon Willnauer
Hey Stuart, Lucene solely relies on the FS cache with some exceptions for the term-dictionary and FieldCache which is pulled entirely into memory. FieldCache is not used to retrieve stored fields though, it's rather an uninverted view (docID -> value) of an indexed (inverted) field. So basically wha

Re: Index writing performance of 3.5

2012-02-09 Thread Simon Willnauer
one major thing that changed from 3.0.3 to 3.5 is that we use TieredMergePolicy by default. can you try to use the same merge policy on both 3.0.3 and 3.5 and report back? ie LogByteSizeMergePolicy or whatever you are using... simon On Thu, Feb 9, 2012 at 5:28 AM, Vitaly Funstein wrote: > Hello,

Re: NRTManager and AlreadyClosedException

2012-02-08 Thread Simon Willnauer
are you closing the NRTManager while other threads are still accessing the SearcherManager? simon On Wed, Feb 8, 2012 at 1:48 PM, Cheng wrote: > I use it exactly the same way. So there must be other reason causing the > problem. > > On Wed, Feb 8, 2012 at 8:21 PM, Ian Lea wrote: > >> Releasing a se

Re: Multiple document types

2012-01-25 Thread Simon Willnauer
of the url, so that the url would > determine which index was to be loaded by the dataimport command. seems like you should look at solr's multicore feature: http://wiki.apache.org/solr/CoreAdmin simon > > F > > -Original Message- > From: Simon Willnauer [mailt

Re: Cleaning up writer after certain idle time?

2012-01-25 Thread Simon Willnauer
Hey, On Wed, Jan 25, 2012 at 11:01 PM, Cheng wrote: > Hi, > > I am using multiple writer instances in a web service. Some instances are > busy all the time, while some aren't. I wonder how to configure the writer > to dissolve itself after a certain time of idling, say 30 seconds. what do you me

Re: Multiple document types

2012-01-25 Thread Simon Willnauer
hey Frank, can you elaborate what you mean by different doc types? Are you referring to an entity ie. a table per entity to speak in SQL terms? in general you should get better responses for solr related questions on solr-u...@lucene.apache.org simon On Wed, Jan 25, 2012 at 10:49 PM, Frank DeRos

Re: How NRTManagerReopenThread works with Java Executor framework?

2012-01-15 Thread Simon Willnauer
I think the question is more related to the reopen thread. This class directly extends Thread and instead of calling Thread#start() directly you can simply pass it to the Executor since it implements Runnable - is that what you are asking for? simon On Sun, Jan 15, 2012 at 7:53 PM, Michael McCandl
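A minimal sketch of the point about Thread and Runnable, using only the JDK. `ReopenThreadSketch` is a hypothetical stand-in for the reopen thread, not the Lucene class; its loop only counts rounds where a real reopen thread would refresh the searcher.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in: extends Thread, and Thread implements Runnable,
// so an instance can be handed to an Executor instead of being started.
final class ReopenThreadSketch extends Thread {
    private final AtomicInteger reopens = new AtomicInteger();
    private final int rounds;

    ReopenThreadSketch(int rounds) {
        this.rounds = rounds;
    }

    @Override
    public void run() {
        for (int i = 0; i < rounds; i++) {
            reopens.incrementAndGet(); // placeholder for the actual reopen work
        }
    }

    int reopenCount() {
        return reopens.get();
    }

    // Runs the loop on an executor worker; Thread#start() is never called.
    static int runOnExecutor(int rounds) {
        ReopenThreadSketch task = new ReopenThreadSketch(rounds);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(task); // the pool's worker thread invokes task.run()
        executor.shutdown();
        try {
            if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("reopen loop did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return task.reopenCount();
    }

    public static void main(String[] args) {
        System.out.println(runOnExecutor(3));
    }
}
```

One consequence of this approach: the Thread object is used purely as a Runnable, so settings on it such as its name, priority or daemon flag have no effect; the executor's own worker thread does the work.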

Call for Submission Berlin Buzzwords 2012 - http://berlinbuzzwords.de

2012-01-11 Thread Simon Willnauer
ittee Chairs:  *  Isabel Drost (Nokia & Apache Mahout)  *  Jan Lehnardt (CouchBase & Apache CouchDB)  *  Simon Willnauer (SearchWorkings & Apache Lucene)  *  Grant Ingersoll (Lucid Imagination & Apache Lucene)  *  Owen O’Malley (Yahoo Inc. & Apache Hadoop)  *  Jim Webber (Neo Tec

Heads Up - Index File Format Change on Trunk

2012-01-05 Thread Simon Willnauer
Folks, I just committed LUCENE-3628 [1] which cuts over Norms to DocValues. This is an index file format change and if you are using trunk you need to reindex before updating. happy indexing :) simon [1] https://issues.apache.org/jira/browse/LUCENE-3628 -

Re: Comparing Indexing Speed of Lucene 3.5 and 4.0

2012-01-05 Thread Simon Willnauer
hey peter, On Wed, Jan 4, 2012 at 12:52 AM, Peter K wrote: > Thanks Simon for your answer! > >> as far as I can see you are comparing apples and pears. > > When excluding the waiting time I also get the slight but reproducible > difference**. The times for waitForGeneration are nearly the same > (

Re: IndexDocValues and storing Stats

2012-01-04 Thread Simon Willnauer
Hey, On Wed, Jan 4, 2012 at 1:15 PM, Hany Azzam wrote: > Hi, > > I am experimenting with the Lucene trunk (aka 4.0), especially with the new > IndexDocValues feature. I am trying to store some query-independent > statistics such as PageRank, etc. One stat that I am trying to store is the > sum

Re: Comparing Indexing Speed of Lucene 3.5 and 4.0

2012-01-03 Thread Simon Willnauer
hey Peter, as far as I can see you are comparing apples and pears. Your comparison is waiting for merges to finish and if you are using multiple threads lucene 4.0 will flush more segments to disk than 3.5 so what you are seeing is likely a merge that is still trying to merge small segments. can y

Re: Help running out of files

2012-01-02 Thread Simon Willnauer
hey charlie, there are a couple of wrong assumptions in your last email, mostly related to merging. mergefactor = 10 doesn't mean that you end up with one file, nor is it related to files at all. Yet, my first guess is that you are using CompoundFileSystem (CFS) so each segment corresponds to a

Re: Table Defn and/or ER Diagram of Segment files

2011-12-19 Thread Simon Willnauer
I think you are confusing something here. BDB can be used as a "Directory" implementation but a Directory is a simple "blob" store. BDB only stores binary BLOB which corresponds to a file. AFAIK we dropped the BDB support entirely a couple of releases ago. In Lucene you can think of one large table

Re: Lucene 4.0 questions, was: shift bug in possibly invalid use of NumericTokenStream

2011-12-19 Thread Simon Willnauer
On Mon, Dec 19, 2011 at 9:04 PM, Simon Willnauer wrote: > On Mon, Dec 19, 2011 at 5:03 PM, Peter Karich wrote: >> Hi Uwe, >> >> thanks for the talk suggestion(s)*. >> >> I was using it for faster term lookups of a long 'id'. How would this be >> d

Re: Lucene 4.0 questions, was: shift bug in possibly invalid use of NumericTokenStream

2011-12-19 Thread Simon Willnauer
On Mon, Dec 19, 2011 at 5:03 PM, Peter Karich wrote: > Hi Uwe, > > thanks for the talk suggestion(s)*. > > I was using it for faster term lookups of a long 'id'. How would this be > done with 4.0? Before I did it via Term: > > new Term(fieldName, NumericUtils.longToPrefixCoded(longValue)); > > How

Re: Obtaining IDF values for the terms in a document set

2011-12-15 Thread Simon Willnauer
On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary wrote: > We have a large set of documents that we would like to index with a > customized stopword list. We have run tests by indexing a random set of about > 10% of the documents, and we'd like to generate a list of the terms in that > smaller set

Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

2011-12-13 Thread Simon Willnauer
   1       20       761.95 >      262.49    63,139,256     91,881,472 > > The performance is slightly better than the one using StandardAnalyzer,  but   > this is still much worse than the performance with 2.4.1. > > Sean > > -Original Message- > From: Simon Willnauer [
