Re: Under the hood of SpanQueries

2013-04-11 Thread Karsten F.
Hi Igor, about your performance problem with SpanQueries and Payloads: try to filter with the corresponding BooleanQuery and use a profiler. You have an I/O bottleneck because position and payload information is read per document. Possibly it would help if you first filter out the obviously

Re: ComplexPhraseQueryParser (Expanded Form and Boosting)

2010-02-02 Thread Karsten F.
Hi Nariman, in my understanding of ComplexPhraseQueryParser, this class is no longer supported. http://issues.apache.org/jira/browse/LUCENE-1486#action_12782254 Instead, with lucene 3.1 the new org.apache.lucene.queryParser.standard.parser.StandardSyntaxParser will do this job.

Search a PhraseQuery on multiple terms with the same position

2010-01-28 Thread Karsten F.
Hi, I have a problem with the checkedRepeats in SloppyPhraseScorer. This feature is for phrases like "1st word 2nd word". Without this feature the result would be the same as for "1st word 2nd". OK. But I have an index with more than one token on the same position. The German sentence Die käuflichen

RE: Faceting, Sort and DocIDSet

2009-04-22 Thread Karsten F.
Hi Dave, facets: in your case a solution with one int[IndexReader.maxDoc()] fits. For each document number you can store an integer which represents the facet value. This is what org.apache.solr.request.UnInvertedField will store in your case. (*John* : is there something similar in
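
The one-integer-per-document idea above can be sketched in plain Java. This is only an illustration under assumptions: the class and method names are hypothetical, no Lucene or Solr classes are used, and the BitSet stands in for whatever structure holds the hits of a query.

```java
import java.util.BitSet;

/** Plain-Java sketch of single-valued facet counting; hypothetical names, no Lucene classes. */
final class FacetCountSketch {

    /**
     * facetOfDoc[doc] holds the facet ordinal of document number doc
     * (one slot per document number, so its length plays the role of maxDoc).
     * hits marks the document numbers matched by the query.
     * Returns the number of hits per facet ordinal.
     */
    static int[] countFacets(int[] facetOfDoc, BitSet hits, int facetCount) {
        int[] counts = new int[facetCount];
        // Walk only the set bits: one array lookup per hit, no document loading.
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            counts[facetOfDoc[doc]]++;
        }
        return counts;
    }
}
```

Counting touches only two in-memory arrays, which is why this stays fast even for large hit sets; the memory cost is four bytes per document.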

RE: Faceting, Sort and DocIDSet

2009-04-20 Thread Karsten F.
Hi David, correct: you should avoid reading the content of a document inside a HitCollector. Normally that means caching everything you need in main memory. A facet with only 255 possible values and exactly one value per document is very simple and fast. In this case you need only an

Re: Faceting, Sort and DocIDSet

2009-04-18 Thread Karsten F.
Hi Dave, searching and sorting in lucene are two separate functions (if you do not want to sort by relevance). You will not lose performance if you first search with a BitSet as HitCollector and then sort the result by DateField. But it is easier to extend TopFieldDocCollector/TopFieldCollector to a
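
The two separate steps (collect first, sort afterwards) might look roughly like this in plain Java. It is a sketch under assumptions: the names are hypothetical, the int array of matching documents stands in for the calls a HitCollector would receive, and dateOfDoc is an assumed in-memory cache of one date value per document number.

```java
import java.util.BitSet;

/** Plain-Java sketch of "collect into a BitSet, then sort by date"; hypothetical names. */
final class CollectThenSortSketch {

    /** Step 1: the collector only records document numbers; no scoring, no sorting. */
    static BitSet collect(int[] matchingDocs, int maxDoc) {
        BitSet hits = new BitSet(maxDoc);
        for (int doc : matchingDocs) {
            hits.set(doc);
        }
        return hits;
    }

    /** Step 2: order the collected documents by their cached date value, ascending. */
    static int[] sortByDate(BitSet hits, long[] dateOfDoc) {
        return hits.stream()
                .boxed()
                .sorted((a, b) -> Long.compare(dateOfDoc[a], dateOfDoc[b]))
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```

Because the sort reads only the cached array, the search step does not get slower when a sort order is added later.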

Re: Taxonomy in Lucene

2009-04-18 Thread Karsten F.
Hi John, I intended to compare xtf with hierarchical facet browsing in browseengine (selection expansion). I found PathFacetCountCollector/PathFacetHandler#getFacetsForPath, and I think that the implementation in xtf has a lot of advantages. So I suggest you reuse the xtf-source for that

Re: Best Practice for Lucene Search

2009-02-02 Thread Karsten F.
Hi ilwes, did you notice the thread http://www.nabble.com/Lucene-vs.-Database-td19755932.html ? I think it is useful for the question about using lucene stored fields even if you already have the information in the DB. Best regards Karsten ilwes wrote: Hello, I googled, searched this

Re: cross-field AND queries with field boosting

2009-01-28 Thread Karsten F.
Hi Murali, I think a search with 4 * 5 = 20 Boolean clauses will not be a performance problem (at least if you have only one optimized index folder). You could also use one field which contains the content of all other fields with a boost factor for each term (different boost for content from

Re: Taxonomy in Lucene

2008-12-12 Thread Karsten F.
Hi John, I will take a look at the bobo-browse source code at the weekend. Do you know the xtf implementation of faceted browsing: starting point is org.cdlib.xtf.textEngine.facet.GroupCounts#addDoc ? (It works with millions of facet values on millions of hits) What is the starting point in

Re: Taxonomy in Lucene

2008-12-11 Thread Karsten F.
Hi glen, possibly you will find this thread interesting: http://groups.google.com/group/xtf-user/browse_thread/thread/beb62f5ff9a16a3a/16044d1009511cda It was about a taxonomy like in your example. Also take a look at the faceted browsing on date in

Re: Taxonomy in Lucene

2008-12-10 Thread Karsten F.
Hi Dipak, which kind of taxonomy? What is the difference from faceted browsing in your case? Best regards Karsten Kesarkar, Dipak wrote: Hi I want to include Taxonomy feature in my search. Does Lucene support Taxonomy? How? If not, is there a different way to add Taxonomy

Re: Improving Indexing Performance

2008-12-08 Thread Karsten F.
Hi buFka, take a look at http://wiki.apache.org/lucene-java/ImproveIndexingSpeed e.g. your example does not set mergeFactor or RAMBufferSizeMB. I also like the last tip: "Run a Java profiler", because in my case the lack of performance vanished after I switched from jdom to saxon. (we are

Re: Save big arrays in lucene document

2008-12-05 Thread Karsten F.
Hi Zender, please take a look at http://www.nabble.com/Lucene-vs.-Database-td19755932.html#a19757274 You shouldn't use lucene fields to store such huge data. At least not a lucene field in your main search index. You can use lucene as a repository, but I would advise you to use an extra index

Re: GermanAnalyzer

2008-11-24 Thread Karsten F.
Hi csantos, most probably this is not about lucene: http://java.sun.com/j2se/1.4.2/docs/api/java/lang/AbstractMethodError.html GermanAnalyzer is not part of the normal lucene jar (it is part of lucene-analyzers). In an application server the position of jar files can be important. Please try your

Re: How can I get to the Document for architecture of lucene index.

2008-10-27 Thread Karsten F.
Hi Ohsang, are you looking for http://lucene.apache.org/java/2_4_0/fileformats.html ? Best regards Karsten Kwon, Ohsang wrote: I want to know how lucene stores the data in the index internally. (Lucene's index format changed very often.) I can not find this information in

Re: Use SQL frontend to read lucene index

2008-10-27 Thread Karsten F.
Hi Blured, if you are asking about integration of lucene and a DBMS, possibly compass is something for you http://www.nabble.com/Lucene-vs.-Database-tp19755932p19758736.html If you think about using hibernate: I think there already exists a lucene connector, so you don't have to use jdbc. if

RE: Use SQL frontend to read lucene index

2008-10-27 Thread Karsten F.
Hi Blured, sorry, I don't know anything about eclipse birt. I recommend starting a new thread "eclipse birt with lucene" where you describe your problem again in detail. Be aware that lucene doesn't know numerical values. lucene only knows strings. Best regards Karsten blured blured wrote:

Re: No hits for longer search strings

2008-10-16 Thread Karsten F.
Hi Chris, most likely this is not a lucene problem. Did you look with luke at the stored fields of your document? Please take a second look with luke at the terms of your field 'unique_id' (with Show top terms): What do you see? Best regards Karsten btw: why do you use the prefix search? This

Re: highlighter / fragmenter performance for large fields

2008-10-16 Thread Karsten F.
Hi Brian, I don't know the internals of highlighting ("explanation") in lucene. But I know that XTF ( http://xtf.wiki.sourceforge.net/underHood_Documents#tocunderHood_Documents5 ) can handle highlighting in very large documents (above 100 Mbyte) very fast. The difference from your approach is,

RE: Searching sets of documents

2008-10-14 Thread Karsten F.
Hi spring, the unit of retrieval in lucene is a document. There are no joins between document sets like in sql. What you can do is collect all hits for each term query at the level of folders and then implement the logical "and" or "or" yourself. For this you could reuse the existing
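
Implementing the logical "and"/"or" at folder level could be sketched as follows. This is plain Java under assumptions, not Lucene API: the names are hypothetical, folderOfDoc is an assumed in-memory mapping from document number to folder number, and each int array of hits stands in for the result of one term query.

```java
import java.util.BitSet;

/** Plain-Java sketch of combining term-query hits per folder; hypothetical names. */
final class FolderSetSketch {

    /** Maps the document hits of one term query to the set of folders containing them. */
    static BitSet foldersOfHits(int[] hitDocs, int[] folderOfDoc, int folderCount) {
        BitSet folders = new BitSet(folderCount);
        for (int doc : hitDocs) {
            folders.set(folderOfDoc[doc]);
        }
        return folders;
    }

    /** Folders in which both term queries have at least one hit. */
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    /** Folders in which at least one of the term queries has a hit. */
    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }
}
```

The point of the sketch is that the two terms may match different documents of the same folder, which a single document-level query cannot express.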

Re: Lucene vs. Database

2008-10-01 Thread Karsten F.
Hi agatone, I agree with markharw00 that highlighting is the main reason to store fields in lucene. I want to remind Sascha Fahl that the stored fields in lucene are not inside the inverted index structure. The implementation of stored fields is very simple: A (.fdt)-file with the pairs

Re: Searching substring starting at a fixed position

2008-09-11 Thread Karsten F.
Hi Luther, your question: Is there a way to ask Lucene to search starting from a fixed position? The answer: no, not by standard search. But you don't want to use your field for scoring. So this is a field to filter results. You could easily change RangeFilter for this purpose but the new

Re: Merging indexes - which is best option?

2008-09-08 Thread Karsten F.
Hi Antony, I decided first to delete all duplicates from master(iW) and then to insert all temporary indices(other). Any other opinions? Best regards Karsten code public static synchronized void merge(IndexWriter iW, Directory[] other, final String uniqueID_FieldName) throws IOException{

Re: Newbie question: using Lucene to index hierarchical information.

2008-09-08 Thread Karsten F.
queries on that? If Lucene isn't the right tool for this job, maybe some other toolkit would be more useful (possibly on top of Lucene). Thanks in advance for any suggestions and comments. I would appreciate any ideas and directions to look into. On Tue, Sep 2, 2008 at 11:46 AM, Karsten F

Re: Injecting additional tokens

2008-09-02 Thread Karsten F.
Hi Markus, hopefully someone will tell you the predefined Filter for this. I only want to agree that a filter is the correct place for this, and that you should be aware of the Token positions (after your filter you must have two Tokens on the same position). I think WordDelimiterFilter is a

Re: Newbie question: using Lucene to index hierarchical information.

2008-09-02 Thread Karsten F.
Hi Leonid, what kind of query is your use case? Complex scenario: You need all the hierarchical structure information in one query. This means you want to search with xpath in a real xml database. (like: All documents with a subtitle XY which contains directly after this subtitle a table with

Re: Index types

2008-09-01 Thread Karsten F.
Hi John, I am not sure about the way Solr implements range queries. But it looks like Solr is using org.apache.lucene.search.ConstantScoreRangeQuery, which itself is using org.apache.lucene.search.RangeFilter. So Solr does not rewrite the query to a large Boolean SHOULD, but it is reading all

Re: Index types

2008-08-27 Thread Karsten F.
Hi John, about integrating other index implementations: Sounds like you need a DBMS with some lucene features. There was a post about using lucene in Oracle: http://www.nabble.com/Using-lucene-as-a-database...-good-idea-or-bad-idea--to18703473.html#a18741137 and

Re: Clarification about segments

2008-08-23 Thread Karsten F.
Hi David, this is not true, please take a look at IndexWriter#setRAMBufferSizeMB and IndexWriter#setMaxBufferedDocs. But you can produce 9 segments (each with only one document) if you call IndexWriter#flush or IndexWriter#commit after each addDocument. So from my knowledge about lucene there

Re: Testing for field existence

2008-08-18 Thread Karsten F.
Hi Bill, you should not use a prefix query (*), because in the first step lucene would generate a list of all terms in this field and then search for all these terms, which is senseless. I would suggest inserting a new field myFields which contains as value the names of all fields for this
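
The extra-field idea might look like this at indexing time. This is a plain-Java sketch under assumptions: a document is modelled as a simple map from field name to value, the class and method names are hypothetical, and the exact-token check only imitates what a single term query on myFields would do in a real index.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Plain-Java sketch of a "myFields" field listing the field names of a document. */
final class FieldNamesSketch {
    static final String FIELD_NAMES = "myFields";

    /** Returns a copy of the document with an extra field naming every non-empty field. */
    static Map<String, String> withFieldNames(Map<String, String> doc) {
        StringJoiner names = new StringJoiner(" ");
        // TreeMap only gives a stable, sorted order of names for the illustration.
        for (Map.Entry<String, String> e : new TreeMap<>(doc).entrySet()) {
            if (e.getValue() != null && !e.getValue().isEmpty()) {
                names.add(e.getKey());
            }
        }
        Map<String, String> copy = new LinkedHashMap<>(doc);
        copy.put(FIELD_NAMES, names.toString());
        return copy;
    }

    /** Existence test as one exact token match, analogous to one term lookup. */
    static boolean hasField(Map<String, String> indexedDoc, String fieldName) {
        String names = indexedDoc.getOrDefault(FIELD_NAMES, "");
        return Arrays.asList(names.split(" ")).contains(fieldName);
    }
}
```

With such a field, testing for existence costs one term lookup instead of an enumeration of every term in the tested field.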

Re: Indexing sections of TEI XML files

2008-08-13 Thread Karsten F.
Hi A., the starting point of xtf was the TEI format. I am very curious whether you find anything missing for your needs. (I already used it with cocoon.) I never saw a better implementation of xml-aware searching: Each hit knows its exact position inside the indexed (=source) xml-file :-) If you dive into

Re: Results by unique id's

2008-08-12 Thread Karsten F.
Hi Martin, I think you are searching for DuplicateFilter http://www.nabble.com/how-to-get--all-unique--documents-based-on-keyword-feild-to18807014.html Best regards Karsten wysiecki wrote: Hello, thanks for help in advance. my example docs: two fields company_id and content

Re: Per user data store

2008-08-06 Thread Karsten F.
Hi, I want to agree with the advice of using only one index. And I want to add two reasons: 1. Sorting and caching work with the lucene document numbers. In the case of lucene, warming up means that a lot of int arrays and bitsets are stored in main memory. If you are using different MultiReader

Re: folder path prefix filtering

2008-08-05 Thread Karsten F.
Hi Nico Krijnen, I think it is ok to store a filter for each user session in memory. And I think that a cached filter is the correct approach for permissions. (extra memory usage = one bit for each user and each document) Hopefully someone with more experience will also answer your question.
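
A per-session cached permission filter with one bit per document could be sketched like this. It is plain Java under assumptions: all names are hypothetical, the loader function stands in for whatever computes the allowed documents of a user (for example from folder path prefixes), and the BitSet of hits stands in for a query result.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Plain-Java sketch of a per-session permission filter cache; hypothetical names. */
final class PermissionFilterCacheSketch {
    private final Map<String, BitSet> allowedBySession = new ConcurrentHashMap<>();
    private final Function<String, BitSet> loader;

    /** loader computes the allowed document numbers for a session; called once per session. */
    PermissionFilterCacheSketch(Function<String, BitSet> loader) {
        this.loader = loader;
    }

    /** Returns the hits the session may see; neither the hits nor the cached bits are modified. */
    BitSet filter(String sessionId, BitSet hits) {
        BitSet allowed = allowedBySession.computeIfAbsent(sessionId, loader);
        BitSet visible = (BitSet) hits.clone();
        visible.and(allowed);
        return visible;
    }

    /** Frees the bits of a finished session. */
    void endSession(String sessionId) {
        allowedBySession.remove(sessionId);
    }
}
```

The expensive permission computation runs once per session; every later search only pays for a bitwise AND over the document numbers.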

Re: Using lucene as a database... good idea or bad idea?

2008-07-31 Thread Karsten F.
Hi Grant, you mentioned jackrabbit as an example of storing data in lucene. I did not find anything like that in the source code. I found LocalFileSystem and DatabaseFileSystem. (I found lucene for indexing and searching.) Have I overlooked something? Best regards Karsten Grant

Re: Using lucene as a database... good idea or bad idea?

2008-07-31 Thread Karsten F.
Hi Ganesh, in this thread nobody said that lucene is a good storage server, only that it could be used as a storage server (Grant: Connect data storage with simple, fast lookup and Lucene..) I don't know about automatic retention. But for the rest of your list of features I suggest to take a deep

Re: Creating an index from an XML file using Lucene in Java

2008-07-28 Thread Karsten F.
Hi Fayyaz, again, this is about SAX-Handler not about lucene. My understanding of what you want: 1. one lucene document for each SPEECH-Element (already implemented) 2. one lucene document for each SCENE-COMMENTARY-Element (not implemented yet). correct? If yes, you can write

Re: Creating an index from an XML file using Lucene in Java

2008-07-27 Thread Karsten F.
Hi Fayyaz, From my point of view, this is not a lucene question. If I understand your SAX-Handler correctly, you start a document with each speech-start-Tag and you end this document with each lines-close-Tag. So if you know that the SCENE-COMMENTARY Elements and the speech elements are

Re: deleting documents with doc id

2008-07-27 Thread Karsten F.
Hi, only to be sure: You know IndexModifier.deleteDocument(int)? It is deprecated, because you should use IndexWriter.deleteDocuments(Term[]). What do you mean by "index is committed"? If you mean optimize(), the document numbers will change (so there is a side effect ;-) Best regards Karsten

Re: Parametric/faceted Searching

2008-07-24 Thread Karsten F.
Hi, my question: How did ebay solve this problem? Take a look at the faceted browsing in the mark twain project: http://www.marktwainproject.org/xtf/search?keyword=Berlinstyle=mtp http://tinyurl.com/5cvb3c This solution is open source and from the xtf project (they use lucene).