how to load mmap directory into memory?

2014-12-02 Thread Li Li
I am using MMapDirectory in Lucene. My index is small (about 3GB on disk) and I have plenty of memory available. The problem is that when a term is first queried, it's slow. How can I load the whole directory into memory? One solution is using many queries to warm it up, but I can't query all terms
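One low-tech way to pull the whole index into the OS page cache without crafting warm-up queries is simply to read every index file once: MMapDirectory relies on the page cache, so a sequential read of the files warms it. This is a stdlib-only sketch, not a Lucene API; the `./index` path is a placeholder.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IndexWarmer {
    /** Sequentially reads every regular file under dir so the OS caches its pages.
     *  Returns the total number of bytes read. */
    public static long warm(Path dir) throws IOException {
        long total = 0;
        byte[] buf = new byte[1 << 20]; // 1 MB read buffer
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) continue;
                try (InputStream in = Files.newInputStream(f)) {
                    int n;
                    while ((n = in.read(buf)) != -1) total += n;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // "./index" is a placeholder path, not from the original mail
        Path dir = Paths.get(args.length > 0 ? args[0] : "./index");
        System.out.println(warm(dir) + " bytes warmed");
    }
}
```

A 3GB index read this way fits comfortably in page cache on a machine with spare RAM, so first-query latency should drop without any query-based warming.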

continuing gc when too many search threads

2014-09-19 Thread Li Li
I have an index of about 30 million short strings; the index size is about 3GB on disk. I have given the JVM 5GB memory with default settings, on Ubuntu 12.04 with Sun JDK 7. When I use 20 threads, it's OK. But if I run 30 threads, after a while the JVM is doing nothing but GC. The Lucene version is 4.10.0

Re: Why does this query slow down Lucene?

2012-08-15 Thread Li Li
how slow is it? are all your searches slow or only that query? how many docs are indexed, and what is the size of the indexes? what's the hardware configuration? you should describe it clearly to get help. On 2012-8-16 at 9:28 AM, zhoucheng2008 zhoucheng2...@gmail.com wrote: Hi, I have the string $21 a Day

Re: 回复: Why does this query slow down Lucene?

2012-08-15 Thread Li Li
use jstack <pid> to check for any deadlock. On Thu, Aug 16, 2012 at 10:09 AM, zhoucheng2008 zhoucheng2...@gmail.com wrote: The query has been stuck for more than an hour. The total size is less than 1G, and the number of docs is around 100,000. Hardware is ok as it works well with other much more

Re: 回复: Why does this query slow down Lucene?

2012-08-15 Thread Li Li
and also try jmap -heap <pid> to check whether it has run out of memory, or jstat -gcutil <pid> 1000. On Thu, Aug 16, 2012 at 10:09 AM, zhoucheng2008 zhoucheng2...@gmail.com wrote: The query has been stuck for more than an hour. The total size is less than 1G, and the number of docs is around 100,000.

questions about DocValues in 4.0 alpha

2012-08-06 Thread Li Li
hi everyone, in lucene 4.0 alpha, I found the DocValues are available and gave it a try. I am following the slides in http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene I have got 2 questions. 1. is DocValues updatable now? 2. How

Re: Question about chinese and WildcardQuery

2012-06-28 Thread Li Li
I don't understand why StandardAnalyzer does not work, since according to the ChineseAnalyzer deprecation note I should use StandardAnalyzer: @deprecated Use {@link StandardAnalyzer} instead, which has the same functionality. It is very annoying. 2012/6/27 Li Li fancye...@gmail.com standard analyzer

Re: Auto commit when flush

2012-06-28 Thread Li Li
flush is not commit. On Thu, Jun 28, 2012 at 2:42 PM, Aditya findbestopensou...@gmail.com wrote: Hi Ram, I guess IndexWriter.SetMaxBufferedDocs will help... Regards Aditya www.findbestopensource.com On Wed, Jun 27, 2012 at 11:25 AM, Ramprakash Ramamoorthy youngestachie...@gmail.com

Re: Lucene Query About Sorting

2012-06-27 Thread Li Li
what do you want to do? 1. sort all matched docs by field A. 2. sort all matched docs by relevant score, selecting top 100 docs and then sort by field A On Wed, Jun 27, 2012 at 1:44 PM, Yogesh patel yogeshpateldai...@gmail.com wrote: Thanks for reply Ian , But i just gave suppose document

Re: Question about chinese and WildcardQuery

2012-06-27 Thread Li Li
the standard analyzer will segment each character into a token; you should use the whitespace analyzer or your own analyzer that can tokenize it as one token for wildcard search. On 2012-6-27 at 6:20 PM, Paco Avila monk...@gmail.com wrote: Hi there, I have to index chinese content and I don't get the expected

Re: about .frq file format in doc

2012-06-27 Thread Li Li
lastDocID represents the last document which contains this term. Because it will reuse this FormatPostingsDocsConsumer, you need to clear all member variables in the finish method. On Thu, Jun 28, 2012 at 11:14 AM, wangjing ppm10...@gmail.com wrote: thanks could you help me to solve another problem,

Re: about .frq file format in doc

2012-06-27 Thread Li Li
On Thu, Jun 28, 2012 at 11:14 AM, wangjing ppm10...@gmail.com wrote: thanks could you help me to solve another problem, why will lucene reset lastDocID = 0 when it finishes adding one doc? it will not call finish after adding a document. Reading the JavaDoc of FormatPostingsDocsConsumer /**

Re: any good idea for loading fields into memory?

2012-06-22 Thread Li Li
), and retrieving documents would be fast enough simply because all data is in RAM. On Fri, Jun 22, 2012 at 3:56 AM, Li Li fancye...@gmail.com wrote: using a collector and field cache is a good idea for ranking by a certain field's value. but I just need to return matched documents' fields

RE: any good idea for loading fields into memory?

2012-06-22 Thread Li Li
to) what you need in it. -Paul -Original Message- From: Li Li [mailto:fancye...@gmail.com] our old map implementation use about 10 ms, while newer one is 40 ms. the reason is we need to return some fields of all hitted documents. the fields are not very long strings

Re: any good idea for loading fields into memory?

2012-06-21 Thread Li Li
, Jun 20, 2012 at 9:47 AM, Li Li fancye...@gmail.com wrote: but as I can remember, in 2.9.x FieldCache can only apply to indexed but not analyzed fields. On 2012-6-20 at 8:59 PM, Danil ŢORIN torin...@gmail.com wrote: I think you are looking for FieldCache. I'm not sure of the current status in 4x

Re: any good idea for loading fields into memory?

2012-06-21 Thread Li Li
Message- From: Li Li [mailto:fancye...@gmail.com] but as I can remember, in 2.9.x FieldCache can only apply to indexed but not analyzed fields. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org

any good idea for loading fields into memory?

2012-06-20 Thread Li Li
hi all, I need to return certain fields of all matched documents quickly. I am now using Document.get(field), but the performance is not good enough. Originally I used a HashMap to store these fields; it's much faster, but I have to maintain two storage systems. Now I am reconstructing this

Re: any good idea for loading fields into memory?

2012-06-20 Thread Li Li
20, 2012 at 3:49 PM, Li Li fancye...@gmail.com wrote: hi all I need to return certain fields of all matched documents quickly. I am now using Document.get(field), but the performance is not well enough. Originally I use HashMap to store these fields. it's much faster but I have

Re: any good idea for loading fields into memory?

2012-06-20 Thread Li Li
own Collector, using FieldCache is quite straight forward. On Wed, Jun 20, 2012 at 3:49 PM, Li Li fancye...@gmail.com wrote: hi all I need to return certain fields of all matched documents quickly. I am now using Document.get(field), but the performance is not well enough. Originally I

Re: lucene (search) performance tuning

2012-05-26 Thread Li Li
the FAQ. -- Ian. On Tue, May 22, 2012 at 2:08 AM, Li Li fancye...@gmail.com wrote: something went wrong when writing from my android client. if RAMDirectory does not help, I think the bottleneck is the CPU. you may try to tune the JVM but I don't expect much improvement. the best option is splitting

Re: Performance of storing data in Lucene vs other (No)SQL Databases

2012-05-21 Thread Li Li
what do you mean by performance of storage? lucene just stores all fields of a document (or columns of a row, in DB terms) together. it can only store strings; you can't store int or long (unless you convert them to strings). retrieving a given field of a document will cause many io operations. it's

Re: lucene (search) performance tuning

2012-05-21 Thread Li Li
On 2012-5-22 at 4:59 AM, Yang tedd...@gmail.com wrote: I'm trying to make my search faster. right now a query like name:Joe Moe Pizza address:77 main street city:San Francisco is this a conjunction query or a disjunction query? in an index with 20mil such short business descriptions (total size

Re: lucene (search) performance tuning

2012-05-21 Thread Li Li
is not fully used, you can do this on one physical machine. On 2012-5-22 at 8:50 AM, Li Li fancye...@gmail.com wrote: On 2012-5-22 at 4:59 AM, Yang tedd...@gmail.com wrote: I'm trying to make my search faster. right now a query like name:Joe Moe Pizza address:77 main street city:San Francisco

how to convert French letters to English?

2012-05-11 Thread Li Li
I have some French hotels such as Elysée Etoile. But many of our users can't type French letters, so they will type Elysee Etoile. is there any analyzer that can do this? thanks. - To unsubscribe, e-mail:

Re: how to convert French letters to English?

2012-05-11 Thread Li Li
at 11:01 AM, Li Li fancye...@gmail.com wrote: I have some French hotels such as Elysée Etoile. But many of our users can't type French letters, so they will type Elysee Etoile. is there any analyzer that can do this? thanks
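Lucene ships ASCIIFoldingFilter (and the older ISOLatin1AccentFilter) for exactly this case. As a stdlib-only sketch of the underlying idea: Unicode NFD decomposition splits accented characters into a base letter plus combining marks, and stripping the marks leaves plain ASCII, so "Elysée" matches a query for "Elysee".

```java
import java.text.Normalizer;

public class AccentFolder {
    /** Decompose accented characters (NFD) and drop the combining marks,
     *  so "Elysée" folds to "Elysee". */
    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", ""); // strip Unicode marks
    }

    public static void main(String[] args) {
        System.out.println(fold("Elysée Etoile")); // prints "Elysee Etoile"
    }
}
```

To use this at search time, the same folding must be applied on both the indexing and the query analysis chain, otherwise folded queries won't match unfolded terms.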

Re: Many keywords problem

2012-05-08 Thread Li Li
a disjunction (OR) query of so many terms is indeed slow. can you describe your real problem? why do you need the disjunction results of so many terms? On Sun, May 6, 2012 at 9:57 PM, qibaoy...@126.com qibaoy...@126.com wrote: Hi, I met a problem about how to search many keywords in about
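Under the hood a disjunction over many terms advances a heap of postings iterators, which is why cost grows with the number of terms. A minimal plain-Java sketch of that k-way merge over sorted doc-id lists (the arrays stand in for postings; this is an illustration of the technique, not Lucene's scorer code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class DisjunctionMerge {
    /** Returns the sorted union of doc ids from many sorted postings lists,
     *  the way a disjunction scorer advances a heap of sub-iterators. */
    public static List<Integer> union(int[][] postings) {
        // heap entries: {docId, listIndex, offsetInList}
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int i = 0; i < postings.length; i++)
            if (postings[i].length > 0) heap.add(new int[]{postings[i][0], i, 0});
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            if (result.isEmpty() || result.get(result.size() - 1) != top[0])
                result.add(top[0]); // emit each matching doc once
            if (top[2] + 1 < postings[top[1]].length) // advance that sub-iterator
                heap.add(new int[]{postings[top[1]][top[2] + 1], top[1], top[2] + 1});
        }
        return result;
    }
}
```

Every matching document costs a heap pop plus a push per contributing term, so thousands of OR'd terms multiply the per-doc work; that is the slowness the reply describes.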

Re: Re: Many keywords problem

2012-05-08 Thread Li Li
describe the question clearly. At 2012-05-08 18:44:13, Li Li fancye...@gmail.com wrote: a disjunction (OR) query of so many terms is indeed slow. can you describe your real problem? why do you need the disjunction results of so many terms? On Sun, May 6, 2012 at 9:57 PM, qibaoy

Re: Re: Many keywords problem

2012-05-08 Thread Li Li
But this only gets (term1 or term2 or term3 ...). you can't implement (term1 or term2 ...) and (term3 or term4) by this method. maybe you should write your own Scorer to deal with this kind of query. On Tue, May 8, 2012 at 9:44 PM, Li Li fancye...@gmail.com wrote: disjunction query is much

Re: Lucene Question about Query

2012-05-06 Thread Li Li
what's your analyzer? if you use the standard analyzer, I think this won't happen. if you want an exact match on the name field, you should index this field but not analyze it. On Mon, May 7, 2012 at 11:59 AM, Yogesh patel yogeshpateldai...@gmail.com wrote: Hi I am using lucene for search

Re: lucene algorithm ?

2012-04-27 Thread Li Li
On Thu, Apr 26, 2012 at 5:13 AM, Yang tedd...@gmail.com wrote: I read the paper by Doug, 'Space optimizations for total ranking'; since it was written a long time ago, I wonder what algorithms lucene uses (regarding postings list traversal and score calculation, ranking), particularly the

Re: two fields, the first important than the second

2012-04-27 Thread Li Li
. Ákos On Fri, Apr 27, 2012 at 5:17 AM, Li Li fancye...@gmail.com wrote: sorry for some typos. original query +(title:hello desc:hello) +(title:world desc:world) boosted one   +(title:hello^2 desc:hello) +(title:world^2 desc:world) last one     +(title:hello desc:hello) +(title:world

Re: Indexing with Semantics

2012-04-27 Thread Li Li
stemmer. semantic is a big word; take care using it. On Sat, Apr 28, 2012 at 11:02 AM, Kasun Perera kas...@opensource.lk wrote: I'm using Lucene's Term Freq vector to calculate cosine similarity between documents. Say my documents have these 3 terms: owe, owed, owing. Lucene takes these as 3 separate

Re: two fields, the first important than the second

2012-04-26 Thread Li Li
you should describe your ranking strategy more precisely. if the query has 2 terms, hello and world for example, and your search fields are title and description, there are many possible combinations. Here is my understanding: both terms should occur in title or desc; the query may be

Re: two fields, the first important than the second

2012-04-26 Thread Li Li
has two terms. if it has more terms, the query will become too complicated. On Fri, Apr 27, 2012 at 11:12 AM, Li Li fancye...@gmail.com wrote: you should describe your ranking strategy more precisely. if the query has 2 terms, hello and world for example, and your search fields are title

Re: why is the advance(int target) function of DocIdSetIterator defined with uncertain behavior?

2012-04-18 Thread Li Li
small addition, I'll post it in comments soon). By using it I have disjunction summing query with steady subscorers. Regards On Tue, Apr 17, 2012 at 2:37 PM, Li Li fancye...@gmail.com wrote: hi all, I am now hacking the BooleanScorer2 to let it keep the docID() of the leaf scorer(mostly

why is the advance(int target) function of DocIdSetIterator defined with uncertain behavior?

2012-04-17 Thread Li Li
hi all, I am now hacking the BooleanScorer2 to let it keep the docID() of the leaf scorer(mostly possible TermScorer) the same as the top-level Scorer. Why I want to do this is: When I Collect a doc, I want to know which term is matched(especially for BooleanClause whose Occur is SHOULD). we

Re: why is the advance(int target) function of DocIdSetIterator defined with uncertain behavior?

2012-04-17 Thread Li Li
some corrections to the example: after the first call advance(5), currentDoc=6; the first scorer's nextDoc has already been called in advance, and the heap is empty now. then call advance(6): because scorerDocQueue.size() < minimumNrMatchers, it just returns NO_MORE_DOCS On Tue, Apr 17, 2012 at 6:37 PM, Li Li

Re: why is the advance(int target) function of DocIdSetIterator defined with uncertain behavior?

2012-04-17 Thread Li Li
is absolutely useful (with one small addition, I'll post it in comments soon). By using it I have disjunction summing query with steady subscorers. Regards On Tue, Apr 17, 2012 at 2:37 PM, Li Li fancye...@gmail.com wrote: hi all, I am now hacking the BooleanScorer2 to let it keep the docID

Re: ToParentBlockJoinQuery query loop finitely

2012-03-23 Thread Li Li
. Try putting the parent (shirt) document last in each case instead... Query-time join is already committed to trunk and 3.x, so it'll be in 3.6.0/4.0. Mike McCandless http://blog.mikemccandless.com On Fri, Mar 23, 2012 at 12:27 AM, Li Li fancye...@gmail.com wrote: hi all, I read

Re: ToParentBlockJoinQuery query loop finitely

2012-03-23 Thread Li Li
join filtering and the block join is more meant for parent / child search. Martijn On 23 March 2012 11:58, Li Li fancye...@gmail.com wrote: thank you. is there any the search time join example? I can only find a JoinUtil in package org.apache.lucene.search.join and a TestJoinUtil in test

ToParentBlockJoinQuery query loop finitely

2012-03-22 Thread Li Li
hi all, I read these two articles http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html, http://blog.mikemccandless.com/2012/01/tochildblockjoinquery-in-lucene.html and wrote a test program. But it seems there is some problem: it ends up in an endless loop. Here is my

Re: combine results from multiple queries sort

2012-03-14 Thread Li Li
it's a very common problem. many of our users (including programmers familiar with SQL) have the same question. compared with SQL, all queries in lucene are based on the inverted index. fortunately, when searching, we can provide a Filter. from the source code of the searchWithFilter function we can

Re: Updating a document.

2012-03-04 Thread Li Li
if you want to identify a document, you should use a field such as url as a unique key in solr. On Mon, Mar 5, 2012 at 12:31 AM, Benson Margulies bimargul...@gmail.com wrote: I am walking down the document in an index by number, and I find that I want to update one. The updateDocument API only

Re: Updating a document.

2012-03-04 Thread Li Li
document ids are subject to change, and every segment's document ids start from zero. after a merge, document ids will also change. On Mon, Mar 5, 2012 at 12:31 AM, Benson Margulies bimargul...@gmail.com wrote: I am walking down the document in an index by number, and I find that I want

Re: Lucene performance in 64 Bit

2012-03-01 Thread Li Li
I think many users of lucene use large memory because a 32-bit system's memory is too limited (Windows 1.5GB, Linux 2-3GB). the only noticeable thing is Compressed Oops. some say it's useful, some not. you should give it a try. On Thu, Mar 1, 2012 at 4:59 PM, Ganesh emailg...@yahoo.co.in wrote:

Re: How to separate one index into multiple?

2012-02-19 Thread Li Li
I think you could do as follows, taking splitting into 3 indexes as an example. copy the index 3 times. for copy 1: for (int i = 0; i < reader1.maxDoc(); i += 3) { reader1.deleteDocument(i); } for copy 2: for (int i = 1; i < reader2.maxDoc(); i += 3) { reader2.deleteDocument(i); } and then optimize these

Re: How to separate one index into multiple?

2012-02-19 Thread Li Li
you can delete by query, like -category:category1. On Sun, Feb 19, 2012 at 9:41 PM, Li Li fancye...@gmail.com wrote: I think you could do as follows, taking splitting into 3 indexes as an example. copy the index 3 times. for copy 1: for (int i = 0; i < reader1.maxDoc(); i += 3

Re: effectiveness of compression

2012-02-15 Thread Li Li
for now lucene doesn't provide anything like this. maybe you can diff each version before adding them to the index, so it just indexes and stores the difference for the newer version. On Wed, Feb 15, 2012 at 4:25 PM, Jamie ja...@stimulussoft.com wrote: Greetings All. I'd like to index data corresponding

Re: Paid Job: Looking for a developer to create a small java application to extract url's from .fdt files

2012-02-13 Thread Li Li
for 2.x and 3.x you can simply use this code: Directory dir = FSDirectory.open(new File("./testindex")); IndexReader reader = IndexReader.open(dir); List<String> urls = new ArrayList<String>(reader.numDocs()); for (int i = 0; i < reader.maxDoc(); i++) { if (!reader.isDeleted(i)) { Document

Re: How best to handle a reasonable amount to data (25TB+)

2012-02-07 Thread Li Li
it depends on your machines. in our application, we index about 30,000,000 (30M) docs/shard, and the response time is about 150ms. our machine has about 48GB memory; about 25GB is allocated to solr and the rest is used for the disk cache in Linux. extrapolating from our application, indexing 1.25T docs will

Re: Size of lucene norm file

2011-09-18 Thread Li Li
docNum * indexedFieldsNum * 1 byte. you should disable norms for indexed fields which are not used for relevance ranking. On Sun, Sep 18, 2011 at 5:20 AM, roz dev rozde...@gmail.com wrote: Hi, I want to estimate the size of the NORM file that lucene will generate for a 20GB index which has 2.5 million docs
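The reply's estimate, one byte per document per indexed field (the pre-4.x norms encoding), worked out for the numbers in the question; the field count of 10 is an assumption for illustration, since the question doesn't give one.

```java
public class NormSizeEstimate {
    /** One byte per document per indexed field (classic norms encoding). */
    public static long normBytes(long docCount, int indexedFields) {
        return docCount * indexedFields;
    }

    public static void main(String[] args) {
        // 2.5M docs with, say, 10 indexed fields -> 25,000,000 bytes (~24 MB) of norms
        System.out.println(normBytes(2_500_000L, 10) + " bytes");
    }
}
```

The estimate is independent of the 20GB index size; only the doc count and the number of fields carrying norms matter, which is why disabling norms on fields not used for ranking shrinks the file directly.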

What will happen when one thread is closing a searcher while another is searching?

2011-09-05 Thread Li Li
hi all, I am using spellcheck in solr 1.4. I found that spell check is not implemented as SolrCore. in SolrCore, it uses reference count to track current searcher. oldSearcher and newSearcher will both exist if oldSearcher is servicing some query. But in FileBasedSpellChecker public void

Re: Question about MaxFieldLength

2011-08-27 Thread Li Li
It will affect the entire index because it's a parameter of IndexWriter, but you can modify it any time you like before IndexWriter.addDocument. If you want to truncate different fields with different max lengths, you should avoid multithreaded race conditions. maybe you can add a TokenFilter

what's the status of droids project(http://incubator.apache.org/droids/)?

2011-08-23 Thread Li Li
hi all, I am interested in vertical crawlers. But it seems this project is not very active; its last update was 11/16/2009

Re: How to make Lucene effective for video retrieval?

2011-08-19 Thread Li Li
if there is only text information, your video search is just normal full-text search. but I think you should consider more about ranking, facet search, etc. On Fri, Aug 19, 2011 at 1:05 PM, Lei Pang cactus...@gmail.com wrote: Hi everyone, I want to use Lucene to retrieve videos through their meta

Re: How to make Lucene effective for video retrieval?

2011-08-19 Thread Li Li
ranking functions, such as BM25 or Language Model, rather than Lucene's original ranking function? Thank you. On Fri, Aug 19, 2011 at 2:37 PM, Li Li fancye...@gmail.com wrote: if there are only text information, your video search is just normal full text search. but I think you should consider

Re: full text searching in cloud for minor enterprises

2011-07-06 Thread Li Li
: Look at searchblox On Monday, July 4, 2011, Li Li fancye...@gmail.com wrote: hi all, I want to provide full text searching for some small websites. It seems cloud computing is popular now. And it will save costs because it doesn't require employing an engineer to maintain the machines. For now

full text searching in cloud for minor enterprises

2011-07-04 Thread Li Li
hi all, I want to provide full text searching for some small websites. It seems cloud computing is popular now. And it will save costs because it doesn't require employing an engineer to maintain the machines. For now, there are many services such as Amazon S3, Google App Engine, MS Azure, etc. I am

Re: a faster way to addDocument and get the ID just added?

2011-03-30 Thread Li Li
merging will also change docids; every segment's docIds begin at 0. 2011/3/30 Trejkaz trej...@trypticon.org: On Tue, Mar 29, 2011 at 11:21 PM, Erick Erickson erickerick...@gmail.com wrote: I'm always skeptical of storing the doc IDs since they can change out from underneath you (just delete even a

Re: Too many open files error

2011-03-23 Thread Li Li
use lsof to count the number of open files and ulimit to modify the limit. you may need to ask the administrator to modify limits.conf. 2011/3/23 Vo Nhu Tuan vonhut...@gmail.com: Hi, Can someone help me with this problem please? I got these when running my program: java.io.FileNotFoundException:

Re: Too many open files error

2011-03-23 Thread Li Li
and also try using compound files (cfs) 2011/3/23 Vo Nhu Tuan vonhut...@gmail.com: Hi, Can someone help me with this problem please? I got these when running my program: java.io.FileNotFoundException: /Users/vonhutuan/Documents/workspace/InformationExtractor/index_wordlist/_i82.frq

Re: I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Li Li
to use some synchronization mechanism to allow only 1 or 2 ReplicationHandler threads are doing CMD_GET_FILE command. Is that solution feasible? 2011/3/11 Li Li fancye...@gmail.com hi it seems my mail is judged as spam. Technical details of permanent failure: Google tried to deliver

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Li Li
http://java-source.net/open-source/html-parsers 2011/3/11 shrinath.m shrinat...@webyog.com I am trying to index content withing certain HTML tags, how do I index it ? Which is the best parser/tokenizer available to do this ? -- View this message in context:

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Li Li
these parsers when crawling and save only the parsed result. HtmlUnit is also a good tool for this purpose; it supports JavaScript and parsing web pages. 2011/3/11 shrinath.m shrinat...@webyog.com Thank you Li Li. Two questions: 1. Is there anything *in* *Lucene* that I need to know of? some

Re: I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Li Li
I don't use any client but browser. 2011/3/11 Erick Erickson erickerick...@gmail.com What mail client are you using? I also had this problem and it's solved in Gmail by sending the mail as plain text rather than Rich formatting. Best Erick On Fri, Mar 11, 2011 at 4:35 AM, Li Li fancye

Re: I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Li Li
I used plain text and sent successfully. thanks. 2011/3/11 Erick Erickson erickerick...@gmail.com: What mail client are you using? I also had this problem and it's solved in Gmail by sending the mail as plain text rather than Rich formatting. Best Erick On Fri, Mar 11, 2011 at 4:35 AM, Li

Re: Detecting duplicates

2011-03-05 Thread Li Li
it's indeed very slow because it does collapsing over all matched documents. we tackled this problem by doing collapsing over the top 100 documents only. 2011/3/6 Mark static.void@gmail.com I'm familiar with Deduplication however I do not wish to remove my duplicates and my needs are slightly different.

Re: Detecting duplicates

2011-03-04 Thread Li Li
this is the problem of near-duplicate detection. there are many papers addressing it; methods like simhash are used. 2011/3/5 Mark static.void@gmail.com Is there a way one could detect duplicates (say by using some unique hash of certain fields) and marking a document as a
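A minimal simhash sketch in plain Java: each token votes on every bit of a 64-bit signature, and near-duplicate texts produce signatures with a small Hamming distance. The FNV-1a token hash and whitespace tokenization here are illustrative choices, not what any particular paper or Lucene component prescribes.

```java
public class SimHash {
    /** 64-bit FNV-1a hash of a token (deterministic, stdlib-free). */
    private static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** 64-bit simhash over whitespace tokens: each token votes +1/-1 per bit,
     *  and the sign of each bit's tally becomes the signature bit. */
    public static long simhash(String text) {
        int[] votes = new int[64];
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (tok.isEmpty()) continue;
            long h = fnv1a64(tok);
            for (int b = 0; b < 64; b++)
                votes[b] += ((h >>> b) & 1) == 1 ? 1 : -1;
        }
        long sig = 0;
        for (int b = 0; b < 64; b++)
            if (votes[b] > 0) sig |= 1L << b;
        return sig;
    }

    /** Near-duplicates have a small Hamming distance between signatures. */
    public static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

In practice documents whose signatures are within a few bits of each other (often 3 out of 64) are flagged as near-duplicates, which turns fuzzy duplicate detection into a cheap bitwise comparison at search time.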

Re: I can't post email to d...@lucene.apache.org maillist

2011-02-16 Thread Li Li
thank you. I got it. 2011/2/16 Chris Hostetter hossman_luc...@fucit.org: : I used to receive the email myself because I subscribe the maillist. : but recently if I post a email to the maillist, I can't receive the : email posted by me. So I thought I failed to post this email. I notice you

I can't post email to d...@lucene.apache.org maillist

2011-02-15 Thread Li Li
hi all is there any limit to post email to this maillist now? thanks - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: I can't post email to d...@lucene.apache.org maillist

2011-02-15 Thread Li Li
mean? Bounced as spam? rejected for other reasons? This question came through so obviously you can post something I found that sending mail as plain text kept the spam filter from kicking in. Best Erick On Tue, Feb 15, 2011 at 7:29 AM, Li Li fancye...@gmail.com wrote: hi all

Re: Lucene: how to get frequency of Boolean query

2010-12-26 Thread Li Li
do you mean getting the tf of the hit documents when searching? it's a difficult problem because only TermScorer has TermDocs and uses tf in its score() function. and we can't know whether a doc is selected because we use a priorityQueue in TopScoreDocCollector public void collect(int doc)

Re: Where does Lucene recognise it has encountered a new term for the first time?

2010-12-15 Thread Li Li
I don't understand your problem well, but knowing when a new term occurs is a hard problem: when a new document is added, it is added to a new segment. I think you can only do this in the last merge of the optimization stage. You can read the code in SegmentMerger.mergeTermInfos().

Re: instantiated contrib

2010-08-26 Thread Li Li
if you only load 10% (7k)? Did you see the graphics in the package level javadocs? http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/store/instantiated/package-summary.html        karl 26 aug 2010 kl. 09.24 skrev Li Li: I have about 70k document, the total indexed size is about

Re: instantiated contrib

2010-08-26 Thread Li Li
10% (7k)? Did you see the graphics in the package level javadocs? http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/store/instantiated/package-summary.html        karl 26 aug 2010 kl. 09.24 skrev Li Li: I have about 70k document, the total indexed size is about 15MB

how to adjust buffer size of reading file?

2010-08-05 Thread Li Li
I found that in the system calls made by java when reading files, the buffer size is always 1024. Can I modify this value to reduce system calls? - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail:

will load fdx into memory make search faster?

2010-08-05 Thread Li Li
hi all, we analyzed the system calls of lucene and found that the fdx file is always read when we get field values. In my application the fdt is about 50GB and the fdx is about 120MB. I think it may be beneficial to load the fdx into memory just like the tii. Has anyone else tried this?

Re: will load fdx into memory make search faster?

2010-08-05 Thread Li Li
. Though, how many docs are you typically retrieving per search? Mike On Thu, Aug 5, 2010 at 3:37 AM, Li Li fancye...@gmail.com wrote: hi all    we analyze system call of lucene and find that the fdx file is always read when we get field values. In my application the fdt is about 50GB and fdx

Re: understanding lucene

2010-07-27 Thread Li Li
lucene in action 2nd ed. is a good book 2010/7/28 Yakob jacob...@opensuse-id.org: hello everyone, I am starting to understand lucene in java and I am having a hard time in implementing it. I am trying to develop a java application that can do indexing, searching and whatnot. and using lucene

is there any resource for improve lucene index/search performance

2010-07-20 Thread Li Li
Or where can I find improvement proposals for lucene? e.g. I want to change floating-point multiplication to integer multiplication, or use bitmaps for high-frequency terms, or something else like this. Is there any place where I can find such resources or people? thanks.

Cache full text into memory

2010-07-14 Thread Li Li
I want to cache full text in memory to improve performance. Full text is only used for highlighting in my application (but it's very time consuming; my avg query time is about 250ms, and I guess it would cost about 50ms if I just get the top 10 full texts. Things get worse when getting more full text because

How to manage resource out of index?

2010-07-07 Thread Li Li
I used to store full text in the lucene index, but I found it's very slow when merging the index because when merging 2 segments it copies the fdt files into a new one. So I want to index, but not store, the full text. But when searching I need the full text for applications such as highlighting and viewing full text. I can

Re: How to manage resource out of index?

2010-07-07 Thread Li Li
to merge very large indexes anyway. when your system grows / you go into production you'll probably split the indexes too to use solr's distributed search func. for the sake of query speed). hope that helps, bec :) On 7 July 2010 14:07, Li Li fancye...@gmail.com wrote: I used to store full

Fwd: index format error because disk full

2010-07-06 Thread Li Li
-- Forwarded message -- From: Li Li fancye...@gmail.com Date: 2010/7/7 Subject: index format error because disk full To: solr-u...@lucene.apache.org the index file is ill-formated because disk full when feeding. Can I roll back to last version? Is there any method to avoid

Re: index format error because disk full

2010-07-06 Thread Li Li
Yes. On 2010-07-07 at 10:46 AM, jg lin linji...@gmail.com wrote: Can you speak Chinese? (⊙_⊙) 2010/7/7 Li Li fancye...@gmail.com -- Forwarded message -- From: Li Li fancye...@gmail.com Date: 2010/7/7 Subject: index format error because disk full To: solr-u...@lucene.apache.org the index file is ill

Re: index format error because disk full

2010-07-06 Thread Li Li
Thanks. On 2010-07-07 at 10:53 AM, jg lin linji...@gmail.com wrote: Ask in QQ group 18038594; I don't know the answer to your question. On 2010-07-07 at 10:48 AM, Li Li fancye...@gmail.com wrote: Yes. On 2010-07-07 at 10:46 AM, jg lin linji...@gmail.com wrote: Can you speak Chinese? (⊙_⊙) 2010/7/7 Li Li fancye...@gmail.com -- Forwarded message -- From: Li

about TokenSources.getTokenStream and highlighter

2010-06-12 Thread Li Li
hi all, when using the highlighter, we must provide a tokenStream and the original text. To get a tokenStream, we can either reanalyze the original text or use the saved TermVector to reconstruct it. In my application, highlighting costs 200-300ms on average, and I want to optimize it to below

how to patch?

2010-06-12 Thread Li Li
I want to use the fast highlighter in solr1.4 and found an issue at https://issues.apache.org/jira/browse/SOLR-1268. Attachment: SOLR-1268.patch, attached 2010-02-05 10:32 PM by Koji

how to get term position of a document?

2010-06-06 Thread Li Li
I want to override the TermScorer.score() method to take position info into account when scoring, e.g. any occurrence whose position is less than 100 gets a boost. The original score method: public float score() { assert doc != -1; int f = freqs[pointer]; float raw =
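A pure-Java sketch of the intended change, outside Lucene: combine the usual square-root tf factor with a position-based boost. The 100-position cutoff comes from the question; the 1.5 boost factor and the method names are hypothetical, chosen only to illustrate the shape of the override.

```java
public class PositionBoost {
    /** Classic Lucene-style tf factor: sqrt(freq). */
    public static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Boost the raw term score when the term's first occurrence is early.
     *  Stands in for the body of an overridden TermScorer.score(). */
    public static float score(int freq, int firstPosition) {
        float raw = tf(freq);                            // stand-in for the raw score
        float boost = firstPosition < 100 ? 1.5f : 1.0f; // hypothetical boost factor
        return raw * boost;
    }
}
```

The real difficulty, as the question implies, is that TermScorer reads frequencies but not positions; getting positions into score() means reading the term positions (TermPositions in 3.x) alongside the freqs, which costs extra I/O per scored document.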

Re: is there any resources that explain detailed implementation of lucene?

2010-06-03 Thread Li Li
to solve? Best Erick On Wed, Jun 2, 2010 at 8:54 PM, Li Li fancye...@gmail.com wrote: such as the detailed process of store data structures, index, search and sort. not just apis. thanks. - To unsubscribe, e-mail: java-user

Re: What's DisjunctionMaxQuery ?

2010-06-03 Thread Li Li
/java/2_0_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html. Itamar. -Original Message- From: Li Li [mailto:fancye...@gmail.com] Sent: Tuesday, June 01, 2010 11:42 AM To: java-user@lucene.apache.org Subject: What's DisjunctionMaxQuery ? anyone could show me some detail

Re: how to extend Similarity in this situation?

2010-06-02 Thread Li Li
thank you. 2010/6/2 Rebecca Watson bec.wat...@gmail.com: Hi Li Li, If you want to support some query types and not others you should override/extend the query parser so that it throws an exception / makes a different query type instead. Similarity doesn't do the actual scoring; it's used

about norm

2010-06-02 Thread Li Li
in the javadoc http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_norm norm(t,d) = doc.getBoost() · lengthNorm(field) · ∏ f.getBoost() (product over each field f in d named t). where does field come from in

is there any resources that explain detailed implementation of lucene?

2010-06-02 Thread Li Li
such as the detailed process of store data structures, index, search and sort. not just apis. thanks. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail:

What's DisjunctionMaxQuery ?

2010-06-01 Thread Li Li
anyone could show me some detail information about it ? thanks - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

how to extend Similarity in this situation?

2010-06-01 Thread Li Li
I want to support only boolean OR queries (as many search engines do), but I want to boost documents whose terms are closer together. e.g. the query terms are 'apache lucene'. doc1: apache has many projects such as lucene. doc2: The Apache HTTP Server Project is an effort to develop and maintain an ... Lucene is a
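One building block for boosting docs whose query terms are close together is the minimum distance between the two terms' position lists (the positions Lucene records per term per document). A stdlib-only sketch, assuming both lists are in ascending order so a single linear pass suffices; the positions in the usage example are hand-counted from the question's doc1.

```java
import java.util.Arrays;
import java.util.List;

public class Proximity {
    /** Minimum absolute distance between any position of term a and any of term b.
     *  Both lists must be ascending; runs in O(|a| + |b|). */
    public static int minDistance(List<Integer> aPos, List<Integer> bPos) {
        int best = Integer.MAX_VALUE;
        int i = 0, j = 0;
        while (i < aPos.size() && j < bPos.size()) {
            int d = Math.abs(aPos.get(i) - bPos.get(j));
            best = Math.min(best, d);
            // advance whichever pointer lags; the other pairing can only be farther
            if (aPos.get(i) < bPos.get(j)) i++; else j++;
        }
        return best;
    }

    public static void main(String[] args) {
        // doc1: "apache has many projects such as lucene" -> apache@0, lucene@6
        System.out.println(minDistance(Arrays.asList(0), Arrays.asList(6))); // prints 6
    }
}
```

A custom Similarity or Scorer could then fold 1/minDistance (or a similar decay) into the OR query's score so doc1, where the terms sit close, outranks doc2, where they are far apart.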

Question about Field.setOmitTermFreqAndPositions(true)

2010-05-31 Thread Li Li
I read in 'Lucene in Action' that to save space, we can omit term freq and position information. But as far as I know, lucene's default scoring model is VSM, which needs tf(term,doc) to calculate the score. If no tf is saved, will the relevance score be correct?
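The effect can be seen in a classic tf-idf term weight: with freq omitted there is no tf(term,doc) to read, so the tf component effectively collapses to 1 and documents with many occurrences of a term no longer score higher. A sketch shaped after Lucene's practical scoring function (tf · idf²); treating omitted tf as a constant 1 is an illustration of the consequence, not the literal codepath.

```java
public class TfOmission {
    /** Term weight with tf(t,d) = sqrt(freq), as in the classic formula.
     *  When tf is omitted, there is no freq to read, so tf behaves as 1. */
    public static double termWeight(int freq, double idf, boolean omitTf) {
        double tf = omitTf ? 1.0 : Math.sqrt(freq);
        return tf * idf * idf; // tf(t,d) * idf(t)^2
    }

    public static void main(String[] args) {
        // a term occurring 9 times vs. its weight with tf omitted
        System.out.println(termWeight(9, 2.0, false)); // prints 12.0
        System.out.println(termWeight(9, 2.0, true));  // prints 4.0
    }
}
```

So the score is not "wrong" so much as degraded to presence/absence ranking for that field, which is why omitting tf is recommended only for fields where frequency carries no relevance signal (ids, flags, single-token keys).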

Re: Question about Field.setOmitTermFreqAndPositions(true)

2010-05-31 Thread Li Li
What about TermVector? 'Lucene in Action' says: Term vectors are something of a mix between an indexed field and a stored field. They are similar to a stored field because you can quickly retrieve all term vector fields for a given document: term vectors are keyed first by document ID. But

Re: how to reuse a tokenStream?

2010-05-28 Thread Li Li
MyFilter(tokenStream); return stream; } } 2010/5/28 Erick Erickson erickerick...@gmail.com: What is the problem you're seeing? Maybe a stack trace? You haven't told us what the incorrect behavior is. Best Erick On Fri, May 28, 2010 at 12:52 AM, Li Li fancye...@gmail.com wrote

how to reuse a tokenStream?

2010-05-27 Thread Li Li
I want to analyze a text twice so that I can get some statistical information from the text TokenStream tokenStream=null; Analyzer wa=new WhitespaceAnalyzer(); try { tokenStream =
