Re: Lucene does NOT use UTF-8.
Hi, Ken, Thanks for your email. You are right, what I meant to propose was that Lucene switch to using true UTF-8, rather than having to work around this issue by fixing the resulting problems elsewhere. Also, conforming to standards like UTF-8 will make the code easier for new developers to pick up. Just my 2 cents. Thanks, Jian On 8/27/05, Ken Krugler <[EMAIL PROTECTED]> wrote: > > >On Aug 26, 2005, at 10:14 PM, jian chen wrote: > > > >>It seems to me that in theory, Lucene storage code could use true UTF-8 > to > >>store terms. Maybe it is just a legacy issue that the modified UTF-8 is > >>used? > > The use of 0xC0 0x80 to encode the U+0000 Unicode code point is an > aspect of Java serialization of character streams. Java uses what > they call "a modified version of UTF-8", though that's a really bad > way to describe it. It's a different Unicode encoding, one that > resembles UTF-8, but that's it. > > >It's not a matter of a simple switch. The VInt count at the head of > >a Lucene string is not the number of Unicode code points the string > >contains. It's the number of Java chars necessary to contain that > >string. Code points above the BMP require 2 java chars, since they > >must be represented by surrogate pairs. The same code point must be > >represented by one character in legal UTF-8. > > > >If Plucene counts the number of legal UTF-8 characters and assigns > >that number as the VInt at the front of a string, when Java Lucene > >decodes the string it will allocate an array of char which is too > >small to hold the string. > > I think Jian was proposing that Lucene switch to using a true UTF-8 > encoding, which would make things a bit cleaner. And probably easier > than changing all references to CESU-8 :) > > And yes, given that the integer count is the number of UTF-16 code > units required to represent the string, your code will need to do a > bit more processing when calculating the character count, but that's > a one-liner, right? > > -- Ken > -- > Ken Krugler > TransPac Software, Inc. > <http://www.transpac.com> > +1 530-470-9200 > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
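As an aside, a minimal, self-contained Java sketch (mine, not from the thread) that makes Ken's point concrete: standard UTF-8 and Java's "modified UTF-8" (what DataOutputStream.writeUTF produces) yield different byte sequences for U+0000 and for supplementary code points.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class Utf8VsModifiedUtf8 {
        public static void main(String[] args) throws Exception {
            // U+0000 followed by U+10400, which Java stores as the surrogate pair \uD801\uDC00
            String s = "\u0000\uD801\uDC00";

            // Standard UTF-8: U+0000 -> 1 byte (0x00), U+10400 -> 4 bytes, total 5 bytes.
            byte[] standard = s.getBytes("UTF-8");

            // Modified UTF-8: U+0000 -> 0xC0 0x80, each surrogate -> 3 bytes, total 8 bytes
            // (writeUTF also prepends a 2-byte length).
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);

            System.out.println("standard UTF-8 bytes: " + standard.length);
            System.out.println("writeUTF bytes (incl. 2-byte length): " + bos.toByteArray().length);
        }
    }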
Re: Eliminating norms ... completely
Hi, Chris, Turning off norms looks like a very interesting problem to me. I remember that in the Lucene Road Map for 2.0, there is a requirement to turn off indexing for some information, such as proximity. Maybe optionally turning off norms could be an experiment to showcase how to turn off proximity down the road. Looking at the Lucene source code, it seems to me that the code could be further improved, bringing it closer to good OO design. For example, abstract classes could be changed to interfaces if possible, using accessor methods like getXXX() instead of public member variables, etc. My hunch is that the changes would add clarity of style to the code and wouldn't be a real performance drawback. Just my thoughts. For the sake of backward compatibility, these thoughts may not be that valuable though. Cheers, Jian On 10/7/05, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > > Yonik and I have been looking at the memory requirements of an application > we've got. We use a lot of indexed fields, primarily so I can do a lot > of numeric tests (using RangeFilter). When I say "a lot" I mean > around 8,000 -- many of which are not used by all documents in the index. > > Now there are some basic usage changes I can make to cut this number in > half, and some more complex biz rule changes I can make to get the number > down some more (at the expense of flexibility) but even then we'd have > around 1,000 -- which is still a lot more than the recommended "handful" > > After discussing some options, I asked the question "Remind me again why > having lots of indexed fields makes the memory requirements jump up -- > even if only a few documents use some field?" and Yonik reminded me about > the norm[] -- an array of bytes representing the field boost + length > boost for each document. One of these arrays exists for every indexed > field. > > So then I asked the $50,000,000 question: "Is there any way to get rid of > this array for certain fields? ... or any way to get rid of it completely > for every field in a specific index?" > > This may sound like a silly question for most IR applications where you > want length normalization to contribute to your scores, but in this > particular case most of these fields are only used to store a single numeric > value. To be certain, there are some fields we have (or may add in the > future) that could benefit from having a norms[] ... but if it had to be > an all or nothing thing we could certainly live without them. > > It seems to me, that in an ideal world, deciding whether or not you wanted > to store norms for a field would be like deciding whether you wanted to > store TermVectors for a field. I can imagine a Field.isNormStored() > method ... but that seems like a pretty significant change to the existing > code base. > > > Alternately, I started wondering if it would be possible to write our own > IndexReader/IndexWriter subclasses that would ignore the norm info > completely (with maybe an optional list of field names the logic should be > limited to), and return nothing but fixed values for any parts of the code > base that wanted them. Looking at SegmentReader and MultiReader this > looked very promising (especially considering the way SegmentReader uses a > system property to decide which actual class to use). 
But I was less > enthusiastic when I started looking at IndexWriter and the DocumentWriter > classes; there doesn't seem to be any clean way to subclass the > existing code base to eliminate the writing of the norms to the Directory > (curses those final classes, and private final methods). > > So I'm curious what you guys think... > > 1) Regarding the root problem: are there any other things you can think > of besides norms[] that would contribute to the memory footprint > needed by a large number of indexed fields? > 2) Can you think of a clean way for individual applications to eliminate > norms (via subclassing the lucene code base - ie: no patching) > 3) Yonik is currently looking into what kind of patch it would take to > optionally turn off norms (I'm not sure if he's looking at doing it > "per field" or "per index"). Is that the kind of thing that would > even be considered for getting committed? > > -- > > --- > "Oh, you're a tricky one." Chris M Hostetter > -- Trisha Weir [EMAIL PROTECTED] > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
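A back-of-the-envelope sketch (mine, with made-up index sizes) of why the norms dominate memory here: Lucene keeps one norm byte per document per indexed field, so once loaded the cost is roughly maxDoc * numIndexedFields bytes, regardless of how few documents actually use a given field.

    public class NormsFootprint {
        public static void main(String[] args) {
            long maxDoc = 5000000L;        // hypothetical number of documents
            long indexedFields = 8000;     // the "a lot" from Chris's mail
            long bytes = maxDoc * indexedFields;   // one norm byte per doc per indexed field
            System.out.println("approx norms memory: " + (bytes / (1024 * 1024)) + " MB");
            // ~38,000 MB for these numbers -- which is why sparsely used fields still hurt
        }
    }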
Re: Adding generic payloads to a Term's posting list
Hi, I have been studying the Lucene indexing code for a bit. I am not sure if I understand the problem scope completely, but maybe storing extra information using TermInfosWriter is not needed to solve the problem? For the example of XML document tag depth, could that be a separate field? Because a Lucene term is a combination of (field, termText), depth could be a field, and even though two XML tags are the same, if their depths are different, they are still treated as separate terms. This is what I could think of so far. Jian On 10/10/05, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard > > See item #11 of API changes. Maybe along the lines of what you are > interested in, although I don't know if anyone has even attempted a design > of it. I would also like to see this, plus the ability to store info at > higher levels in the Index, such as Field (not on a per token basis), > Document (info about the document that spans its fields) and Index (such > as > coreference information). Alas, no time... > > -Grant > > >-Original Message- > >From: Shane O'Sullivan [mailto:[EMAIL PROTECTED] > >Sent: Monday, October 10, 2005 8:38 AM > >To: java-dev@lucene.apache.org > >Subject: Adding generic payloads to a Term's posting list > > > >Hi, > > > >To the best of my knowledge, it is not possible to add generic > >data to a Term's posting list. > >By this I mean info that is defined by the search engine, not > >Lucene itself. > >Whereas Lucene adds some data to the posting lists, such as > >the term's position within a document, there are many other > >useful types of information that could be attached to a term. > > > >Some examples would be in XML documents, to store the depth of > >a tag in the document, or font information, such as if the > >term appeared in a header or in the main body of text. > > > >Are there any plans to add such functionality to the API? If > >not, where would be the appropriate place to implement these > >changes? I presume the TermInfosWriter and TermInfosReader > >would have to be altered, as well as the classes which call > >them. Could this be done without having to modify the index in > >such a way that standard Lucene indexes couldn't read it? > > > >Thanks > > > >Shane > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
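A minimal sketch of the workaround jian describes (my illustration, written against the later Lucene 2.x Field API rather than the 2005 one): encode the tag depth into the field name, so identical tag text at different depths yields distinct (field, term) pairs.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DepthAsField {
        // Index the text of an XML element under a depth-specific field name,
        // e.g. body_depth1, body_depth2, ...
        static void addElement(Document doc, String tagText, int depth) {
            doc.add(new Field("body_depth" + depth, tagText,
                              Field.Store.NO, Field.Index.TOKENIZED));
        }
    }

At query time a search would then be directed at the field for the depth of interest, at the cost of one posting list per depth per term.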
skipInterval
Hi, All, I was reading some research papers regarding quick inverted index lookups. The classical approach to skipping dictates that a skip should be positioned every sqrt(df) document pointers. I looked at the current Lucene implementation. The skipInterval is hardcoded as follows in the TermInfosWriter.java class: int skipInterval = 16; Therefore, I have two questions: 1) Would it be a good idea and feasible to use sqrt(df) as the skipInterval, rather than hardcode it? 2) When merging segments, for every term, the skip table is buffered first in the RAMOutputStream and then written to the output stream. If there are a lot of documents for a term, this seems to consume a lot of memory, right? If instead we use sqrt(df) as the skipInterval, the memory consumed will be a lot less, since the number of skip entries grows only as sqrt(df) rather than linearly. Hope someone could shed more light on this. Thanks in advance, Jian
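A rough sketch (mine, not an actual patch) of what question 1 amounts to: derive the interval per term from its document frequency, which also caps the number of buffered skip entries at roughly sqrt(df).

    public class SkipIntervalSketch {
        static final int DEFAULT_SKIP_INTERVAL = 16;   // the value hardcoded in TermInfosWriter

        // Hypothetical per-term interval: one skip entry about every sqrt(df) postings.
        static int skipIntervalFor(int docFreq) {
            return Math.max(DEFAULT_SKIP_INTERVAL, (int) Math.sqrt(docFreq));
        }

        public static void main(String[] args) {
            int df = 1000000;
            int interval = skipIntervalFor(df);
            // df=1,000,000 -> interval=1000 and ~1,000 skip entries,
            // versus ~62,500 entries with the fixed interval of 16.
            System.out.println("interval=" + interval + ", entries=" + (df / interval));
        }
    }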
Fwd: skipInterval
Hi, All, I should have sent this to this email address rather than to the old jakarta email address. Sorry if it is double-posted. Jian -- Forwarded message -- From: jian chen <[EMAIL PROTECTED]> Date: Oct 15, 2005 6:36 PM Subject: skipInterval To: Lucene Developers List Hi, All, I was reading some research papers regarding quick inverted index lookups. The classical approach to skipping dictates that a skip should be positioned every sqrt(df) document pointers. I looked at the current Lucene implementation. The skipInterval is hardcoded as follows in the TermInfosWriter.java class: int skipInterval = 16; Therefore, I have two questions: 1) Would it be a good idea and feasible to use sqrt(df) as the skipInterval, rather than hardcode it? 2) When merging segments, for every term, the skip table is buffered first in the RAMOutputStream and then written to the output stream. If there are a lot of documents for a term, this seems to consume a lot of memory, right? If instead we use sqrt(df) as the skipInterval, the memory consumed will be a lot less, since the number of skip entries grows only as sqrt(df) rather than linearly. Hope someone could shed more light on this. Thanks in advance, Jian
Re: Fwd: skipInterval
Hi, Paul, Thanks for your email. I am not sure how the sqrt vs. constant for skipInterval will pan out for two or multiple required terms. That needs some experiments I guess. Cheers, Jian On 10/16/05, Paul Elschot <[EMAIL PROTECTED]> wrote: > > Jian, > -- Forwarded message -- > > From: jian chen <[EMAIL PROTECTED]> > > Date: Oct 15, 2005 6:36 PM > > Subject: skipInterval > > To: Lucene Developers List > > > > Hi, All, > > > > I was reading some research papers regarding quick inverted index > lookups. > > The classical approach to skipping dictates that a skip should be > positioned > > every sqrt(df) document pointers. > > The typical use of skipping info in Lucene is in ConjunctionScorer, for a > query with two required terms. There it helps for the case when one > term occurs much less frequently than another. > Iirc the sqrt() is optimal for a single lookup in a single level index, > reducing the complexity from linear to logarithmic. > Does the sqrt() also apply in the case of searching for two required terms > and returning all the documents in which they both occur? > > Regards, > Paul Elschot > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > >
lucene inter-process locking question
Hi, Lucene Developers, Just got a question regarding the locking mechanism in Lucene. I see in IndexReader, first there is synchronized(directory) to synchronize multiple threads, then, inside, there is the statement for grabbing the commit.lock. So, my question is, could the multi-thread synchronization also be done with commit.lock? In other words, I don't understand why synchronized(directory) is there when commit.lock could do both intra- and inter-process locking? Could anyone enlighten me about it? Thanks so much in advance, Jian
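For readers without the source at hand, a simplified paraphrase (mine; the real IndexReader code of that era is more involved) of the double-locking pattern the question refers to:

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.Lock;

    class OpenSketch {
        static void openIndex(Directory directory) throws Exception {
            synchronized (directory) {        // JVM-level: serializes threads in this process
                // file-based lock: serializes readers/writers across processes
                Lock commitLock = directory.makeLock("commit.lock");
                commitLock.obtain();
                try {
                    // read the segments file, open the segment readers, ...
                } finally {
                    commitLock.release();
                }
            }
        }
    }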
Fwd: lucene inter-process locking question
Hi, I did some research and found an answer at the following URL: http://www.gossamer-threads.com/lists/lucene/java-dev/21808?search_string=synchronized%20directory;#21808 So, now I understand that it is partly historical. Cheers, Jian -- Forwarded message -- From: jian chen <[EMAIL PROTECTED]> Date: Nov 7, 2005 4:17 PM Subject: lucene inter-process locking question To: java-dev@lucene.apache.org Hi, Lucene Developers, Just got a question regarding the locking mechanism in Lucene. I see in IndexReader, first there is synchronized(directory) to synchronize multiple threads, then, inside, there is the statement for grabbing the commit.lock. So, my question is, could the multi-thread synchronization also be done with commit.lock? In other words, I don't understand why synchronized(directory) is there when commit.lock could do both intra- and inter-process locking? Could anyone enlighten me about it? Thanks so much in advance, Jian
Re: Lucene Index backboned by DB
Dear All, I have some thoughts on this issue as well. 1) It might be OK to implement retrieving field values separately for a document. However, I think from a simplicity point of view, it might be better to have the application code do this drudgery. Adding this feature could complicate the nice and simple design of Lucene without much benefit. 2) The application could separate a document into several documents, for example, one document for indexing mainly, the other documents for storing binary values for different fields. Thus, given the relevant doc id, its associated binary value for a particular field could be loaded very fast with just a disk lookup (looking up the fdx file). This way, only the relevant field is loaded into memory rather than all of the fields for a doc. There is no change on the Lucene side, only some more work for the application code. My view is that a search library (or, in general, any library) should be small and efficient; since it is used by a lot of applications, any additional feature could potentially impact its robustness and performance. Any critiques or comments are welcome. Jian On 11/15/05, Robert Kirchgessner <[EMAIL PROTECTED]> wrote: > > Hi, > > a discussion in > > http://issues.apache.org/jira/browse/LUCENE-196 > > might be of interest to you. > > Did you think about storing the large pieces of documents > in a database to reduce the size of Lucene index? > > I think there are good reasons to adding support for > storing fields in separate files: > > 1. One could define a binary field of fixed length and store it > in a separate file. Then load it into memory and have fast > access for field contents. > > A use case might be: store calendar date (YYYY-MM-DD) > in three bytes, 4 bits for months, 5 bits for days and up to > 15 bits for years. If you want to retrieve hits sorted by date > you can load the fields file of size (3 * documents in index) bytes > and support sorting by date without accessing hard drive > for reading dates. > > 2. One could store document contents in a separate > file and fields of small size like title and some metadata > in the way it is stored now. It could speed up access to > fields. It would be interesting to know whether you gain > significant performance leaving the big chunks out, i.e. > not storing them in index. > > In my opinion 1. is the most interesting case: storing some > binary fields (dates, prices, length, any numeric metrics of > documents) would enable *really* fast sorting of hits. > > Any thoughts about this? > > Regards, > > Robert > > > > We have a similar problem > > On Tuesday, 15 November 2005 23:23, Karel Tejnora wrote: > > Hi all, > > in our testing application using lucene 1.4.3. Thank you guys for > > that great job. > > We have an index file around 12GiB, one file (merged). To retrieve hits it > > takes a nicely small amount of time, but reading the stored fields takes 10-100 > > times more. I think that is because all the fields are read. > > I would like to try to implement lucene index files as tables in a db with > > some lazy field loading. As I have searched the web I have found only an impl. > > of store.Directory (bdb), but it only holds data as binary streams. > > This technique will not be so helpful because BLOB operations are not > > fast. On the other side I will lose some of the freedom of > > document field variability, but I can omit a lot of the skipping and > > many opened files. Also IndexWriter can have document/term locking > > granularity. 
> > So I think that way leads to extending IndexWriter / IndexReader and having > > our own implementation of the index.Segment* classes. Is that the best way, or am I > > missing something about how to achieve this? > > If it is a bad idea, I will be happy to hear other possibilities. > > > > I would also like to join development of Lucene. Are there some pointers > > on how to start? > > > > Thx for reading this, > > sorry if I made some mistakes > > > > Karel > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
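A quick sketch (mine) of the three-byte date packing Robert describes above -- 4 bits for the month, 5 bits for the day, up to 15 bits for the year -- showing that a whole column of dates fits in 3 * numDocs bytes and sorts numerically in chronological order.

    public class PackedDate {
        // Pack year/month/day into 24 bits: [15 bits year][4 bits month][5 bits day].
        static int pack(int year, int month, int day) {
            return (year << 9) | (month << 5) | day;
        }

        static int year(int packed)  { return packed >>> 9; }
        static int month(int packed) { return (packed >>> 5) & 0xF; }
        static int day(int packed)   { return packed & 0x1F; }

        public static void main(String[] args) {
            int d = pack(2005, 11, 15);
            // The packed value fits in 3 bytes, and numeric order equals chronological order,
            // so sorting hits by date never touches the disk once the array is in memory.
            System.out.println(year(d) + "-" + month(d) + "-" + day(d) + " -> " + d);
        }
    }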
Re: DbDirectory with Berkeley DB Java Edition
Hi, I am pretty pessimistic about any DB directory implementation for Lucene. The nature of the Lucene index files does not really fit well into a relational database. Therefore, performance wise, the DB implementations would suffer a lot. Basically, I would discourage anyone on the DB implementation. 2 cents, Jian On 12/14/05, Steeley Andrew <[EMAIL PROTECTED]> wrote: > > Hello all, > > I know this question has been asked before, but I have not found a > definitive answer. Has anyone implemented a version of Andi Vajda's > DbDirectory that works with the Berkeley DB Java Edition? I am aware of > Oscar Picasso's postings on this list on the topic, but haven't been able to > contact Mr. Picasso or find his contribution as of yet. Anyone else? > > Thanks much, > Andy > >
Re: Filter
Hi, All, For the filter issue, my idea is to completely get rid of the filter interface. Can we not use the HitCollector and have it do the filtering work? I am in the process of writing a simpler engine based on the Lucene source code. I don't mind re-inventing the wheel, as I feel frustrated with the relations among Query, Searcher, Scorer, etc. I have done the initial round of my code already and it is in production and works great. Basically, the search interfaces will be like the following:

public interface IndexSearcher {
    public void search(HitCollector hc, int maxDocNum) throws IOException;
}

public interface HitCollector {
    public void collect(int doc, float score);

    // the total hits that meet the search criteria,
    // could be way bigger than the actual ScoreDocs
    // we record in the HitQueue
    public int getTotalHits();

    public ScoreDoc[] getScoreDocs();
    public int getNumScoreDocs();

    // max number of ScoreDocs this hit collector could hold
    public int getCapacity();
}

I have refactored the Scorers in Lucene to be just Searchers, because the scorer is the actual searcher that does the ranking, right? I will publish my code using an open source license this year. Cheers, Jian Chen Lead Developer, Seattle Lighting On 3/10/06, eks dev <[EMAIL PROTECTED]> wrote: > > It looks to me everybody agrees here, no? If yes, it > would be really useful if somebody with commit rights > could add 1) and 2) to the trunk (these patches > practically already exist). > > It is not an invasive change and there are no problems > with compatibility. Also, I have noticed a lot of > people trying to "hack in" better Filter support using > Paul's patches from JIRA. > > That would open a window for some smart code to get > committed into Lucene core. > > Just have a look at Filtering support in Solr, > beautiful, but unfortunately also "hacked" just to > overcome BitSet on Filter. > > > > > --- Chris Hostetter <[EMAIL PROTECTED]> wrote: > > > 1) commit DocNrSkipper interface to the core code > > base > > 2) add the following method declaration to the > > Filter class... > > public DocNrSkipper getSkipper(IndexReader) > > throws IOException > > ...implement this method by calling bits, and > > returning an instance > > of BitSetSortedIntList > > 3) indicate that Filter.bits() is deprecated. > > 4) change all existing calls to Filter.bits() in > > the core lucene code > > base to call Filter.getSkipper and do whatever > > iterating is > > necessary. > > 5) gradually reimplement all of the concrete > > instances of Filter in > > the core lucene code base so they override the > > getSkipper method > > with something that returns a more "iterator" > > style DocNrSkipper, > > and implement their bits() method to use the > > DocNrSkipper from the > > new getSkipper() method to build up the bit set > > if clients call it > > directly. > > 6) wait a suitable amount of time. > > 7) remove Filter.bits() and all of the concrete > > implementations from the > > lucene core. > > > > > > > > > > -Hoss > > > > > > > > > > ___ > To help you stay safe and secure online, we've developed the all new > Yahoo! Security Centre. http://uk.security.yahoo.com > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
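To make the "filter inside the collector" idea concrete, a small sketch (mine, written against the stock Lucene HitCollector API of that era rather than jian's unreleased engine): the collector consults a BitSet of allowed documents and simply ignores everything else.

    import java.util.BitSet;
    import org.apache.lucene.search.HitCollector;

    // A collector that does its own filtering: only documents whose bit is set are counted.
    public class FilteringCollector extends HitCollector {
        private final BitSet allowed;   // e.g. built from a date range or category restriction
        private int totalHits = 0;

        public FilteringCollector(BitSet allowed) {
            this.allowed = allowed;
        }

        public void collect(int doc, float score) {
            if (!allowed.get(doc)) {
                return;                 // filtered out -- no separate Filter object involved
            }
            totalHits++;
            // ... push (doc, score) into a bounded priority queue, etc.
        }

        public int getTotalHits() {
            return totalHits;
        }
    }

It would be driven by the existing Searcher.search(query, collector) call, so no Filter object ever enters the picture.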
Re: this == that
I am wondering if interning Strings is really that critical for performance. The biggest bottleneck is still the disk. So, maybe we can use String.equals(...) instead of ==. Jian On 5/1/06, DM Smith <[EMAIL PROTECTED]> wrote: karl wettin wrote: > The code is filled with string equality code using == rather than > equals(). I honestly don't think it saves a single clock tick as the > JIT takes care of it when the first line of code in the equals method > is if (this == that) return true; If the strings are intern() then it should be a touch faster. If the strings are not interned then I think it may be a premature optimization. IMHO, using intern to optimize space is a reasonable optimization, but using == to compare such strings is error prone as it is possible that the comparison is looking at strings that have not been interned. Unless object identity is what is being tested or intern is an invariant, I think it is dangerous. It is easy to forget to intern or to propagate the pattern via cut and paste to an inappropriate context. > > Please correct me if I'm wrong. > > I can commit to do the changes to the core code if it is considered > interesting. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
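A tiny illustration (mine) of the hazard DM Smith describes: == is only safe when interning is an invariant, while equals() is always correct and already short-circuits on identity.

    public class InternDemo {
        public static void main(String[] args) {
            String a = "title";                  // string literals are interned
            String b = new String("title");      // a distinct, non-interned object

            System.out.println(a == b);          // false -- identity comparison fails
            System.out.println(a.equals(b));     // true  -- value comparison is correct
            System.out.println(a == b.intern()); // true  -- only safe if interning is guaranteed
        }
    }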
storing term text internally as byte array and bytecount as prefix, etc.
Hi, All, Recently I have been following through the whole discussion on storing text/strings as standard UTF-8 and how to achieve that in Lucene. If we are storing the term text and the field strings as UTF-8 bytes, I now understand that it is a tricky issue because of the performance problem we are still facing when converting back and forth between the UTF-8 bytes and java Strings. This especially seems to be a problem for the segment merger routine, which loads the segment term enums and will convert the UTF-8 bytes back to Strings during the merge operation. Just a thought here, could we always represent the term text as UTF-8 bytes internally? So Term.java would have the private member variable private byte[] utf8bytes; instead of private String text; Plus, a Term object could be constructed either from a String or from a utf8 byte array. This way, for indexing new documents, new Term(String text) is called and utf8bytes will be obtained from the input term text. For the segment term info merge, the utf8bytes will be loaded from the Lucene index, which already stores the term text as utf8 bytes. Therefore, no conversion is needed. I hope I explained my thoughts. Make sense? Cheers, Jian Chen
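A rough sketch (mine, not a patch) of the Term shape being proposed: the text lives as standard UTF-8 bytes, and the String form is produced only on demand.

    import java.io.UnsupportedEncodingException;

    // Sketch of a Term that keeps its text as standard UTF-8 bytes internally.
    public final class ByteTerm {
        private final String field;
        private final byte[] utf8bytes;

        // Used when indexing new documents: convert from the analyzer's String once.
        public ByteTerm(String field, String text) throws UnsupportedEncodingException {
            this(field, text.getBytes("UTF-8"));
        }

        // Used when reading terms back during a merge: the index already stores UTF-8,
        // so no conversion is needed at all.
        public ByteTerm(String field, byte[] utf8bytes) {
            this.field = field;
            this.utf8bytes = utf8bytes;
        }

        public String field() { return field; }

        public String text() throws UnsupportedEncodingException {
            return new String(utf8bytes, "UTF-8");   // pay the conversion cost only when asked
        }
    }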
Re: storing term text internally as byte array and bytecount as prefix, etc.
Hi, Marvin, Thanks for your quick response. I am in the camp of fearless refactoring, even at the expense of breaking compatibility with previous releases. ;-) Compatibility aside, I am trying to identify if changing the implementation of Term is the right way to go for this problem. If it is, I think it would be worthwhile rather than putting band-aid on the existing API. Cheers, Jian Changing the implementation of Term would have a very broad impact; I'd look for other ways to go about it first. But I'm not an expert on SegmentMerger, as KinoSearch doesn't use the same technique for merging. My plan was to first submit a patch that made the change to the file format but didn't touch SegmentMerger, then attack SegmentMerger and also see if other developers could suggest optimizations. However, I have an awful lot on my plate right now, and I basically get paid to do KinoSearch-related work, but not Lucene-related work. It's hard for me to break out the time to do the java coding, especially since I don't have that much experience with java and I'm slow. I'm not sure how soon I'll be able to get back to those bytecount patches. Marvin Humphrey Rectangular Research http://www.rectangular.com/
Re: storing term text internally as byte array and bytecount as prefix, etc.
Hi, Chuck, Using standard UTF-8 is very important for Lucene index so any program could read the Lucene index easily, be it written in perl, c/c++ or any new future programming languages. It is like storing data in a database for web application. You want to store it in such a way that other programs can manipulate easily other than only the web app program. Because there will be cases that you want to mass update or mass change the data, and you don't want to write only web apps for doing it, right? Cheers, Jian On 5/1/06, Chuck Williams <[EMAIL PROTECTED]> wrote: Could someone summarize succinctly why it is considered a major issue that Lucene uses the Java modified UTF-8 encoding within its index rather than the standard UTF-8 encoding. Is the only concern compatibility with index formats in other Lucene variants? The API to the values is a String, which uses Java's char representation, so I'm confused why the encoding in the index is so important. One possible benefit of a standard UTF-8 index encoding would be streaming content into and out of the index with no copying or conversions. This relates to the lazy field loading mechanism. Thanks for any clarification, Chuck jian chen wrote on 05/01/2006 04:24 PM: > Hi, Marvin, > > Thanks for your quick response. I am in the camp of fearless refactoring, > even at the expense of breaking compatibility with previous releases. ;-) > > Compatibility aside, I am trying to identify if changing the > implementation > of Term is the right way to go for this problem. > > If it is, I think it would be worthwhile rather than putting band-aid > on the > existing API. > > Cheers, > > Jian > > Changing the implementation of Term >> would have a very broad impact; I'd look for other ways to go about >> it first. But I'm not an expert on SegmentMerger, as KinoSearch >> doesn't use the same technique for merging. >> >> My plan was to first submit a patch that made the change to the file >> format but didn't touch SegmentMerger, then attack SegmentMerger and >> also see if other developers could suggest optimizations. >> >> However, I have an awful lot on my plate right now, and I basically >> get paid to do KinoSearch-related work, but not Lucene-related work. >> It's hard for me to break out the time to do the java coding, >> especially since I don't have that much experience with java and I'm >> slow. I'm not sure how soon I'll be able to get back to those >> bytecount patches. >> >> Marvin Humphrey >> Rectangular Research >> http://www.rectangular.com/ >> > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: storing term text internally as byte array and bytecount as prefix, etc.
Plus, as open source and open standard advocates, we don't want to be like Micros$ft, who claims to use industrial "standard" XML as the next generation word file format. However, it is very hard to write your own Word reader, because their word file format is proprietary and hard to write programs for. Jian On 5/1/06, jian chen <[EMAIL PROTECTED]> wrote: Hi, Chuck, Using standard UTF-8 is very important for Lucene index so any program could read the Lucene index easily, be it written in perl, c/c++ or any new future programming languages. It is like storing data in a database for web application. You want to store it in such a way that other programs can manipulate easily other than only the web app program. Because there will be cases that you want to mass update or mass change the data, and you don't want to write only web apps for doing it, right? Cheers, Jian On 5/1/06, Chuck Williams <[EMAIL PROTECTED]> wrote: > > Could someone summarize succinctly why it is considered a major issue > that Lucene uses the Java modified UTF-8 encoding within its index > rather than the standard UTF-8 encoding. Is the only concern > compatibility with index formats in other Lucene variants? The API to > the values is a String, which uses Java's char representation, so I'm > confused why the encoding in the index is so important. > > One possible benefit of a standard UTF-8 index encoding would be > streaming content into and out of the index with no copying or > conversions. This relates to the lazy field loading mechanism. > > Thanks for any clarification, > > Chuck > > > jian chen wrote on 05/01/2006 04:24 PM: > > Hi, Marvin, > > > > Thanks for your quick response. I am in the camp of fearless > refactoring, > > even at the expense of breaking compatibility with previous releases. > ;-) > > > > Compatibility aside, I am trying to identify if changing the > > implementation > > of Term is the right way to go for this problem. > > > > If it is, I think it would be worthwhile rather than putting band-aid > > on the > > existing API. > > > > Cheers, > > > > Jian > > > > Changing the implementation of Term > >> would have a very broad impact; I'd look for other ways to go about > >> it first. But I'm not an expert on SegmentMerger, as KinoSearch > >> doesn't use the same technique for merging. > >> > >> My plan was to first submit a patch that made the change to the file > >> format but didn't touch SegmentMerger, then attack SegmentMerger and > >> also see if other developers could suggest optimizations. > >> > >> However, I have an awful lot on my plate right now, and I basically > >> get paid to do KinoSearch-related work, but not Lucene-related work. > >> It's hard for me to break out the time to do the java coding, > >> especially since I don't have that much experience with java and I'm > >> slow. I'm not sure how soon I'll be able to get back to those > >> bytecount patches. > >> > >> Marvin Humphrey > >> Rectangular Research > >> http://www.rectangular.com/ > >> > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: storing term text internally as byte array and bytecount as prefix, etc.
Hi, Doug, I totally agree with what you said. Yeah, I think it is more of a file format issue, less of an API issue. It seems that we just need to add an extra constructor to Term.java to take in a utf8 byte array. Lucene 2.0 is going to break backward compatibility anyway, right? So, maybe this change to standard UTF-8 could be a hot item on the Lucene 2.0 list? Cheers, Jian Chen On 5/2/06, Doug Cutting <[EMAIL PROTECTED]> wrote: Chuck Williams wrote: > For lazy fields, there would be a substantial benefit to having the > count on a String be an encoded byte count rather than a Java char > count, but this has the same problem. If there is a way to beat this > problem, then I'd start arguing for a byte count. I think the way to beat it is to keep things as bytes as long as possible. For example, each term in a Query needs to be converted from String to byte[], but after that all search computation could happen comparing byte arrays. (Note that lexicographic comparisons of UTF-8 encoded bytes give the same results as lexicographic comparisons of Unicode character strings.) And, when indexing, each Token would need to be converted from String to byte[] just once. The Java API can easily be made back-compatible. The harder part would be making the file format back-compatible. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
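A small sketch (mine) of the byte-level comparison Doug describes; the one detail that matters is that the UTF-8 bytes must be compared as unsigned values for the ordering to match Unicode code point order.

    public class Utf8Compare {
        // Lexicographic comparison of two UTF-8 byte arrays; with unsigned bytes this
        // matches the code point order of the decoded strings.
        static int compare(byte[] a, byte[] b) {
            int len = Math.min(a.length, b.length);
            for (int i = 0; i < len; i++) {
                int x = a[i] & 0xFF;   // treat bytes as unsigned 0..255
                int y = b[i] & 0xFF;
                if (x != y) {
                    return x - y;
                }
            }
            return a.length - b.length;
        }
    }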
Re: when was the document number initially written into .frq file?
It is in the DocumentWriter.java class. Look at the writePostings(...) method. Here are the lines:

// add an entry to the freq file
int f = posting.freq;
if (f == 1)                   // optimize freq=1
  freq.writeVInt(1);          // set low bit of doc num.
else {
  freq.writeVInt(0);          // the document number
  freq.writeVInt(f);          // frequency in doc
}

Any other question? Jian On 5/6/06, Charlie <[EMAIL PROTECTED]> wrote: Hello, Would any developer please give me a hint of when the document number was initially written into the .frq file? From:

// it does not really write the doc# in writePostings()
final class DocumentWriter
  private final void writePostings(Posting[] postings, String segment)
    int postingFreq = posting.freq;
    if (postingFreq == 1)           // optimize freq=1
      freq.writeVInt(1);            // set low bit of doc num.
    else {
      freq.writeVInt(0);            // the document number
      freq.writeVInt(postingFreq);  // frequency in doc
    }

// it writes the doc# in appendPostings()
final class SegmentMerger
  private final int appendPostings(SegmentMergeInfo[] smis, int n)
    int docCode = (doc - lastDoc) << 1;   // use low bit to flag freq=1
    lastDoc = doc;
    int freq = postings.freq();
    if (freq == 1) {
      freqOutput.writeVInt(docCode | 1);  // write doc & freq=1
    } else {
      freqOutput.writeVInt(docCode);      // write doc
      freqOutput.writeVInt(freq);         // write frequency in doc
    }

But then I am further confused: in order to call int doc = postings.doc(); in appendPostings(), the doc# should already have been written. Chicken-egg-chicken-egg ... Should there be another place for the initial writing of the doc# ? -- Thanks for your advice, Charlie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
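For completeness, a hedged sketch (mine) of how a reader interprets what that code writes: the document delta is shifted left by one and the low bit flags the common freq == 1 case, so the frequency VInt can be omitted.

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    class FreqDecodeSketch {
        // Decode one (docDelta, freq) entry as written by writePostings()/appendPostings().
        static void readEntry(IndexInput in) throws IOException {
            int docCode = in.readVInt();
            int docDelta = docCode >>> 1;    // upper bits: gap to the previous doc number
            int freq;
            if ((docCode & 1) != 0) {
                freq = 1;                    // low bit set: freq == 1, no extra VInt stored
            } else {
                freq = in.readVInt();        // otherwise the frequency follows explicitly
            }
            // use docDelta and freq ...
        }
    }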
Re: when was the document number initially written into .frq file?
Looking at your email again. You are confusing the initial writing of postings with the segment merging. Once the doc number is written, the .frq file is not changed. The segment merge process will write to a new .frq file. Make sense? Jian On 5/8/06, jian chen <[EMAIL PROTECTED]> wrote: It is in DocumentWriter.java class. Look at writePostings(...) method. Here are the lines: // add an entry to the freq file int f = posting.freq; if (f == 1) // optimize freq=1 freq.writeVInt(1); // set low bit of doc num. else { freq.writeVInt(0); // the document number freq.writeVInt(f); // frequency in doc } Any other question? Jian On 5/6/06, Charlie <[EMAIL PROTECTED]> wrote: > > Hello, > > Would any developer please give me a hint of when the document number > was > initially written into .frq file? > > From: //it is not really write the doc# in writePostings() > > final class DocumentWriter > private final void writePostings(Posting[] postings, String segment) > int postingFreq = posting.freq; > if (postingFreq == 1) // optimize > freq=1 > freq.writeVInt(1); // set low bit of doc > num. > else { > freq.writeVInt(0); // the document number > freq.writeVInt(postingFreq);// frequency > in doc > } > > //it is write the doc# in appendPostings() > > final class SegmentMerger > private final int appendPostings(SegmentMergeInfo[] smis, int n) > > int docCode = (doc - lastDoc) << 1; // use low bit to flag > freq=1 > lastDoc = doc; > > int freq = postings.freq(); > if (freq == 1) { > freqOutput.writeVInt(docCode | 1); // write doc & freq=1 > } else { > freqOutput.writeVInt(docCode); // write doc > freqOutput.writeVInt(freq); // write frequency in > doc > } > > //but then I am further confused that in order to call > int doc = postings.doc(); in appendPostings(), > the doc# should already been written. > > Chicken-egg-chicken-egg ... > > Should there be another place for the initial writing of the doc# ? > > -- > Thanks for your advice, > Charlie > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: How To find which field has the search term in Hit?
You can store the field values and then load them to do a real-time comparison. Simple solution... Jian On 5/24/06, N <[EMAIL PROTECTED]> wrote: Hi I am searching on multiple fields. Is it possible to retrieve the field(s) which contain the search terms from the documents returned as Hits. Best Noon - Sneak preview the all-new Yahoo.com. It's not radically different. Just radically better.
Re: How To find which field has the search term in Hit?
Hi, Noon, Sorry I did not initially understand the detailed problem you have. This sounds like a prefix match problem. You can index each field and then do a prefix match against those fields. By the way, I think your question could be better served by posting to the lucene user group. Cheers, Jian On 5/29/06, N <[EMAIL PROTECTED]> wrote: Thanks for the reply but I couldn't get your point. Could you elaborate on it further? For instance we have FirstName (= Martin ), LastName (= Spaniol), Company (= Mark Co.) and we search for "Mar*" which will be found in FirstName and Company ..so how can I retrieve the info that it is found in only the FirstName and Company fields. Best Noon. jian chen <[EMAIL PROTECTED]> wrote: You can store the field values and then load them to do a real-time comparison. Simple solution... Jian On 5/24/06, N wrote: > > Hi > > I am searching on multiple fields. Is it possible to retrieve the field > (s) which contain the search terms from the documents returned as Hits. > > Best > Noon > > > - > Sneak preview the all-new Yahoo.com. It's not radically different. Just > radically better. > - Blab-away for as little as 1¢/min. Make PC-to-Phone Calls using Yahoo! Messenger with Voice.
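A hedged sketch (mine, with the field names borrowed from Noon's example) of the store-and-compare approach jian suggested earlier in the thread: for each hit document, check which stored field values actually start with the query prefix.

    import org.apache.lucene.document.Document;

    public class MatchingFields {
        // Given a hit document, report which of the named fields start with the prefix.
        // Assumes the fields were stored so their values can be read back.
        static void printMatchingFields(Document doc, String[] fields, String prefix) {
            String p = prefix.toLowerCase();
            for (int i = 0; i < fields.length; i++) {
                String value = doc.get(fields[i]);
                if (value != null && value.toLowerCase().startsWith(p)) {
                    System.out.println("\"" + prefix + "*\" matched in field " + fields[i]);
                }
            }
        }
    }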
Kudos to the wonderful Lucene search library
Hi, All, Our site, www.destinationlighting.com, went live yesterday. It is powered by the Lucene search engine and the Velocity template engine. It will be the best and most comprehensive online store for lighting fixtures and related hardware. Many thanks to the Lucene developers and the open source community. Jian Chen Lead Developer www.destinationlighting.com
Re: Using Database instead of File system
For a real search engine, performance is the most important factor. I think a file-system-based index is better than storing the indexes in a database because of the pure speed you will get. Cheers, Jian On 9/25/06, Simon Willnauer <[EMAIL PROTECTED]> wrote: Have a look at the compass framework http://www.opensymphony.com/compass/ Compass also provides a Lucene Jdbc Directory implementation, allowing storing a Lucene index within a database for both pure Lucene applications and Compass enabled applications. best regards simon On 9/25/06, Reuven Ivgi <[EMAIL PROTECTED]> wrote: > Hello, > > I have just started to work with Lucene > > Is it possible to define the index files of lucene to be on a database > (such as MySql), just for backup and restore purposes > > Thanks in Advance > > > > Reuven Ivgi > > > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Beyond Lucene 2.0 Index Design
Hi, Jeff, I like the idea of impact-based scoring. However, could you elaborate more on why we would only need to use a single field at search time? In Lucene, the indexed terms are field-specific, and two terms, even if they are the same, are still different terms if they are in different fields. So, I think the multiple-field scenario is still needed, right? What if the user wants to search on both subject and content for emails, for example, and sometimes only wants to search on subject -- without multiple fields, how would this type of task be handled? I got lost on this; could anyone educate me? Thanks, Jian On 1/9/07, Dalton, Jeffery <[EMAIL PROTECTED]> wrote: I'm not sure we fully understand one another, but I'll try to explain what I am thinking. Yes, it has use after sorting. It is used at query time for document scoring in place of the TF and length norm components (new scorers would need to be created). Using an impact based index moves most of the scoring from query time to index time (trades query time flexibility for greatly improved query search performance). Because the field boosts, length norm, position boosts, etc... are incorporated into a single document-term-score, you can use a single field at search time. It allows one posting list per query term instead of the current one posting list per field per query term (MultiFieldQueryParser wouldn't be necessary in most cases). In addition to having fewer posting lists to examine, you often don't need to read to the end of long posting lists when processing with a score-at-a-time approach (see Anh/Moffat's Pruned Query Evaluation Using Pre-Computed Impacts, SIGIR 2006) for details on one potential algorithm. I'm not quite sure what you mean when you mention leaving them out and re-calculating them at merge time. - Jeff > -Original Message- > From: Marvin Humphrey [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 09, 2007 2:58 PM > To: java-dev@lucene.apache.org > Subject: Re: Beyond Lucene 2.0 Index Design > > > On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote: > > > e. > > f. ],...[docN, freq > > ,]) > > Does the impact have any use after it's used to sort the postings? > Can we leave it out of the index format and recalculate at merge-time? > > Marvin Humphrey > Rectangular Research > http://www.rectangular.com/ > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Beyond Lucene 2.0 Index Design
Hi, Jeff, Also, how to handle the phrase based queries? For example, here are two posting lists: TermA: X Y TermB: Y X I am not sure how you would return document X or Y for a search of the phrase "TermA Term B". Which should come first? Thanks, Jian On 1/9/07, Dalton, Jeffery <[EMAIL PROTECTED]> wrote: I'm not sure we fully understand one another, but I'll try to explain what I am thinking. Yes, it has use after sorting. It is used at query time for document scoring in place of the TF and length norm components (new scorers would need to be created). Using an impact based index moves most of the scoring from query time to index time (trades query time flexibility for greatly improved query search performance). Because the field boosts, length norm, position boosts, etc... are incorporated into a single document-term-score, you can use a single field at search time. It allows one posting list per query term instead of the current one posting list per field per query term (MultiFieldQueryParser wouldn't be necessary in most cases). In addition to having fewer posting lists to examine, you often don't need to read to the end of long posting lists when processing with a score-at-a-time approach (see Anh/Moffat's Pruned Query Evaluation Using Pre-Computed Impacts, SIGIR 2006) for details on one potential algorithm. I'm not quite sure what you mean when mention leaving them out and re-calculating them at merge time. - Jeff > -Original Message- > From: Marvin Humphrey [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 09, 2007 2:58 PM > To: java-dev@lucene.apache.org > Subject: Re: Beyond Lucene 2.0 Index Design > > > On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote: > > > e. > > f. ],...[docN, freq > > ,]) > > Does the impact have any use after it's used to sort the postings? > Can we leave it out of the index format and recalculate at merge-time? > > Marvin Humphrey > Rectangular Research > http://www.rectangular.com/ > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Beyond Lucene 2.0 Index Design
I also got the same question. It seems it is very hard to efficiently do phrase-based queries. I think most search engines do phrase-based queries, or at least appear to. So, as in google, the query result must contain all the words the user searched on. It seems to me that the impact-sorted list makes sense if you are trying to do pure vector-space-based ranking. This is from what I have read in the research papers. They all talk about how to optimize the vector space model using this impact-sorted list approach. Unfortunately, the vector space model has serious drawbacks. It does not take the inter-word relation into account. Thus, it could produce a search result where documents matching only some keywords rank higher than documents matching all of them. I have yet to see whether the impact-sorted list approach could handle this efficiently. Cheers, Jian On 1/11/07, Marvin Humphrey <[EMAIL PROTECTED]> wrote: On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote: > e. > f. ],...[docN, freq > ,]) How do you build an efficient PhraseScorer to work with an impact- sorted posting list? The way PhraseScorer currently works is: find a doc that contains all terms, then see if the terms occur consecutively in phrase order, then determine a score. The TermDocs objects feeding PhraseScorer return doc_nums in ascending order, so finding an intersection is easy. But if the document numbers are returned in what looks to the PhraseScorer like random order... ?? Marvin Humphrey Rectangular Research http://www.rectangular.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: NewIndexModifier - - - DeletingIndexWriter
Hey guys, Following the Lucene dev mailing list for some time now, I am concerned that Lucene is slowly losing all its simplicity and becoming a complicated mess. I think keeping IndexReader and IndexWriter the way they worked in 1.2 is even better, no? Software should be designed to be simple to use and maintain; that's my concern. Cheers, Jian On 2/9/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 2/9/07, Doug Cutting <[EMAIL PROTECTED]> wrote: > Yonik Seeley wrote: > > As long as you wouldn't object to a org.apache.lucene package in Solr... > > With the understanding of course, that the onus would be on Solr developers > > to keep up with any changes. > > I wouldn't object to that. Would you? Nope... Solr bundles Lucene, so if there are changes that take longer to adapt to - so be it. Solr doesn't need to work with every Lucene version. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: NewIndexModifier - - - DeletingIndexWriter
I totally second Robert's thought. My concern is, to get the raw speed of Lucene, you got to get to the basics. If we start to apply layers upon layers of code to just mask off the internals of Lucene, it will not do any good. An example perhaps is the Windoze vs. Linux. As an end user, you get all the fancy features in Windoze, but, as a software developer, you get frustrated when not able to access the low level stuff easily. Linux is good in this aspect. I think the Lucene library should be designed simple and efficient in order to allow tweaking for raw speed. That's the spirit for large scale search engines, right? Even Google file system has to sacrifice some design for raw speed, i.e., files are append-only. Cheers, Jian On 2/13/07, robert engels <[EMAIL PROTECTED]> wrote: Lucene is not a word processor. It is a development library. I think an understanding of any development library is essential to using it properly. Once you have even a basic understanding of the Lucene design, it is very clear as to why deletes are performed using the IndexReader. If you attempt to use Lucene without understanding its use proper and design (there are many people on this list that think it is a database) you will probably get most things wrong. On Feb 13, 2007, at 1:17 AM, Nadav Har'El wrote: > On Fri, Feb 09, 2007, jian chen wrote about "Re: NewIndexModifier - > - - DeletingIndexWriter": >> Following the Lucene dev mailing list for sometime now, I am >> concerned that >> lucene is slowing losing all the simplicity and become a >> complicated mess. >> I think keeping IndexReader and IndexWriter the way it works in >> 1.2 even is >> better, no? >> Software should be designed to be simple to use and maintain, >> that's my >> concern. > > Hi, I wonder - how do you see the original IndexReader and IndexWriter > separation "simple to use"? > > Every single user of Lucene that I know, encountered very quickly > the problem > of how to delete documents; Many of them started to use > IndexModifier, and > then suddenly realized its performance makes it unusable; Many (as > you can > also see from examples sent to the user list once in a while) ended > up writing > their own complex code for buffering deletes (and similar solutions). > > So for users, the fact that an index "writer" cannot delete, but > rather an > index "reader" (!) is the one that can delete documents, wasn't > simplicity - > it was simply confusing, and hard to use. It meant each user needed > to work > hard to get around this limitation. Wouldn't it be better if Lucene > included > this functionality that many (if not most) users need, out of the box? > > -- > Nadav Har'El| Tuesday, Feb 13 2007, 25 > Shevat 5767 > IBM Haifa Research Lab > |- > |Just remember that if the > world didn't > http://nadav.harel.org.il |suck, we would all fall off. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text
Hi, Mark, Your program is very helpful. I am trying to understand your code, but it seems it would take longer to do that than to simply ask you some questions. 1) What is the sliding window used for? Is it that the Analyzer remembers the previously seen N tokens, where N is the window size? 2) As the Analyzer does text parsing, is it that patterns that happened before (in the previous N-token window) are remembered, and any such pattern in the latest N-token window is recognized? Could you provide some more insight into how your algorithm removes duplicate snippets of text from many documents? Thanks, and I really appreciate your help. Jian On 3/20/07, Mark Harwood (JIRA) <[EMAIL PROTECTED]> wrote: [ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel] Mark Harwood updated LUCENE-725: Attachment: NovelAnalyzer.java Updated version can now process any number of documents and remove "boilerplate" text tokens such as copyright notices etc. New version automatically maintains only a sliding window of content in which it searches for duplicate paragraphs enabling it to process unlimited numbers of documents. > NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text > --- > > Key: LUCENE-725 > URL: https://issues.apache.org/jira/browse/LUCENE-725 > Project: Lucene - Java > Issue Type: New Feature > Components: Analysis >Reporter: Mark Harwood >Priority: Minor > Attachments: NovelAnalyzer.java, NovelAnalyzer.java > > > This is a class I have found to be useful for analyzing small (in the hundreds) collections of documents and removing any duplicate content such as standard disclaimers or repeated text in an exchange of emails. > This has applications in sampling query results to identify key phrases, improving speed-reading of results with similar content (eg email threads/forum messages) or just removing duplicated noise from a search index. > To be more generally useful it needs to scale to millions of documents - in which case an alternative implementation is required. See the notes in the Javadocs for this class for more discussion on this -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text
Also, how about this scenario. 1) The Analyzer does 100 documents, each with a copyright notice inside. I guess in this case, the copyright notices will be removed when indexing. 2) The Analyzer does another 50 documents, each without any copyright notice inside. 3) Then, the Analyzer runs into a document that has a copyright notice inside again. My question is, would the Analyzer be able to remove the copyright notice in step 3)? Cheers, Jian On 3/20/07, jian chen <[EMAIL PROTECTED]> wrote: Hi, Mark, Your program is very helpful. I am trying to understand your code but it seems would take longer to do that than simply asking you some questions. 1) What is the sliding window used for? It is that the Analyzer remembers the previously seen N tokens, and N is the window size? 2) As the Analyzer does text parsing, is it that the patterns happened before (in the previous N token window) is used and any such pattern in the latest N token window is recognized? Could you provide some more insights how your algorithm works by removing duplicate snippets of text from many documents? Thanks and really appreciate your help. Jian On 3/20/07, Mark Harwood (JIRA) <[EMAIL PROTECTED] > wrote: > > > [ > https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel] > > Mark Harwood updated LUCENE-725: > > > Attachment: NovelAnalyzer.java > > Updated version can now process any number of documents and remove > "boilerplate" text tokens such as copyright notices etc. > New version automatically maintains only a sliding window of content in > which it searches for duplicate paragraphs enabling it to process unlimited > numbers of documents. > > > NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out > all "boilerplate" text > > > --- > > > > Key: LUCENE-725 > > URL: https://issues.apache.org/jira/browse/LUCENE-725 > > Project: Lucene - Java > > Issue Type: New Feature > > Components: Analysis > >Reporter: Mark Harwood > >Priority: Minor > > Attachments: NovelAnalyzer.java, NovelAnalyzer.java > > > > > > This is a class I have found to be useful for analyzing small (in the > hundreds) collections of documents and removing any duplicate content such > as standard disclaimers or repeated text in an exchange of emails. > > This has applications in sampling query results to identify key > phrases, improving speed-reading of results with similar content (eg email > threads/forum messages) or just removing duplicated noise from a search > index. > > To be more generally useful it needs to scale to millions of documents > - in which case an alternative implementation is required. See the notes in > the Javadocs for this class for more discussion on this > > -- > This message is automatically generated by JIRA. > - > You can reply to this email to add a comment to the issue online. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text
Hi, Mark, Thanks a lot for your explanation. This code is very useful so it could even be in a separate library for text extraction. Again, thanks for taking time to answer my question. Jian On 3/21/07, markharw00d <[EMAIL PROTECTED]> wrote: The Analyzer keeps a window of (by default) the last 300 documents. Every token created in these cached documents is stored for reference and as new documents arrive their token sequences are examined to see if any of the sequences was seen before, in which case the analyzer does not emit them as tokens. A sequence is of a definable length but I have found something like 10 to be a good value (passed to the constructor). If I was indexing this newslist for example all of your content copied below would be removed automatically (because it occurred more than once within a 300 documents window). >>My question is, would the Analyzer be able to remove the copy right notice in step 3)? In your example, "yes" - because it re-occurred within 300 documents. >>Could you provide some more insights how your algorithm works There are a number of optimizations to make it run fast that make the code trickier to read. The basis of it is: 1) a "tokens" map is contained with a key for every unique term 2) The value for each map entry is a list of ChainedTokens each of which represent an occurrence of the term in a doc 3) ChainedTokens contain the current term plus a reference to the previous term in that document. 4) The analyzer periodically (i.e not for every token) checks the tokens map for the current term and looks at all previous occurrences of this term, following the sequences of ChainedTokens looking for a common pattern. 5) As soon as a pattern looks like it is established and the analyzer is "onto something" it switches to a mode of concentrating solely on comparing the current sequence with a single likely previous sequence rather than testing ALL previous sequences as in step 4). If the repeated chains of tokens is over the desired sequence length these tokens are not emitted as part of the output TokenStream. * Periodically the tokens map and ChainedToken occurrences are pruned to avoid bloating memory. As part of this exercise "Stop words" are also automatically identified and recorded to avoid the cost of chasing all occurrences (step 4) or recording occurrences for very common words. Glad you find it useful. Cheers, Mark jian chen wrote: > Also, how about this scenario. > > 1) The Analyzer does 100 documents, each with copy right notice inside. I > guess in this case, the copy right notices will be removed when indexing. > > 2) The Analyzer does another 50 documents, each without any copy right > notice inside. > > 3) Then, the Analyzer runs into a document that has copy right notice > inside > again. > > My question is, would the Analyzer be able to remove the copy right > notice > in step 3)? > > Cheers, > > Jian > > On 3/20/07, jian chen <[EMAIL PROTECTED]> wrote: >> >> Hi, Mark, >> >> Your program is very helpful. I am trying to understand your code but it >> seems would take longer to do that than simply asking you some >> questions. >> >> 1) What is the sliding window used for? It is that the Analyzer >> remembers >> the previously seen N tokens, and N is the window size? >> >> 2) As the Analyzer does text parsing, is it that the patterns happened >> before (in the previous N token window) is used and any such pattern >> in the >> latest N token window is recognized? 
>> Could you provide some more insight into how your algorithm works by
>> removing duplicate snippets of text from many documents?
>>
>> Thanks, and I really appreciate your help.
>>
>> Jian
>>
>> On 3/20/07, Mark Harwood (JIRA) <[EMAIL PROTECTED]> wrote:
>> >
>> > [ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>> >
>> > Mark Harwood updated LUCENE-725:
>> >
>> > Attachment: NovelAnalyzer.java
>> >
>> > Updated version can now process any number of documents and remove
>> > "boilerplate" text tokens such as copyright notices etc.
>> > New version automatically maintains only a sliding window of content in
>> > which it searches for duplicate paragraphs, enabling it to process
>> > unlimited numbers of documents.
>> >
>> > > NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out
>> > > all "boilerplate" text
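As a rough illustration of the data structures Mark describes in steps 1)-3) above, something like the following could be used. This is only a sketch under my own naming assumptions (ChainedToken fields, TokenChains class); the real NovelAnalyzer additionally does the pruning, stop-word detection and the fast-path comparison of step 5).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One occurrence of a term, chained backwards to the previous term in the same doc.
class ChainedToken {
    final String term;           // the current term
    final ChainedToken previous; // the term emitted just before it in this document
    final int docId;

    ChainedToken(String term, ChainedToken previous, int docId) {
        this.term = term;
        this.previous = previous;
        this.docId = docId;
    }
}

class TokenChains {
    // one entry per unique term; the value lists every occurrence seen in the window
    private final Map<String, List<ChainedToken>> tokens =
            new HashMap<String, List<ChainedToken>>();

    // Record a new occurrence while tokenizing a document.
    ChainedToken record(String term, ChainedToken previousInDoc, int docId) {
        ChainedToken occurrence = new ChainedToken(term, previousInDoc, docId);
        List<ChainedToken> occurrences = tokens.get(term);
        if (occurrences == null) {
            occurrences = new ArrayList<ChainedToken>();
            tokens.put(term, occurrences);
        }
        occurrences.add(occurrence);
        return occurrence;
    }

    // Step 4): fetch all previous occurrences of a term; each one's 'previous'
    // chain can be walked backwards to look for a sequence already seen earlier.
    List<ChainedToken> previousOccurrences(String term) {
        List<ChainedToken> occurrences = tokens.get(term);
        return occurrences == null ? new ArrayList<ChainedToken>() : occurrences;
    }
}
```

Walking the `previous` chain of a current occurrence in parallel with the chain of an older occurrence is what detects a repeated sequence of the configured length.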
(LUCENE-835) An IndexReader with run-time support for synonyms
Hi, Mark,

Thanks for providing this original approach for synonyms. I read through your code and think maybe this could be extended to handle the word stemming problem as well. Here is my thought (a rough sketch follows this thread):

1) Before indexing, create a Map<String, ArrayList<String>> stemmedWordMap, where the key is the stemmed word.

2) At indexing time, we still index the word as it is, but we also stem the word (using PorterStemmer) and then insert/update the stemmedWordMap to add the mapping stemmedWord <=> word. For example, "lighting" and "lighted" will both be stored in the ArrayList under the key "light".

3) At query time, when someone searches on "lighting", we stem the word to "light", then look up the synonyms for this word in the stemmedWordMap. In this case, we find "lighted". Then, we perform the search using the synonym search.

This way, we can combine both the synonyms and the stemmed words together. The nice part is that we only need to store the index with the original words, saving disk space as well as indexing time.

However, I do have the following concerns:

1) As documents could be removed from the index, the stemmedWordMap needs to be somehow kept up to date. Could this be done periodically by rebuilding the stemmedWordMap?

2) Typically, people would like to see their exact match first. So, the synonym search could be enhanced to take advantage of position-level boosting (payload for position). The search results for "lighting" should then rank documents with "lighting" higher than documents with "lighted".

3) I am still not sure if this is the best approach in general. Does it make sense to keep two indexes, one with the original words indexed, the other with all words stemmed? Then searching would be run against both indexes.

4) How does Google perform this type of search? I guess the web search engines have a different approach. There may be no need for a stemmer at all. First, the web documents are huge; searching for "lighting" will bring up enough results, so who cares about bringing back results with "lighted"? Second, the anchor texts that point to a web page of interest would contain all the variants (synonyms and stemmed words), so they don't need to worry about search results being incomplete. For example, search for "rectangular" in Google, http://www.google.com/search?hl=en&q=rectangular&btnG=Search, and the Wikipedia page comes up first. It only contains "Rectangle"; however, if you click on the Cached link, you will see that "rectangular" is contained in the anchor text that points to this page.

My ultimate question: if I want to build a search engine, as a general rule, what's the best way to do it? Mark, could you shed some light?

Thanks,

Jian

On 3/18/07, Mark Harwood (JIRA) <[EMAIL PROTECTED]> wrote:

[ https://issues.apache.org/jira/browse/LUCENE-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-835:

Attachment: TestSynonymIndexReader.java

> An IndexReader with run-time support for synonyms
> -
>
> Key: LUCENE-835
> URL: https://issues.apache.org/jira/browse/LUCENE-835
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.1
> Reporter: Mark Harwood
> Assigned To: Mark Harwood
> Attachments: Synonym.java, SynonymIndexReader.java, SynonymSet.java, TestSynonymIndexReader.java
>
>
> These classes provide support for enabling the use of synonyms for terms in an existing index.
> While Analyzers can be used at Query-parse time or Index-time to inject synonyms, these are not always satisfactory means of providing support for synonyms:
> * Index-time injection of synonyms is less flexible because changing the lists of synonyms requires an index rebuild.
> * Query-parse-time injection is awkward because special support is required in the parser/query logic to recognise and cater for the tokens that appear in the same position. Additionally, any statistical analysis of the index content via TermEnum/TermDocs etc does not consider the synonyms unless specific code is added.
> What is perhaps more useful is a transparent wrapper for the IndexReader that provides a synonym-ized view of the index without requiring specialised support in the calling code. All of the TermEnum/TermDocs interfaces remain the same, but behind the scenes synonyms are being considered/applied silently.
> The classes supplied here provide this "virtual" view of the index, and all queries or other code that examines this index using the special reader benefit from this view without requiring specialized code. A JUnit test illustrates this code in action.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
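To make the stemmed-word map idea from Jian's message above concrete, here is a minimal sketch. It is my own illustration, not part of the attached patch; the class name StemmedWordMap is made up, and the single-word stemming helper assumes Lucene's PorterStemFilter over a one-token stream (any stemmer would do).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

class StemmedWordMap {
    // key: stemmed form, value: the original word variants seen at index time
    private final Map<String, List<String>> stemmedWordMap =
            new HashMap<String, List<String>>();

    // Stem a single word by running it through a one-token analysis chain.
    private static String stem(String word) throws IOException {
        TokenStream ts = new PorterStemFilter(
                new WhitespaceTokenizer(new StringReader(word)));
        Token t = ts.next();
        return t == null ? word : t.termText();
    }

    // Called while indexing: the word is indexed as-is elsewhere, but the
    // mapping stemmedWord <=> word is recorded here.
    void record(String word) throws IOException {
        String stemmed = stem(word);
        List<String> variants = stemmedWordMap.get(stemmed);
        if (variants == null) {
            variants = new ArrayList<String>();
            stemmedWordMap.put(stemmed, variants);
        }
        if (!variants.contains(word)) {
            variants.add(word);
        }
    }

    // Called at query time: "lighting" -> stem "light" -> {"lighting", "lighted"},
    // which can then be fed into a synonym-style query expansion.
    List<String> variantsFor(String queryWord) throws IOException {
        List<String> variants = stemmedWordMap.get(stem(queryWord));
        return variants == null ? new ArrayList<String>() : variants;
    }
}
```

The map itself would still need the persistence and rebuild-on-delete handling raised in concern 1) above; that bookkeeping is deliberately left out of this sketch.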
Re: Large scale sorting
Hi, Doug,

I have been thinking about this as well lately and have some thoughts similar to Paul's approach.

Lucene has the norm data for each document field. Conceptually it is a byte array with one byte per document for each indexed field. At query time, I think the norm array is loaded into memory the first time it is accessed, allowing for efficient lookup of the norm value for each document.

Now, if we could use integers to represent the sort field values, which is typically the case for most applications, maybe we can afford to have the sort field values stored on disk and do a disk lookup for each document matched? The lookup of the sort field value would be as simple as seeking to docNo * 4 (plus any base offset) and reading the value.

This way, we use the same approach as constructing the norms (proper merging for incremental indexing), but at search time we don't load the sort field values into memory; instead, we just keep them on disk.

Will this approach be good enough? Thanks for your feedback.

Jian

On 4/9/07, Doug Cutting <[EMAIL PROTECTED]> wrote:

Paul Smith wrote:
> Disadvantages to this approach:
> * It's a lot more I/O intensive

I think this would be prohibitive. Queries matching more than a few hundred documents will take several seconds to sort, since random disk accesses are required per matching document. Such an approach is only practical if you can guarantee that queries match fewer than a hundred documents, which is not generally the case, especially with large collections.

> I'm working on the basis that it's a LOT harder/more expensive to simply
> allocate more heap size to cover the current sorting infrastructure.
> One hits memory limits faster. Not everyone can afford 64-bit hardware
> with many Gb RAM to allocate to a heap. It _is_ cheaper/easier to build
> a disk subsystem to tune this I/O approach, and one can still use any
> RAM as buffer cache for the memory-mapped file anyway.

In my experience, raw search time starts to climb towards one second per query as collections grow to around 10M documents (in round figures and with lots of assumptions). Thus, searching on a single CPU is less practical as collections grow substantially larger than 10M documents, and distributed solutions are required. So it would be convenient if sorting is also practical for ~10M document collections on standard hardware.

If 10M strings with 20 characters are required in memory for efficient search, this requires 400MB. This is a lot, but not an unusual amount on today's machines. However, if you have a large number of fields, then this approach may be problematic and force you to consider a distributed solution earlier than you might otherwise.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
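A minimal sketch of the file layout being proposed (the file name, layout and helper class are my own assumptions): one 4-byte integer per document, written in docId order, so a matched document's sort value can be found with a single seek.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

class DiskSortValues {

    // Write one int per document, in docId order (done at index/merge time).
    static void write(File file, int[] sortValueByDocId) throws IOException {
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)));
        try {
            for (int docId = 0; docId < sortValueByDocId.length; docId++) {
                out.writeInt(sortValueByDocId[docId]);
            }
        } finally {
            out.close();
        }
    }

    // Look up a matched document's sort value at search time: offset is docId * 4.
    static int read(RandomAccessFile file, int docId) throws IOException {
        file.seek((long) docId * 4);
        return file.readInt();
    }
}
```

Doug's objection below applies directly to the `read` call: one random seek per matching document is exactly the cost that becomes prohibitive once queries match more than a few hundred documents.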
Re: Large scale sorting
Hi, Paul,

Thanks for your reply. Regarding your previous email about the need for a disk-based sorting solution, I largely agree with your points. One incentive for your approach is that we would no longer need to warm up the index when the index is huge. In our application, we have to sync up the index pretty frequently, and the warm-up of the index is killing it.

To address your concern about a single sort locale, what about creating a sort field for each sort locale? So, if you have, say, 10 locales, you will have 10 sort fields, each built using the same mechanism as constructing the norms. At query time, in the HitCollector, for each doc id matched, you can load the field value (an integer) through the IndexReader (here you would need to enhance the IndexReader to be able to load the sort field values). Then, you can use that value to reject/accept the doc, or factor it into the score.

What do you think?

Jian

On 4/9/07, Paul Smith <[EMAIL PROTECTED]> wrote:
> Now, if we could use integers to represent the sort field values,
> which is typically the case for most applications, maybe we can
> afford to have the sort field values stored on disk and do a disk
> lookup for each document matched? The lookup of the sort field value
> would be as simple as seeking to docNo * 4.
>
> This way, we use the same approach as constructing the norms (proper
> merging for incremental indexing), but at search time we don't load
> the sort field values into memory; instead, we just keep them on disk.
>
> Will this approach be good enough?

While a nifty idea, I think this only works for a single sort locale. I initially came up with a similar idea: since the terms are already stored in 'sorted' order, one might be able to use a term's position for sorting; it's just that the term's ordering position is different in different locales.

Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
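A hedged sketch of the query-time side of this idea, using the Lucene 2.x HitCollector API. The per-locale value file, the class name and the result handling are placeholders, not an existing implementation; one such file (and one collector) would exist per sort locale.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

import org.apache.lucene.search.HitCollector;

// 'sortValues' is the on-disk int array for the locale being sorted on,
// laid out as one 4-byte value per docId (see the write/read sketch above).
class DiskSortedCollector extends HitCollector {
    private final RandomAccessFile sortValues;

    DiskSortedCollector(RandomAccessFile sortValues) {
        this.sortValues = sortValues;
    }

    public void collect(int doc, float score) {
        try {
            sortValues.seek((long) doc * 4);
            int sortKey = sortValues.readInt();
            // Insert (doc, sortKey, score) into a bounded priority queue here,
            // or use sortKey to reject/accept the doc or adjust its score.
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

It would be passed to `searcher.search(query, new DiskSortedCollector(file))`; the trade-off Paul and Doug discuss is exactly the per-document seek inside `collect`.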
Re: Large scale sorting
Hi, Paul,

I think whether to warm up or not needs some benchmarking for the specific application.

For the implementation of the sort fields, when I talk about norms in Lucene, I am thinking we could borrow the same implementation as the norms to do it. But on a higher level, my idea is really just to create an array of integers for each sort field. The array length is the number of docs in the index. Each integer corresponds to a displayable string value. For example, if you have a field of different colors, you can assign integers like this:

0 <=> white
1 <=> blue
2 <=> yellow
...

Thus, you don't need to use strings for sorting. For example, if documents 0, 1, 2 store the colors blue, white, yellow respectively, the array would be: {1, 0, 2}.

To do the sorting, this array could be pre-loaded into memory (warming up the index), or, while collecting the hits (in a HitCollector), the relevant integer values could be loaded from disk given a doc id. If you have 10 million documents, one sort field gives you a 10M x 4 bytes = 40 MB array.

Cheers,

Jian

On 4/9/07, Paul Smith <[EMAIL PROTECTED]> wrote:
> In our application, we have to sync up the index pretty frequently,
> and the warm-up of the index is killing it.

Yep, it speeds up the first sort, but at the cost of making all the others slower (maybe significantly so). That's obviously not ideal but could make the use of sorts in larger indexes practical.

> To address your concern about a single sort locale, what about
> creating a sort field for each sort locale? So, if you have, say,
> 10 locales, you will have 10 sort fields, each built using the same
> mechanism as constructing the norms.

I really don't understand norms properly so I'm not sure exactly how that would help. I'll have to go over your original email again to understand.

My main goal is to get some discussion going amongst the community, which hopefully we've kicked along.

Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
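To illustrate the ordinal idea, here is a small self-contained sketch. The field name and values are made up, and here the ordinals are assigned in alphabetical order so that comparing the integers matches the string order (which is why the numbers differ from the arbitrary numbering in the example above).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class OrdinalSortField {
    public static void main(String[] args) {
        // one value per docId, e.g. the "color" field of docs 0, 1, 2
        String[] valueByDoc = { "blue", "white", "yellow" };

        // assign each distinct value an ordinal in the desired sort order
        SortedSet<String> distinct = new TreeSet<String>(Arrays.asList(valueByDoc));
        Map<String, Integer> ordinal = new HashMap<String, Integer>();
        int next = 0;
        for (String value : distinct) {
            ordinal.put(value, next++);
        }

        // the per-document sort key array: 4 bytes per doc, ~40 MB for 10M docs
        int[] sortKeyByDoc = new int[valueByDoc.length];
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            sortKeyByDoc[doc] = ordinal.get(valueByDoc[doc]);
        }

        // sortKeyByDoc can now be pre-loaded into memory or written to disk
        // and read per matched doc inside a HitCollector.
        System.out.println(Arrays.toString(sortKeyByDoc)); // prints [0, 1, 2]
    }
}
```

For locale-sensitive sorting, the TreeSet would be built with a Collator for that locale, which is precisely why one such array per locale is needed.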
Re: Large scale sorting
I agree. This falls into the area where a technical limit is reached. Time to modify the spec.

I thought about this issue over the last couple of days; there is really NO silver bullet. If the field is a multi-value field and the distinct field values are not too many, you might reduce memory usage by storing the field as a bitset, with each bit corresponding to a distinct value. But either way, you have to load the whole thing into memory for good performance.

Jian

On 4/10/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

: I'm wondering then if the Sorting infrastructure could be refactored
: to allow with some sort of policy/strategy where one can choose a
: point where one is not willing to use memory for sorting, but willing
...
: To accomplish this would require a substantial change to the
: FieldSortHitQueue et al, and I realize that the use of NIO

I don't follow ... why couldn't this be implemented entirely via a new SortComparatorSource? (you would also need something to create your file, but that could probably be done as a decorator or subclass of IndexWriter, couldn't it?)

: immediately pins Lucene to Java 1.4, so I'm sure this is
: controversial. But, if we wish Lucene to go beyond where it is now,

Java 1.5 is controversial, Lucene already has 1.4 dependencies.

: I think we need to start thinking about this particular problem
: sooner rather than later.

It depends on your timeline; Lucene's gotten pretty far with what it's got. Personally I'm banking on RAM getting cheaper fast enough that I won't ever need to worry about this.

If I needed to support sorting on lots of fields with lots of different locales, and my index was big enough that I couldn't feasibly keep all of the FieldCaches in memory on one box, I wouldn't partition the index across multiple boxes and merge results with a MultiSearcher ... I'd clone the index across multiple boxes and partition the traffic based on the field/locale it's searching on. It's a question of cache management: if I know I have two very different use cases for a Solr index, I partition those use cases to separate tiers of machines to get better cache utilization. FieldCache is just another type of cache.

-Hoss

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
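A quick sketch of the bitset idea for a multi-value field with few distinct values (the class and method names are illustrative): each document gets a small BitSet with one bit per distinct value, so memory is roughly numDocs * numDistinctValues / 8 bytes instead of one String or int per value occurrence.

```java
import java.util.BitSet;

class MultiValueFieldBits {
    private final String[] distinctValues; // fixed order, e.g. {"red", "blue", ...}
    private final BitSet[] bitsByDoc;      // one small BitSet per document

    MultiValueFieldBits(String[] distinctValues, int numDocs) {
        this.distinctValues = distinctValues;
        this.bitsByDoc = new BitSet[numDocs];
        for (int doc = 0; doc < numDocs; doc++) {
            bitsByDoc[doc] = new BitSet(distinctValues.length);
        }
    }

    // Called at load time: mark that this doc carries distinctValues[valueIndex].
    void set(int doc, int valueIndex) {
        bitsByDoc[doc].set(valueIndex);
    }

    // Called at query time: does this doc carry the given value?
    boolean hasValue(int doc, int valueIndex) {
        return bitsByDoc[doc].get(valueIndex);
    }

    String valueName(int valueIndex) {
        return distinctValues[valueIndex];
    }
}
```

As noted above, this only shrinks the in-memory footprint; the whole structure still has to be resident for good performance.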
using FieldCache or storing quality score in lucene term index
Hi,

This is probably a question for the user list. However, as it relates to performance and also to the Lucene index format, I think it is better to ask the gurus on this list ;-)

In my application, I have implemented a quality score for each document. For each search performed, the relevancy score is first computed using the Lucene scoring; then the relevancy score is combined with the quality score to produce the final score for the document.

For storing the quality score, I could use the FieldCache feature and load the quality scores as a byte array into memory when warming up the index. However, I pay the price of the warm-up.

Alternatively, if I store the quality score in the term index, alongside each posting (i.e. term -> <doc, freq, quality score>), there is no need to warm up the index. But I guess the index would be significantly bigger, since for each term the quality score of a document is stored.

I haven't done any testing yet to see which way is better. But, in general, could anyone give me some advice on which way is better? I think it could be a classic time vs. space issue in computer science, but I would still like to get the opinions of you gurus.

Thanks in advance.

Jian
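For the FieldCache option, a minimal sketch of what the query-time combination could look like (the field name "quality", the 0-255 scaling and the combination formula are all assumptions; only the FieldCache and HitCollector calls are standard Lucene 2.x API).

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

class QualityScoreSearch {
    static void search(IndexReader reader, Query query) throws IOException {
        // the warm-up cost is paid here, on first access per reader
        final byte[] quality = FieldCache.DEFAULT.getBytes(reader, "quality");

        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                // combine Lucene relevancy with the per-document quality score
                float combined = score * (1.0f + (quality[doc] & 0xFF) / 255.0f);
                // collect (doc, combined) into whatever result structure is used
            }
        });
    }
}
```

The alternative of storing the quality score with each posting avoids this warm-up but, as noted above, duplicates the per-document value once per term that the document contains.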
Re: possible segment merge improvement?
Hi, Robert,

That's a brilliant idea! Thanks so much for suggesting that.

Cheers,

Jian

On 10/31/07, robert engels <[EMAIL PROTECTED]> wrote:
>
> Currently, when merging segments, every document is parsed and then
> rewritten, since the field numbers may differ between the segments
> (compressed data is not uncompressed in the latest versions).
>
> It would seem that in many (if not most) Lucene uses, the fields
> stored within each document in an index are relatively static,
> probably changing for all documents added after point X, if at all.
>
> Why not check the fields dictionary for the segments being merged,
> and if the same, just copy the binary data directly?
>
> In the common case this should be a vast improvement.
>
> Anyone worked on anything like this? Am I missing something?
>
> Robert Engels
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
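A minimal sketch of the proposed pre-check (my own illustration, not existing Lucene code): obtaining each segment's field names in field-number order is left out here. If the numbering matches, stored documents could in principle be copied as raw bytes instead of being parsed and rewritten.

```java
import java.util.List;

final class MergeFieldNumberingCheck {
    // Identical names in identical order implies identical field numbers,
    // so the stored-field bytes of one segment could be appended verbatim
    // to the merged segment; otherwise fall back to the current re-parse path.
    static boolean sameFieldNumbering(List<String> fieldsOfSegmentA,
                                      List<String> fieldsOfSegmentB) {
        return fieldsOfSegmentA.equals(fieldsOfSegmentB);
    }
}
```

The raw byte copy itself would live inside the segment merger, where the stored-fields stream for each document is already available.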