Lucene Dynamic http Web Page Search
Hi, I am working on adding a search feature to a web site that uses single database-driven aspx pages and would like to know if Lucene can index from an http URL address or directly from the database. At present I can only see Lucene being able to search physical files in a Windows folder. Any ideas?
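Lucene itself only indexes text you hand it - it has no built-in fetcher - so the usual pattern is fetch-then-index (Nutch is the best-known crawler built on top of Lucene). A minimal sketch against the 2.x-era Java API, with a hypothetical URL and field names:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Fetch the rendered page over HTTP, exactly as a browser would see it.
    URL url = new URL("http://example.com/page.aspx?id=42");
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    StringBuilder body = new StringBuilder();
    for (String line; (line = in.readLine()) != null; ) body.append(line).append('\n');
    in.close();

    // Index the fetched text; the same loop works for rows pulled from the database.
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("content", body.toString(), Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();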
Re: Lucene Dynamic http Web Page Search
Thanks for the prompt reply. I will look for something similar for the .NET version. I posted in this group as it is more active.
BlockJoinQuery Clarification
Hello, I've been looking at the BlockJoinQuery in Lucene 3.4.0 and would like to clarify my understanding. Suppose we have a parent document that we index with (say) 4 child documents. My understanding is that these go in as an atomic unit, which allows us to query and join across the documents. Now say I wanted to update one of the child documents (only). If the child document was updated with a standard update, I presume the join to the parent is broken. If I update using a collection, is it necessary to reindex all of the documents (4 children + parent)? And finally, if I updated one child using a collection with the parent, would both of these documents require reindexing and lose their affinity with the other children? Thanks, C
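For reference: the block is written with IndexWriter.addDocuments(), children first and parent last, and because the join relies purely on that adjacency, changing any member means deleting and re-adding the whole block (updateDocuments() does both in one call). A minimal sketch against the 3.x API, with illustrative field names (assumes java.util and org.apache.lucene.document/index imports and an open IndexWriter 'writer'):

    // Children first...
    List<Document> block = new ArrayList<Document>();
    for (int i = 0; i < 4; i++) {
        Document child = new Document();
        child.add(new Field("childBody", "child text " + i, Field.Store.NO, Field.Index.ANALYZED));
        block.add(child);
    }
    // ...parent last, marked so a parent filter can find it.
    Document parent = new Document();
    parent.add(new Field("type", "parent", Field.Store.NO, Field.Index.NOT_ANALYZED));
    block.add(parent);
    writer.addDocuments(block);

    // To 'update' one child, re-add the entire block (4 children + parent):
    writer.updateDocuments(new Term("blockId", "42"), block); // assumes each block carries a blockId key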
JoinUtil.createJoinQuery in 3.6 ?
Hi Guys, Will this be available in Lucene 3.6 or is it only going into version 4.0 ? Clive
createJoinQuery use
Hi Chaps, JoinUtil.createJoinQuery() specifies a Query for the from side of the join. Is it possible to query over both sides of the join (while still providing the two join fields)? If not, what is the recommended best practice to do this? Thanks, and apologies for the dumb questions. C
Re: JoinUtil.createJoinQuery in 3.6 ?
Thank you for the feedback, that is very useful :-)

From: Martijn v Groningen To: [email protected] Sent: Thursday, March 29, 2012 3:14 PM Subject: Re: JoinUtil.createJoinQuery in 3.6 ?

Some extra notes: 1) The implementation in 3.6 is about 3 times slower. I noticed this during some tests that I ran on my test data set. The 3.6 implementation is definitely usable, also on larger indices. I believe the average query time was around 350 ms on an index containing 10M documents. 2) The fromField can only contain 1 term per document in the 3.6 impl, whilst the trunk (4.0) doesn't have this limitation. Martijn

On 29 March 2012 14:39, Michael McCandless wrote: > It'll be in both 3.6 and 4.0. > > Mike McCandless > > http://blog.mikemccandless.com

-- Met vriendelijke groet, Martijn van Groningen
Re: createJoinQuery use
Hi Martijn, Thank you for responding so quickly. I found my reply was stuck in my drafts folder, so apologies for not getting back sooner. I'm trying to search and join across two disparate document types that have some common features, and your response confirms what I was looking into. My understanding of the join query is that it does a 'search' for the from field and then uses those results as part of the second or 'real' query. The first of these 'queries' I believe is an internal mechanism that does not return docs, but smaller, more efficient objects. So if I wanted to search across two different types of document in one index, with some fields on one doc type and some on the other, I effectively need to perform 4 queries. This is kind of where I was coming from. Thanks, Clive

From: Martijn v Groningen To: [email protected] Sent: Thursday, March 29, 2012 3:23 PM Subject: Re: createJoinQuery use

It is only possible to join on one side at the same time. You mean something like this: Query fromQuery = JoinUtil.createJoinQuery("from", "to", actualQuery, indexSearcher); Query toQuery = JoinUtil.createJoinQuery("to", "from", actualQuery, indexSearcher); And then use a boolean query to combine it? What is it you want to achieve? Martijn

-- Met vriendelijke groet, Martijn van Groningen
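A sketch of the boolean combination Martijn suggests, reusing the names from his snippet (the argument order mirrors his example rather than any authoritative signature):

    Query fromSide = JoinUtil.createJoinQuery("from", "to", actualQuery, indexSearcher);
    Query toSide   = JoinUtil.createJoinQuery("to", "from", actualQuery, indexSearcher);
    BooleanQuery combined = new BooleanQuery();
    combined.add(fromSide, BooleanClause.Occur.SHOULD); // matches joined from one side
    combined.add(toSide, BooleanClause.Occur.SHOULD);   // ...or the other
    TopDocs hits = indexSearcher.search(combined, 10);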
StandardAnalyzer functionality change
Hi all, Sorry if I'm asking an age old question, but we have migrated to Lucene 3.6.0 and I see StandardAnalyzer has changed its behaviour, particularly when tokenizing email addresses. From reading the forums, I understand StandardAnalyzer was renamed to ClassicAnalyzer - is this the case? If I pass the string '[email protected]' through these analyzers, I get the following tokens: Using StandardAnalyzer(Version.LUCENE_23): --> [email protected] (one token) Using StandardAnalyzer(Version.LUCENE_36): --> user domain.com (two tokens) Using ClassicAnalyzer(Version.LUCENE_36): --> [email protected] (one token) StandardAnalyzer is normally a good compromise as a default analyzer, but the failure to keep an email address intact makes it less fit for purpose than it used to be. Is this a bug or is it by design? If by design, what is the reason for the change, and is ClassicAnalyzer now the de facto analyzer to use? Thanks, Clive
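The difference is easy to reproduce with a small token dump; a sketch against the 3.6 API ("f" is an arbitrary field name; assumes the usual analysis imports plus java.io.StringReader):

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36); // swap in ClassicAnalyzer to compare
    TokenStream ts = analyzer.tokenStream("f", new StringReader("[email protected]"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString()); // one line per emitted token
    }
    ts.end();
    ts.close();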
Re: StandardAnalyzer functionality change
Thanks for the responses chaps, very informative, and most appreciated :-)

From: Ian Lea To: [email protected] Sent: Wednesday, October 24, 2012 4:19 PM Subject: Re: StandardAnalyzer functionality change

If you want email addresses, UAX29URLEmailAnalyzer is another alternative. -- Ian.

On Wed, Oct 24, 2012 at 3:56 PM, Jack Krupansky wrote: > Yes, by design. StandardAnalyzer implements "simple word boundaries" (the > technical term is "Unicode text segmentation"), period. As the javadoc says, > "As of Lucene version 3.1, this class implements the Word Break rules from > the Unicode Text Segmentation algorithm, as specified in Unicode Standard > Annex #29." That is a "standard". > > See: > http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html > http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html > > -- Jack Krupansky
Re: StandardAnalyzer functionality change
I did some tests and found that for our needs, ClassicAnalyzer was better (backwards compatible). Our analyzer uses different tokenizers on certain fields but (used to) fall back to StandardAnalyzer by default. ClassicAnalyzer will meet our needs, but I see we should move on to a newer implementation such as the email-specific analyzer going forward. Thanks for the clarification.
Lucene 3.6.0 Index Size
Hello. We have an index that, when created using Lucene 2.3.2, has a size of about 4G. Creating the same index (with the same parameters) with Lucene 3.6.0 results in an 11G index. Could someone shed some light on why the index is so much larger, given the same data and the same parameters? I realize this is a large version jump, but nearly a trebling in index size does not seem a step in the right direction to me ;-) I am using cfs format. Thanks, Clive
Re: Lucene 3.6.0 Index Size
Hi Vitaly, Your hunch is correct, yes, there are unmerged segments left over. However, to get indexing throughput we use multiple threads on the writer, flushing to disk periodically, but the writer can stay open for some time (until the last thread terminates). However, after an optimize, the index is closed. Thanks for the advice, I need to revisit the merging section of the application. Clive

From: Vitaly Funstein To: [email protected] Sent: Friday, October 26, 2012 8:13 PM Subject: Re: Lucene 3.6.0 Index Size

One thing to keep in mind is that the default merge policy has changed in 3.6 from 2.3.2 (I'm almost certain of that). So it's just a hunch but you may have some unmerged segments left over at the end. Try calling IndexWriter.close(true) after you're done indexing.
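In code, the ordering suggested above looks like this (3.5+ API; the segment cap is arbitrary):

    writer.forceMerge(10); // merge down to at most 10 segments, cheaper than a full optimize
    writer.close(true);    // true == block until background merges finish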
Re: A large number of files in an index (3.6)
Hi Lance, File handles can be a problem, but the instantaneous opening of a great many files at exactly the same time gives a big I/O hit during a query. This is compounded by many indexes on the server that can get hit at the same time. Limiting the number of files per index directory makes a difference. Clive

From: Lance Norskog To: [email protected]; kiwi clive Sent: Sunday, October 28, 2012 11:09 PM Subject: Re: A large number of files in an index (3.6)

An option: instead of merging continuously as you run, you can optimize with 'maxSegments=10'. This means 'optimize, but only until there are 10 segments'. If there are fewer than 10 segments, nothing happens. This lets you schedule merging I/O. Is the number of files a problem due to file space breakage?

- Original Message - | From: "kiwi clive" | To: [email protected] | Sent: Saturday, October 27, 2012 12:44:34 PM | Subject: A large number of files in an index (3.6) | | Hi guys, | | I've recently moved from lucene 2.3 to 3.6. The application uses CF | format. With lucene 2.3, I understood the interaction of merge | factor etc with respect to how many files were created in the index | directory. With a merge factor of 10, the number of files in the | index directory could sometimes get up to 30, but you can see the | merging happen and the number of files would roll up after a while | and settle around 10-15. | | With lucene 3.6, this is not the case. Firstly, even with MergePolicy | set to useCFS, the index appears to be a hybrid of cfs and raw index | format. I can understand that may have been done for performance | reasons, but it does increase the file count considerably. Also the | rollup of the merged segments is not occurring as it did on the | previous version. Originally I set the CFSRatio to 1.0 and found | the behaviour similar to lucene2.3 (file number wise) but this came | at an i/o cost and the machines ran with a higher load average. The | higher i/o starts to affect query performance. Reducing cfsRatio to | 0.1 (default) helped reduce i/o load, but I am running several | thousand concurrent indexes across many disks on the servers and | the larger number of files per index means a large number of files | are being opened when a query hits the index, in addition to the | indexing load. | | I'm sure this is probably down to merge policies and schedules, but | there are quite a few knobs to tweak here, so some guidance as to | the most beneficial parameters to tweak would be very helpful. | | I'm using the LogByteSizeMergePolicy with 3 background merge threads. | I'm considering using TieredMergePolicy and even reducing the number | of merge threads, but there is not much point if it does not roll up | the segments as expected. I can tweak the cfsRatio but this | strikes me as a large hammer and there may be more subtle ways to do | this! | | So tell me I'm being stupid, just say 'derr - why don't you do | this' and I'll be a happy man!! | | Thanks, | Clive
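A sketch of where those knobs live in 3.6 (the values are illustrative, not recommendations; assumes an analyzer and Directory in scope):

    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    mp.setMergeFactor(10);
    mp.setUseCompoundFile(true);
    mp.setNoCFSRatio(1.0); // 1.0 == always write CFS; the 0.1 default leaves large segments non-compound

    ConcurrentMergeScheduler ms = new ConcurrentMergeScheduler();
    ms.setMaxThreadCount(3); // background merge threads

    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    cfg.setMergePolicy(mp);
    cfg.setMergeScheduler(ms);
    IndexWriter writer = new IndexWriter(dir, cfg);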
Re: Lucene 3.6.0 high CPU usage
Having played with merge parameters and various index parameters, it seems possible to trade I/O usage against the number of index files. However, it does appear that this version of Lucene is using more CPU. Is there any reason for this? Is it normal? We can push a large number of documents through the indexer and it seems to work admirably, although I would be happier if the load average came down. Any Lucene devs out there who could shed some light on this behaviour? Thanks, Clive

From: kiwi clive To: "[email protected]" Sent: Wednesday, November 7, 2012 10:58 AM Subject: Lucene 3.6.0 high CPU usage

Hi Guys, Having upgraded from Lucene 2.3.2 to Lucene 3.6.0 and jumping through a few hoops, we note that the CPU usage on the new version is perhaps 20-30% higher. The i/o wait appears about the same on both versions. Can anyone suggest why this should be so, or is the newer Lucene version just more CPU hungry? If not, any tuning suggestions to reduce CPU usage would be gratefully received! Thanks Clive
Re: Lucene 3.6.0 high CPU usage
Hi Ian, Yes, I/O -> CPU, but read on... The throughput is about the same with similar I/O and 20% higher CPU. We have plenty of CPU and I would rather be CPU bound than I/O bound. However, we have increased the number of indexing threads in our application. On Lucene 2.3.2, this gave no performance improvement as the I/O just increased. However, Lucene 3.6.0 ramps up the CPU (user) usage while I/O only increases marginally. Our throughput has trebled!! - an amazing improvement, and all in CPU space. So it appears there have been a lot of performance improvements made in the new version that I was not previously using; the Ferrari was in first gear and now we are taking advantage of these features. This is a truly phenomenal step-change in throughput, thank you Lucene developers! Clive

From: Ian Lea To: [email protected]; kiwi clive Sent: Friday, November 9, 2012 10:04 AM Subject: Re: Lucene 3.6.0 high CPU usage

Are you getting the same, improved or worse performance/throughput? Has the bottleneck switched from IO to CPU? -- Ian.
Re: Combining The results from DB and Index Regd.,
I have used the last solution you mention many times to good effect, as you can sort across the two data sources and merge the results. Obviously it depends on your architecture, RAM and the amount of data you are dealing with. Clive

From: selvakumar netaji To: [email protected] Sent: Tuesday, November 13, 2012 3:15 AM Subject: Combining The results from DB and Index Regd.,

Hi All, We are using Lucene for searching data from the database in our enterprise application. The searches would be in a single index, whose documents are indexed from two different databases, A and B. The frequency of updating database A is linear, i.e. it gets inserts every minute, whereas database B is updated on a weekly basis. The problem is with the indexing of database A. For example, if the indexing completed at second t and a record (d1) gets inserted at second (t+1), then a search for d1 would not hit the index. To avoid this data loss, searching can be performed in the index and in the db (whose records are not in the index). The problem here is that we won't be able to get score-based ordering in the database, and there would be problems in combining the results from the db and from the index. Is there any way to get a Lucene score for the search results from the db? The other alternative would be to update the index every 30 (might be less than that) seconds, so that whenever the db gets updated the index gets updated. Is there any other solution to update the index directly whenever the db gets updated? Can you please suggest. The final solution as I've thought would be to have two indexes, one file system index and an in-memory index. The file system index would be indexed or updated on a daily basis and the in-memory index would be updated whenever the db changes. So we'll search both the indexes and we'll combine the data since both have Lucene scores. So there would not be any data loss. Can anyone suggest whether there is any other solution to avoid this kind of data loss problem.
Re: Combining The results from DB and Index Regd.,
You could do it in page-size chunks. Get the db to do the searching and sorting and return the top page-size records. Do the same for the index. You can then build a ramindex that takes the db output and index output and creates 2*pagesize entries. Apply the same sorting mechanism and return the top page-size records. This way the ramindex only has to be small and the database does the heavy lifting - although at the cost of some sql trickery :-)

From: selvakumar netaji To: [email protected]; kiwi clive Sent: Tuesday, November 13, 2012 11:02 AM Subject: Re: Combining The results from DB and Index Regd.,

Thanks Clive. Can we do this kind of indexing if RAM is limited? There would be two indexes, one in the file system and another an in-memory index, as already mentioned. If the in-memory index has reached a threshold, can we force the manual indexing of the databases which is supposed to happen automatically every day? Then the RAM constraint would also be handled. Clive, are there any other solutions?
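A rough sketch of that merge step against the 3.6 API, assuming page-size result lists from each source and that the fields needed for sorting are stored (topDocsFromIndex, rowsFromDb and the "date" field are placeholders):

    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ram = new IndexWriter(ramDir,
        new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
    for (Document d : topDocsFromIndex) ram.addDocument(d); // page-size hits rebuilt from stored fields
    for (Document d : rowsFromDb) ram.addDocument(d);       // page-size db rows mapped to documents
    ram.close();

    IndexSearcher s = new IndexSearcher(IndexReader.open(ramDir));
    Sort byDate = new Sort(new SortField("date", SortField.LONG, true));
    TopDocs merged = s.search(query, null, pageSize, byDate); // top page-size of the 2*pageSize docs
    s.close();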
Re: Changing behavior of StandardAnalyzer
Hi Bin Lan, This bit me too. You can choose to use StandardAnalyzer and set the version number to 2.9. Otherwise you can try using ClassicAnalyzer, which I believe is the 'old' StandardAnalyzer before it was tidied up. Clive

From: Bin Lan To: [email protected] Sent: Wednesday, November 14, 2012 5:20 AM Subject: Changing behavior of StandardAnalyzer

So currently, if I use StandardAnalyzer to construct a QueryParser, and pass the string "From: [email protected]" to the parser, it returns a query which is "From: someone From: gmail.com". Is there an easy way that I can change this so it returns "From: "someone gmail.com"" instead? We had an in-house Analyzer implementation; it used to do this on Lucene 2.9, but somehow the behavior changed after we upgraded to 3.6.1. Regards -- Bin Lan Software Developer Perimeter E-Security
Re: Using Lucene 2.3 indices with Lucene 4.0
Be aware that StandardAnalyzer changed slightly. This is particularly important if you use it to analyze email addresses and certain text-numeral combinations. My understanding is that the newer version of StandardAnalyzer is more consistent with what it should be doing, but if you relied on its old functionality, that could bite you. There are two solutions that I am aware of: (1) Replace StandardAnalyzer with ClassicAnalyzer, which I believe is the 'old' StandardAnalyzer before it was fixed. (2) Use StandardAnalyzer with Version.LUCENE_23 rather than Version.LUCENE_40. Cheers, Clive

From: Ramprakash Ramamoorthy To: [email protected] Sent: Tuesday, November 20, 2012 10:31 AM Subject: Re: Using Lucene 2.3 indices with Lucene 4.0

On Tue, Nov 20, 2012 at 3:54 PM, Danil ŢORIN wrote: > However, behavior of some analyzers changed. > > So even if, after the upgrade, the old index is readable with 4.0, it doesn't mean > everything still works as before. Thank you Torin. I am using the standard analyzer only and both the systems use Unicode 4.0 and I don't smell any problems here. > On Tue, Nov 20, 2012 at 12:20 PM, Ian Lea wrote: > > You can upgrade the indexes with org.apache.lucene.index.IndexUpgrader. > > You'll need to do it in steps, from 2.x to 3.x to 4.x, but should work > > fine as far as I know. > > -- > > Ian. Thank you Ian, this is giving me some head starts. > > On Tue, Nov 20, 2012 at 10:16 AM, Ramprakash Ramamoorthy < > > [email protected]> wrote: > > > I understand lucene 2.x indexes are not compatible with the latest version > > > of lucene 4.0. However we have all our indexes indexed with lucene 2.3. > > > Now that we are planning to migrate to Lucene 4.0, is there any work > > > around/hack I can do, so that I can still read the 2.3 indices? Or is > > > forgoing the older indices the only option? > > > P.S : Am afraid, re-indexing is not feasible. > > > -- > > > With Thanks and Regards, > > > Ramprakash Ramamoorthy, > > > Chennai, > > > India. -- With Thanks and Regards, Ramprakash Ramamoorthy, Engineer Trainee, Zoho Corporation. +91 9626975420
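For the stepwise upgrade Ian describes, each hop is a run of IndexUpgrader under that major version's jars; a sketch of the 3.x hop (the constructor arguments are from memory of the 3.6 API, so treat them as approximate):

    Directory dir = FSDirectory.open(new File("/path/to/index"));
    new IndexUpgrader(dir, Version.LUCENE_36, null, false).upgrade(); // infoStream=null, keep prior commits
    dir.close();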
BlockJoin and RawTermFilter (lucene 4.0.0)
Hi Guys,
Apologies if this has been asked before, but I could not find an appropriate post.
The Lucene documentation stresses the use of a RawTermFilter when building up the parentFilter, and I was previously using the following in Lucene 3.6.0:

Filter parentQueryFilter = new CachingWrapperFilter(
    new RawTermFilter(new Term("type", "T1")), true);
Having moved to Lucene 4.0.0, I cannot find the RawTermFilter class. It may be
in another module, but I have not been able to find it. Is this filter still
required for BlockJoin, as the Join API does not seem to have changed
appreciably?
Thanks for any help.
Clive
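For what it's worth, the 4.0 join module has no RawTermFilter; as I recall the ToParentBlockJoinQuery javadoc, the parent bits come from wrapping a plain TermQuery instead (treat the exact names as approximate):

    Filter parentFilter = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("type", "T1"))));
    Query childQuery = new TermQuery(new Term("childField", "value"));
    Query joinQuery = new ToParentBlockJoinQuery(childQuery, parentFilter, ScoreMode.Avg);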
Lucene 4.1 org.apache.lucene.document.Field Deprecation
Hi chaps, Lucene 4.1.0: I notice org.apache.lucene.document.Field(String name, String value, Field.Store store, Field.Index index, Field.TermVector termVector) is marked as deprecated, while its suggested replacements (TextField and StringField) do not seem to have support for term vectors. Is there an equivalent of STORED=NO, ANALYZED=YES and TERMVECTORS_WITH_POSITIONS in the new API? Apologies if I've missed something, but I don't want to lose this functionality! Thanks, Clive
Re: Lucene 4.1 org.apache.lucene.document.Field Deprecation
Thank you Mike, Much appreciated :-)

From: Michael McCandless To: [email protected]; kiwi clive Sent: Monday, March 18, 2013 4:14 PM Subject: Re: Lucene 4.1 org.apache.lucene.document.Field Deprecation

You need to create your own FieldType, e.g.: FieldType textWithTermVectors = new FieldType(TextField.TYPE_STORED); textWithTermVectors.setStoreTermVectors(true); textWithTermVectors.setStoreTermVectorPositions(true); Then create new Field("name", "value", textWithTermVectors) and add that to your document. Mike McCandless http://blog.mikemccandless.com
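Putting Mike's fragments together, a self-contained sketch of STORED=NO, ANALYZED=YES with term vectors and positions in the 4.1 API:

    FieldType t = new FieldType(TextField.TYPE_NOT_STORED); // analyzed, not stored
    t.setStoreTermVectors(true);
    t.setStoreTermVectorPositions(true);
    t.freeze(); // lock the type before use

    Document doc = new Document();
    doc.add(new Field("body", "some analyzed text", t));
    writer.addDocument(doc); // assumes an open IndexWriter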
search-time facetting in Lucene
Hello all, Lucene version 3.6.1. Sorry if this is a really stupid question, but is it possible to use search-time facetting on an existing Lucene index without the need to reindex? My (limited) understanding is that FacetsCollector will pull facet data from indexes that have been created with the use of TaxonomyWriter and CategoryDocumentBuilder. It does look like the Bobo contribution does not require index changes (and Solr looks similar), but I was wondering what Lucene does out-of-the-box. So, what I need to achieve is: - (simple) facetted search with raw Lucene without the need to reindex. - use of Solr is not an option but a Lucene version upgrade is. Am I right in thinking the implementation of facetting is different in Solr to that in Lucene? If you could point me to a resource so I can learn more, I'd be very grateful. Many thanks, Clive
Re: search-time facetting in Lucene
Hi Shai, Thanks very much for the reply. I see there is not a quick win here, but as we are going through an index consolidation process, it may pay to make the leap to 4.3 and put in facetting while I'm in there. We will get facetting slowly through the back door while the consolidation runs (we have 10,000+ shards). If it were not for the consolidation required, I think Bobo would have been the way forward. I appreciate you taking the time to explain the situation. Clive

From: Shai Erera To: "[email protected]" ; kiwi clive Sent: Monday, May 6, 2013 5:56 AM Subject: Re: search-time facetting in Lucene

Hi Clive, In order to use Lucene facets you need to make indexing time decisions. It's not that you don't make these decisions anyway, even with Solr -- for example, you need to decide how to tokenize the fields by which you want to facet, or in Lucene 4.0 index them as SortedSetDocValuesField. If you upgrade to Lucene 4.3, you can avoid the use of the taxonomy index, in exchange for real simple facetting, by using SortedSetDocValuesFacetFields, but again you will need to reindex your data. Shai
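For the record, the 4.3 route Shai mentions looks roughly like this at indexing time (class and method names from memory of the 4.3 facet module - treat them as approximate, and note this is exactly why a reindex is needed):

    SortedSetDocValuesFacetFields facetFields = new SortedSetDocValuesFacetFields();
    Document doc = new Document();
    doc.add(new TextField("body", "some text", Field.Store.NO));
    facetFields.addFields(doc, Collections.singletonList(new CategoryPath("author", "clive")));
    writer.addDocument(doc); // facet data must be written at index time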
Re: Closing IndexWriter can be very slow on large indexes
Hi Mike, The problem was due to close(). A shutdown was calling close(), which seems to cause Lucene to perform a merge. For a busy, very large index (with lots of deletes and updates), the merge process could take a very long time to complete (hours). Calling close(false) solved the problem as this appears to close the index without performing the merge. At least that is my understanding of things! Clive

- Original Message - From: Michael McCandless To: [email protected] Sent: Tuesday, July 26, 2011 5:30 PM Subject: Re: Closing IndexWriter can be very slow on large indexes

Which method (abort or close) do you see taking so much time? It's odd, because IW.abort should quickly stop any running BG merges. Can you get a dump of the thread stacks during this long abort/close and post that back? Can't answer if Lucene 3.x will improve this situation until we find the source of the slowness... Mike McCandless http://blog.mikemccandless.com

On Tue, Jul 26, 2011 at 11:33 AM, Chris Bamford wrote: > Hi > > I think I must be doing something wrong, but not sure what. > > I have some long running indexing code which sometimes needs to be shut down > in a hurry. To achieve this, I set a shutdown flag which causes it to break > from the loop and call first abort() and then close(). The problem is that > with a large index (say, 15Gb) in Lucene 2.3.2, it can take over an hour. > (Yes, I know I should be on a later version of Lucene, but that's another > issue - we are stuck with this for now!). > > The IW is opened in autoCommit mode and mergeFactor=10. > > During this closedown stage, the indexes are being constantly updated by > Lucene itself, making me suspect it could be merging. > > Firstly, can someone explain what it is doing under the covers that takes so > long? (And any action I can take to get around it) > > Second, if I were to rebuild the code with say, Lucene 3 and run it in > compatibility mode with the 2.3.2 indexes, would I have a richer set of tools > I could use to overcome the issue? > > Thanks, > > - Chris
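The fix in code form (the boolean is the 2.3/3.x waitForMerges flag):

    writer.close(false); // false == don't wait for background merges; avoids the hours-long stall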
Lucene Version Upgrade (3->4) and Java JVM Versions (6->8)
Hello guys, We currently run with Lucene 3.6 and Java6. In view of the fact that Java7 is soon to be deprecated, we are keen to move to Java8 and also to move to the latest version of Lucene. I understand Lucene 5 is coming, although we are happy to move to 4.x as there are lots of goodies there we can use. I seem to remember reading that certain versions of Lucene were incompatible with some Java versions, although I cannot find anything to verify this. As we have tens of thousands of large indexes, backwards compatibility without the need to reindex on an upgrade is of prime importance to us. Does anyone have any words of wisdom, or better still, pointers to some documentation that would be of use here? I can obviously run some tests, but incompatibilities can be insidious and it would be good to know from the outset if there are any gotchas before embarking along this road. In an ideal world we would have Java8 + Lucene 4.x reading a Lucene 3.6 index (that was created with Java6). Then we would write to the Lucene 3.6 index using Java8 and Lucene 4.x. Any suggestions would be most welcome! Many thanks, Clive
Re: Lucene Version Upgrade (3->4) and Java JVM Versions(6->8)
Hi Hoss, Many thanks for the information. This looks very encouraging as the Java7 bug I remember was fixed and, as far as I know, we should not be affected by the others. I'll put a few tests together and put my toe in the water :-) Clive

From: Chris Hostetter To: "[email protected]" ; kiwi clive Sent: Tuesday, January 27, 2015 4:01 PM Subject: Re: Lucene Version Upgrade (3->4) and Java JVM Versions (6->8)

All known JVM bugs affecting Lucene are listed here... https://wiki.apache.org/lucene-java/JavaBugs -Hoss http://www.lucidworks.com/
Search Performance with NRT
Hi Guys, We are considering changing our Lucene indexer/search architecture from 2 separate JVMs to a single one, to benefit from the very latest index views NRT readers provide. In the past we cached our IndexSearchers to avoid cold searches every time and reopened them periodically. In the single-JVM model, where we will be keeping the IndexWriters open for long periods, will we still face the same problem, or will calling searcherManager.maybeRefresh() periodically be enough to guarantee fast searches (as well as near-real-time views)? (We intend to instantiate our SearcherManager with the IndexWriter rather than a Directory.) Thanks, Clive
Re: Search Performance with NRT
Hi Mike, Thanks for the very prompt and clear response. We look forward to using the new (new for us) Lucene goodies :-) Clive

From: Michael McCandless To: Lucene Users ; kiwi clive Sent: Thursday, May 28, 2015 2:34 AM Subject: Re: Search Performance with NRT

As long as you call SM.maybeRefresh from a dedicated refresh thread (not from a query's thread) it will work well. You may want to use a warmer so that the new searcher is warmed before becoming visible to incoming queries ... this ensures any lazy data structures are initialized by the time a query sees them. Mike McCandless http://blog.mikemccandless.com
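A sketch of the resulting single-JVM setup, with the refresh on its own thread as Mike advises (the interval and the running flag are illustrative; a custom SearcherFactory is where warming would go):

    final SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

    // Dedicated refresh thread - never refresh from a query thread.
    new Thread(new Runnable() {
        public void run() {
            while (running) {
                try {
                    manager.maybeRefresh(); // cheap no-op if nothing changed
                    Thread.sleep(1000);     // refresh interval, tune to taste
                } catch (Exception e) { /* log */ }
            }
        }
    }).start();

    // Query side: always acquire/release in pairs.
    IndexSearcher searcher = manager.acquire();
    try {
        TopDocs hits = searcher.search(query, 10);
    } finally {
        manager.release(searcher);
    }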
Lucene Searcher Caching and Performance
Hi Guys, We have an index/query server that contains several thousand fairly hefty indexes. Each searcher is shared between many 'user-threads' and, once opened, we keep the searcher in a cache which is refreshed depending on how often it is used. Due to memory limitations on the server, we need some kind of LRU mechanism to drop unused searchers to make way for newer ones. We are seeing load spikes when we get hit by queries that try to open several non-cached searchers at the same (or at least a small delta) time. This looks to be the disks struggling to open all the appropriate files for that period, and it takes a little while for the server to return to normal operating limits thereafter. Given that upgrading hardware/memory is not currently an option, we need a way to smooth over these spikes, even if it is at the cost of slowing query performance overall. It strikes me that if we could cache all of our searchers on the machine (i.e. have all of our indexes 'open for business'), possibly having to alter kernel parameters to cater for the large number of file handles, without caching many query results, this might solve the problem without pushing memory usage too high. Also, the higher number of searchers stored in the heap is going to steal space from the Lucene file cache, so is there a recommended mechanism for doing this? So is there a way to minimize the searcher cache memory footprint, to possibly keep more of them in memory at the cost of storing less data? Any insight would be most appreciated. Thanks, Clive
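One plain-Java shape for the LRU part, closing managers on eviction (the capacity is illustrative; searchers already acquired stay valid until released, so eviction does not yank an in-flight query):

    final int MAX_OPEN = 500; // tune to available RAM and file handles
    Map<String, SearcherManager> cache = Collections.synchronizedMap(
        new LinkedHashMap<String, SearcherManager>(16, 0.75f, true) { // true == access-order LRU
            protected boolean removeEldestEntry(Map.Entry<String, SearcherManager> eldest) {
                if (size() > MAX_OPEN) {
                    try { eldest.getValue().close(); } catch (IOException e) { /* log */ }
                    return true;
                }
                return false;
            }
        });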
How do I write in 3.x format to an upgraded index using Lucene 4.10
Hi Guys

We have several hundred thousand indexes that have been written in Lucene 3.x format. I am trying to migrate to Lucene 4.10 without the need to reindex, and the process should be transparent to our customers. Reindexing all our legacy data is not an option. The predominant analyzer we currently use is ClassicAnalyzer, as we needed to be backwards compatible with the old StandardAnalyzer from pre-Lucene 3.x days.

Our latest application uses 4.10 Lucene jars and we knobble it to use Lucene 3.x format. When we create IndexWriters, we are doing this:

IndexWriterConfig idxCfg = new IndexWriterConfig(Version.LUCENE_3_6, new ClassicAnalyzer());

New indexes could be written in Lucene 4.10 format and we aim to apply newer analyzers to these new indexes. So all new index reading and writing should be fine. We need to query Lucene 4.10 indexes with Lucene 4.10 analyzers, and our architecture is such that we can query Lucene 3.x indexes with Lucene 3.x analyzers (using Lucene Versions etc). However, there is a difference between how Lucene 3.x and Lucene 4.10 write indexes which breaks phrase queries.

Lucene 3.x

Lucene 3.x seems to write tokens to the index adjacent to each other (and I assume the positions are stored elsewhere). This means if we index "Thanks for coming", it gets indexed after stop-word removal as:

"thanks", "coming"

If we use a phrase query and pass it to QueryParser as content:"Thanks for coming", QueryParser will (using lowercasing and stopword removal via ClassicAnalyzer) apply the phrase query content:"thanks coming" and find the document correctly.

Lucene 4.10

My understanding is that Lucene 4.10 keeps the position increments in the index as placeholders in the data. I believe this is due to a change in how StopFilter works. So if we index our "Thanks for coming" data in Lucene 4.10, it appears to be stored as:

"thanks", <gap>, "coming"

where <gap> is some kind of position increment marker (excuse my ignorance, I don't know the low level details). Now if we send the query through QueryParser, after lowercasing and stopword removal it is the same as before: content:"thanks coming". This query fails because there is a <gap> between the two terms. If we turn on positionIncrements in QueryParser, the document is returned. A phrase query with a slop of 1 also finds the document, but that is a different query to that used on Lucene 3.x.

So, knowing which are 3.x indexes and which are 4.x indexes means we can toggle positionIncrements in QueryParser, and our customers should be unaware of any changes while we start our migration. Well, that was the plan!

Lucene IndexUpgrader

If we take our old 3.x index and apply IndexUpgrader to it, we end up with a 4.10 index. There are several Lucene 4.x files created in the index directory and no errors are thrown. However, it appears that the index data is still in the 3.x format, namely it remains:

"thanks", "coming"

and not:

"thanks", <gap>, "coming"

This means that although the newly upgraded index is in theory a 4.10 index, we still have to use the 3.x QueryParser settings (positionIncrements=false) for phrase queries. Not the end of the world, but if a new document "Ivan the Terrible" is added, we end up with the index containing:

"thanks", "coming"
"ivan", <gap>, "terrible"

So now we have one record in 3.x format and one in 4.10 format, and having a hybrid index means we cannot meaningfully use phrase queries on it.
The Problem

So we need a way to write documents in 3.x format (no <gap>s) to our upgraded indexes; new indexes can use the native 4.10 format. I have tried turning off positions when writing indexes (DOCS_AND_FREQS only), but I can't see how phrase queries can work without positional information, and Lucene complains about an illegal state when querying such an index.

So, my cry for help here is: how do I write documents in 3.x format to a 3.x index upgraded to a 4.10 index so that phrase queries work? Any other suggestions welcome!

Thank you for taking the time to work through this rather lengthy explanation, and please let me know if I have not described the issue clearly. If it's just "Duh, you need to just do this..", I'd be a happy man :-)

Many thanks, Clive
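For completeness, the per-index toggle described above is a one-liner on the query side (isLegacy3xIndex stands in for whatever per-index metadata records the format):

    QueryParser qp = new QueryParser(Version.LUCENE_4_10_0, "content", new ClassicAnalyzer());
    qp.setEnablePositionIncrements(!isLegacy3xIndex); // false for 3.x-era postings, true for native 4.10
    Query q = qp.parse("\"Thanks for coming\"");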
Re: How do I write in 3.x format to an upgraded index using Lucene 4.10
Hi TX, Thank you for the detailed response, that makes a lot of sense. I feel we may have to freeze some old analyzer code, as we have indexes that were written with Lucene 2.3 analyzers and that is no longer supported. I'll need to do some experimentation to see how we go. Further reading has shown that StopFilter changed behaviour as of Lucene version 2.9. Keeping old analyzer code forever is not great, but as long as we can co-exist with newer indexes we are in good shape, as legacy indexes can be reindexed in slow time as necessary. I'll do some digging! Many thanks, Clive

From: Trejkaz To: Lucene Users Mailing List ; kiwi clive Sent: Wednesday, February 1, 2017 2:53 PM Subject: Re: How do I write in 3.x format to an upgraded index using Lucene 4.10

> If we take our old 3.x index and apply IndexUpgrader to it, we end up with a > 4.10 index. > There are several lucene 4.x files created in the index directory and no > errors are thrown. > However, it appears that the index data is still in the 3.x format, namely it > remains: > "thanks", "coming" > and not: > "thanks", <gap>, "coming"

Well, this is a different thing really. The index is in the 4.x format, but the analysis which was performed remains the 3.x analysis, because nothing was done to change the postings. So this whole thing is really just a "make sure to use the same analyser to query which you used to index" problem. So if you indexed using a Lucene 3 analyser, then you should be using the same v3 analyser when you query against the index in Lucene 4. So the usual rules apply: * Beware of Version.LATEST/LUCENE_CURRENT. Always use the exact version, and keep using it. * If Lucene removes support for some Version you were using, don't update the Version you're using. Instead, take a copy of the Tokenizer/TokenFilter you were using from the older version and port it to work on the new version. Maintain these frozen-off analysis components forever. But that said, we didn't experience any problems like this from 3 to 4, but rather obscure problems where backwards compatibility was not maintained in Lucene itself, e.g. places where despite passing in a Version object, the older behaviour was not maintained. IIRC, the term length limits being changed was one of these. And in these situations, for the most part, freezing off a copy of the old behaviour works fine. That said, we don't use the "classic" query parser, but rather the flexible one. And maybe if you're using the classic one, it might have some misbehaviour around this which we didn't strike by using the flexible one.

> So we need a way to write documents in 3.x format (no <gap>s) to our upgraded > indexes, > new indexes can use native 4.10 format.

It sounds like you just need to use the same analyser you were previously using, possibly forever... TX
Re: How do I write in 3.x format to an upgraded index using Lucene 4.10
Hi TX, This is just to close the loop. Thank you very much for your helpful suggestions. This works fine and solves our problem. Much appreciated, Clive
