TooManyClauses by wildcard queries
Hi all, I get the TooManyClauses exception for some wildcard queries like: (a) de* (b) country AND de* (c) ma?s* AND de* I'm not sure how to apply the solution proposed in the Lucene FAQ for the case of WildcardQueries like the examples above. Can you confirm that this is the right procedure? 1. Override QueryParser.getWildcardQuery() to return a ConstantScoreQuery. 2. Break up the query to identify the wildcard query part. 3. Create a custom Filter for the wildcard query. 4. Create the final query using the custom filter. If item 2 is right, can you suggest an optimal way to do that? Thank you Patricio - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
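A rough sketch of the FAQ-style recipe for pre-2.9 releases follows (it was not posted in the thread). The field name and analyzer are placeholders, and the filter reuses WildcardTermEnum, the term enumerator WildcardQuery uses internally. Because the parser calls the overridden methods for every wildcard or prefix clause it meets, step 2 (breaking the query apart by hand) is not needed: "country AND de*" is handled clause by clause.

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardTermEnum;

/** Matches every document containing at least one term that matches the wildcard pattern. */
class WildcardFilter extends Filter {
  private final Term pattern;

  WildcardFilter(Term pattern) { this.pattern = pattern; }

  // The classic bits() API works from 2.2 up to 2.9 (where it is deprecated);
  // on 2.4+ the same loop can live in getDocIdSet() with an OpenBitSet instead.
  public BitSet bits(IndexReader reader) throws IOException {
    BitSet result = new BitSet(reader.maxDoc());
    WildcardTermEnum terms = new WildcardTermEnum(reader, pattern);
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        Term term = terms.term();
        if (term == null) break;          // enumeration exhausted
        termDocs.seek(term);
        while (termDocs.next()) {
          result.set(termDocs.doc());     // a bit set, never a clause list, so no TooManyClauses
        }
      } while (terms.next());
    } finally {
      termDocs.close();
      terms.close();
    }
    return result;
  }
}

// Step 1 of the FAQ recipe: make the parser build constant-score queries
// backed by the filter above. "contents" and StandardAnalyzer are assumptions.
QueryParser parser = new QueryParser("contents", new StandardAnalyzer()) {
  protected Query getWildcardQuery(String field, String termStr) throws ParseException {
    return new ConstantScoreQuery(new WildcardFilter(new Term(field, termStr)));
  }
  protected Query getPrefixQuery(String field, String termStr) throws ParseException {
    // the parser routes a pure trailing-* pattern such as de* here, not to getWildcardQuery()
    return new ConstantScoreQuery(new WildcardFilter(new Term(field, termStr + "*")));
  }
};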
Re: IndexReader.isCurrent for cached indexes
isCurrent() will only return false if there have been committed changes to the index. Maybe for some reason your index update job hasn't committed or closed the index. Probably not relevant to this problem, but your reopen code snippet doesn't close the old reader. It should. See the javadocs. What version of lucene are you running? -- Ian. On Wed, Sep 9, 2009 at 10:33 PM, Nick Bailey wrote: > Looking for some help figuring out a problem with the IndexReader.isCurrent() > method and cached indexes. > > We have a number of lucene indexes that we attempt to keep in memory after an > initial query is performed. In order to prevent the indexes from becoming > stale, we check for changes about every minute by calling isCurrent(). If > the index has changed, we will then reopen it. > > From our logs it appears that in some cases isCurrent() will return true even > though the index has changed since the last time the reader was opened. > > The code to refresh the index is basically this: > > // Checked every minute > if(!reader.isCurrent()){ > // reopen the existing reader > reader = this.searcher.getIndexReader(); > reader = reader.reopen(); > } > > This is an example of the problem from the logs: > > 2009-08-29 17:50:51,387 Indexed 0 documents and deleted 1 documents from > index 'example' in 0 ms > 2009-08-30 03:11:58,410 Indexed 0 documents and deleted 5 documents from > index 'example' in 0 ms > 2009-08-30 16:30:03,466 Using cached reader lastRefresh=81415526> > // numbers indicate milliseconds since opened or refreshed aka age = 24.6hrs, > lastRefresh = 22.6hrs > > The logs indicate we deleted documents from the index at about 5:50 on August > 29th, and then again on the 30th at 3:11. Then at 4:30 we attempted to > query the index. We found the cached reader and used it, however, the last > time the cache was refreshed was about 22 hours previously, coinciding with > the first delete. The index should have been reopened after the second > delete. > > I have checked, and the code to refresh the indexes is definitely being run > every 60 seconds. All I can see is that the problem might be with the > isCurrent() method. > > Could it be due to holding the reader open for so long? Any other ideas? > > Thanks a lot, > Nick Bailey > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
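A sketch of the reopen idiom the IndexReader.reopen() javadocs describe, with the old reader closed once a new instance is returned. The class, field, and method names are placeholders, not code from the thread.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

class CachedIndex {
  private IndexSearcher searcher;        // hypothetical cached searcher

  // Called from the once-a-minute refresh job.
  synchronized void maybeRefresh() throws IOException {
    IndexReader reader = searcher.getIndexReader();
    if (!reader.isCurrent()) {
      IndexReader newReader = reader.reopen();
      if (newReader != reader) {         // reopen() may hand back the same instance
        reader.close();                  // close the old reader, as the javadocs require
        searcher = new IndexSearcher(newReader);
      }
    }
  }
}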
Re: How to avoid huge index files
First, you need to limit the size of segments initially created by IndexWriter due to newly added documents. Probably the simplest way is to call IndexWriter.commit() frequently enough. You might want to use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently consumed by IndexWriter's buffer to determine when to commit. But it won't be an exact science, ie, the segment size will be different from the RAM buffer size. So, experiment w/ it... Second, you need to prevent merging from creating a segment that's too large. For this I would use the setMaxMergeMB method of the LogByteSizeMergePolicy (which is IndexWriter's default merge policy). But note that this max size applies to the *input* segments, so you'd roughly want that to be 1.0 MB (your 10.0 MB divided by the merge factor = 10), but probably make it smaller to be sure things stay small enough. Note that with this approach, if your index is large enough, you'll wind up with many segments and search performance will suffer when compared to an index that doesn't have this max 10.0 MB file size restriction. Mike On Thu, Sep 10, 2009 at 2:32 AM, Dvora wrote: > > Hello again, > > Can someone please comment on that, whether what I'm looking is possible or > not? > > > Dvora wrote: >> >> Hello, >> >> I'm using Lucene2.4. I'm developing a web application that using Lucene >> (via compass) to do the searches. >> I'm intending to deploy the application in Google App Engine >> (http://code.google.com/appengine/), which limits files length to be >> smaller than 10MB. I've read about the various policies supported by >> Lucene to limit the file sizes, but on matter which policy I used and >> which parameters, the index files still grew to be lot more the 10MB. >> Looking at the code, I've managed to limit the cfs files (predicting the >> file size in CompoundFileWriter before closing the file) - I guess that >> will degrade performance, but it's OK for now. But now the FDT files are >> becoming huge (about 60MB) and I cant identifiy a way to limit those >> files. >> >> Is there some built-in and correct way to limit these files length? If no, >> can someone direct me please how should I tweak the source code to achieve >> that? >> >> Thanks for any help. >> > > -- > View this message in context: > http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25378056.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
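A sketch of how these two suggestions can be wired together against the Lucene 2.4 API the poster mentions. The index path, analyzer, and both size thresholds are placeholders and, as Mike says, need experimentation.

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class SmallSegmentIndexer {

  // Rough RAM threshold at which to commit; the on-disk segment will end up a
  // different size than the in-memory buffer, so tune this experimentally.
  private static final long COMMIT_AT_BYTES = 2 * 1024 * 1024;

  IndexWriter openWriter(String path) throws IOException {
    Directory dir = FSDirectory.getDirectory(path);   // hypothetical index path
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                                         IndexWriter.MaxFieldLength.UNLIMITED);

    // Keep merges from producing big segments: with the default mergeFactor of 10,
    // input segments capped below ~1 MB merge into outputs of roughly 10 MB or less.
    LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
    policy.setMaxMergeMB(0.8);
    writer.setMergePolicy(policy);
    return writer;
  }

  void add(IndexWriter writer, Iterable<Document> docs) throws IOException {
    for (Document doc : docs) {
      writer.addDocument(doc);
      if (writer.ramSizeInBytes() > COMMIT_AT_BYTES) {
        writer.commit();                 // flush a new (small) segment to disk
      }
    }
    writer.commit();
  }
}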
RE: TooManyClauses by wildcard queries
Or use Lucene 2.9, it automatically uses constant score mode in wild card queries, if needed. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Patricio Galeas [mailto:gal...@prometa.de] > Sent: Thursday, September 10, 2009 10:41 AM > To: java-user@lucene.apache.org > Subject: TooManyClauses by wildcard queries > > Hi all, > > I get the TooManyClauses exception by some wildcard queries like : > (a) de* > (b) country AND de* > (c) ma?s* AND de* > > I'm not sure how to apply the solution proposed in LuceneFAQ for the > case of WildcardQueries like the examples above. > > Can you confirm if it is the right procedure? > > 1. Override QueryParser.getWildcardQuery() to return a ConstantScoreQuery. > 2. Break up the query to identify the wildcard query part. > 3. Create a custom Filter for the wildcard query > 4. Create the final query using the custom filter. > > If the item 2. is right, can you suggest me an optimal way to do that? > > Thank you > Patricio > > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
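For completeness, a minimal 2.9-style sketch (the field name and pattern are placeholders): in 2.9 WildcardQuery extends MultiTermQuery, so the rewrite mode can also be set per query if the automatic behaviour isn't wanted.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

class WildcardQueries {
  Query constantScoreWildcard(String field, String pattern) {
    WildcardQuery q = new WildcardQuery(new Term(field, pattern));
    // 2.9 already defaults to CONSTANT_SCORE_AUTO_REWRITE_DEFAULT, which falls back
    // to a filter when the expanded term count grows; this forces the filter rewrite:
    q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
    return q;
  }
}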
RE: New "Stream closed" exception with Java 6
Hi Hoss, I have been thinking more about what you said (below) - could you please expand on the indented part of this sentence: "it's possible you just have a simple bug where you are closing the reader before you pass it to Lucene, or maybe you are mistakenly adding the same field twice (or in two different documents)" Are you saying that if I were attempting to delete a doc and then add it again (e.g. update), but for some reason the delete didn't work, I would get a "Stream closed" exception? Thanks - Chris - Original Message - From: Chris Hostetter Sent: Tue, 8/9/2009 7:57pm To: java-user@lucene.apache.org Subject: RE: New "Stream closed" exception with Java 6 : I'm coming to the same conclusion - there must be >1 threads accessing this index at the same time. Better go figure it out ... :-) careful about your assumptions ... you could get this same type of exception even with only one thread, the stream that's being closed isn't internal to Lucene, it's the InputStreamReader you supplied as the value of some Field. it's possible you just have a simple bug where you are closing the reader before you pass it to Lucene, or maybe you are mistakenly adding the same field twice (or in two different documents) -Hoss - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
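A minimal sketch of the two mistakes Hoss describes, using a hypothetical field name and file path; the deletes Chris mentions are not involved.

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

class ReaderFields {
  // Broken pattern 1: closing the Reader before addDocument(). The field only holds
  // a reference to the stream; Lucene reads it during addDocument(), so closing it
  // first produces "Stream closed" at indexing time.
  // Broken pattern 2: handing the *same* Reader instance to two fields or two
  // documents. The first consumer exhausts and closes it; the second one fails.
  void addFile(IndexWriter writer, String path) throws IOException {
    Reader body = new FileReader(path);
    Document doc = new Document();
    doc.add(new Field("body", body));    // Reader-valued field: tokenized, not stored
    writer.addDocument(doc);             // Lucene consumes the Reader here
    body.close();                        // safe to close only after addDocument()
  }
}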
Re: How to avoid huge index files
Hi, Thanks a lot for that, will peforms the experiments and publish the results. I'm aware to the risk of peformance degredation, but for the pilot I'm trying to run I think it's acceptable. Thanks again! Michael McCandless-2 wrote: > > First, you need to limit the size of segments initially created by > IndexWriter due to newly added documents. Probably the simplest way > is to call IndexWriter.commit() frequently enough. You might want to > use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently > consumed by IndexWriter's buffer to determine when to commit. But it > won't be an exact science, ie, the segment size will be different from > the RAM buffer size. So, experiment w/ it... > > Second, you need to prevent merging from creating a segment that's too > large. For this I would use the setMaxMergeMB method of the > LogByteSizeMergePolicy (which is IndexWriter's default merge policy). > But note that this max size applies to the *input* segments, so you'd > roughly want that to be 1.0 MB (your 10.0 MB divided by the merge > factor = 10), but probably make it smaller to be sure things stay > small enough. > > Note that with this approach, if your index is large enough, you'll > wind up with many segments and search performance will suffer when > compared to an index that doesn't have this max 10.0 MB file size > restriction. > > Mike > > On Thu, Sep 10, 2009 at 2:32 AM, Dvora wrote: >> >> Hello again, >> >> Can someone please comment on that, whether what I'm looking is possible >> or >> not? >> >> >> Dvora wrote: >>> >>> Hello, >>> >>> I'm using Lucene2.4. I'm developing a web application that using Lucene >>> (via compass) to do the searches. >>> I'm intending to deploy the application in Google App Engine >>> (http://code.google.com/appengine/), which limits files length to be >>> smaller than 10MB. I've read about the various policies supported by >>> Lucene to limit the file sizes, but on matter which policy I used and >>> which parameters, the index files still grew to be lot more the 10MB. >>> Looking at the code, I've managed to limit the cfs files (predicting the >>> file size in CompoundFileWriter before closing the file) - I guess that >>> will degrade performance, but it's OK for now. But now the FDT files are >>> becoming huge (about 60MB) and I cant identifiy a way to limit those >>> files. >>> >>> Is there some built-in and correct way to limit these files length? If >>> no, >>> can someone direct me please how should I tweak the source code to >>> achieve >>> that? >>> >>> Thanks for any help. >>> >> >> -- >> View this message in context: >> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25378056.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > -- View this message in context: http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25380052.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to avoid huge index files
You're welcome! Another, bottoms-up option would be to make a custom Directory impl that simply splits up files above a certain size. That'd be more generic and more reliable... Mike On Thu, Sep 10, 2009 at 5:26 AM, Dvora wrote: > > Hi, > > Thanks a lot for that, will peforms the experiments and publish the results. > I'm aware to the risk of peformance degredation, but for the pilot I'm > trying to run I think it's acceptable. > > Thanks again! > > > > Michael McCandless-2 wrote: >> >> First, you need to limit the size of segments initially created by >> IndexWriter due to newly added documents. Probably the simplest way >> is to call IndexWriter.commit() frequently enough. You might want to >> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently >> consumed by IndexWriter's buffer to determine when to commit. But it >> won't be an exact science, ie, the segment size will be different from >> the RAM buffer size. So, experiment w/ it... >> >> Second, you need to prevent merging from creating a segment that's too >> large. For this I would use the setMaxMergeMB method of the >> LogByteSizeMergePolicy (which is IndexWriter's default merge policy). >> But note that this max size applies to the *input* segments, so you'd >> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge >> factor = 10), but probably make it smaller to be sure things stay >> small enough. >> >> Note that with this approach, if your index is large enough, you'll >> wind up with many segments and search performance will suffer when >> compared to an index that doesn't have this max 10.0 MB file size >> restriction. >> >> Mike >> >> On Thu, Sep 10, 2009 at 2:32 AM, Dvora wrote: >>> >>> Hello again, >>> >>> Can someone please comment on that, whether what I'm looking is possible >>> or >>> not? >>> >>> >>> Dvora wrote: Hello, I'm using Lucene2.4. I'm developing a web application that using Lucene (via compass) to do the searches. I'm intending to deploy the application in Google App Engine (http://code.google.com/appengine/), which limits files length to be smaller than 10MB. I've read about the various policies supported by Lucene to limit the file sizes, but on matter which policy I used and which parameters, the index files still grew to be lot more the 10MB. Looking at the code, I've managed to limit the cfs files (predicting the file size in CompoundFileWriter before closing the file) - I guess that will degrade performance, but it's OK for now. But now the FDT files are becoming huge (about 60MB) and I cant identifiy a way to limit those files. Is there some built-in and correct way to limit these files length? If no, can someone direct me please how should I tweak the source code to achieve that? Thanks for any help. >>> >>> -- >>> View this message in context: >>> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25378056.html >>> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >>> >>> >>> - >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> > > -- > View this message in context: > http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25380052.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. 
> > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Problem in lucene query
Hello I am new to Lucene and facing a problem while performing searches. I am using lucene 2.2.0. My application indexes documents on "keyword" field which contains integer values. If the value is negative the query does not return correct results. Following is my lucene query: (keyword: \-1) I also tried: (keyword: "-1") But none of them returns correct results. It seems that Lucene ignores '-'. My purpose is to search documents with index value "-1". Any ideas?? Thanks
Re: support for PayloadTermQuery in MoreLikeThis
On Sep 9, 2009, at 4:39 PM, Bill Au wrote: Has anyone done anything regarding the support of PayloadTermQuery in MoreLikeThis? Not yet! Sounds interesting I took a quick look at the code and it seems to be simply a matter of swapping TermQuery with PayloadTermQuery. I guess a generic solution would be to add a enable method to enable PayloadTermQuery, keeping TermQuery as the default for backwards compatibility. The call signature of the same enable method would also include the PayloadFunction to use for the PayloadTermQuery. Any comments/thoughts? Hmm, this could work, but I think we should try to be generic if we can and have it be overridable. Today's PTQ is likely to segue to tomorrow's AttributeTermQuery and I wouldn't want to preclude them. -Grant - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
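A hedged sketch of the kind of factory hook Bill describes, assuming the 2.9 org.apache.lucene.search.payloads API; the class, method, and flag names are invented and nothing like this exists in MoreLikeThis yet.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

class LikeThisQueries {
  // Hypothetical factory: MoreLikeThis would call something like this wherever
  // it currently does "new TermQuery(term)".
  Query newTermQuery(Term term, boolean usePayloads, PayloadFunction function) {
    if (usePayloads) {
      if (function == null) {
        function = new AveragePayloadFunction();   // caller-supplied scorer, with a default
      }
      return new PayloadTermQuery(term, function);
    }
    return new TermQuery(term);                     // current behaviour, kept as the default
  }
}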
Re: How to avoid huge index files
Hi again, Can you add some details and guidelines how to implement that? Different files types have different structure, is such spliting doable without knowing Lucene internals? Michael McCandless-2 wrote: > > You're welcome! > > Another, bottoms-up option would be to make a custom Directory impl > that simply splits up files above a certain size. That'd be more > generic and more reliable... > > Mike > > On Thu, Sep 10, 2009 at 5:26 AM, Dvora wrote: >> >> Hi, >> >> Thanks a lot for that, will peforms the experiments and publish the >> results. >> I'm aware to the risk of peformance degredation, but for the pilot I'm >> trying to run I think it's acceptable. >> >> Thanks again! >> >> >> >> Michael McCandless-2 wrote: >>> >>> First, you need to limit the size of segments initially created by >>> IndexWriter due to newly added documents. Probably the simplest way >>> is to call IndexWriter.commit() frequently enough. You might want to >>> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently >>> consumed by IndexWriter's buffer to determine when to commit. But it >>> won't be an exact science, ie, the segment size will be different from >>> the RAM buffer size. So, experiment w/ it... >>> >>> Second, you need to prevent merging from creating a segment that's too >>> large. For this I would use the setMaxMergeMB method of the >>> LogByteSizeMergePolicy (which is IndexWriter's default merge policy). >>> But note that this max size applies to the *input* segments, so you'd >>> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge >>> factor = 10), but probably make it smaller to be sure things stay >>> small enough. >>> >>> Note that with this approach, if your index is large enough, you'll >>> wind up with many segments and search performance will suffer when >>> compared to an index that doesn't have this max 10.0 MB file size >>> restriction. >>> >>> Mike >>> >>> On Thu, Sep 10, 2009 at 2:32 AM, Dvora wrote: Hello again, Can someone please comment on that, whether what I'm looking is possible or not? Dvora wrote: > > Hello, > > I'm using Lucene2.4. I'm developing a web application that using > Lucene > (via compass) to do the searches. > I'm intending to deploy the application in Google App Engine > (http://code.google.com/appengine/), which limits files length to be > smaller than 10MB. I've read about the various policies supported by > Lucene to limit the file sizes, but on matter which policy I used and > which parameters, the index files still grew to be lot more the 10MB. > Looking at the code, I've managed to limit the cfs files (predicting > the > file size in CompoundFileWriter before closing the file) - I guess > that > will degrade performance, but it's OK for now. But now the FDT files > are > becoming huge (about 60MB) and I cant identifiy a way to limit those > files. > > Is there some built-in and correct way to limit these files length? If > no, > can someone direct me please how should I tweak the source code to > achieve > that? > > Thanks for any help. > -- View this message in context: http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25378056.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. 
- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> - >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >>> >> >> -- >> View this message in context: >> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25380052.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > -- View this message in context: http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25381489.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: How to avoid huge index files
The idea is just to put a layer on top of the abstract file system function supplied by directory. Whenever somebody wants to create a file and write data to it, the methods create more than one file and switch e.g. after 10 Megabytes to another file. E.g. look into MMapDirectory that uses MMap to map files into address space. Because MappedByteBuffer only supports 32 bit offsets, there will be created different mappings for the same file (the file is splitted up into parts of 2 Gigabytes). You could use similar code here and just use another file, if somebody seeks or writes above the 10 MiB limit. Just "virtualize" the files. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > From: Dvora [mailto:barak.ya...@gmail.com] > Sent: Thursday, September 10, 2009 1:23 PM > To: java-user@lucene.apache.org > Subject: Re: How to avoid huge index files > > > Hi again, > > Can you add some details and guidelines how to implement that? Different > files types have different structure, is such spliting doable without > knowing Lucene internals? > > > Michael McCandless-2 wrote: > > > > You're welcome! > > > > Another, bottoms-up option would be to make a custom Directory impl > > that simply splits up files above a certain size. That'd be more > > generic and more reliable... > > > > Mike > > > > On Thu, Sep 10, 2009 at 5:26 AM, Dvora wrote: > >> > >> Hi, > >> > >> Thanks a lot for that, will peforms the experiments and publish the > >> results. > >> I'm aware to the risk of peformance degredation, but for the pilot I'm > >> trying to run I think it's acceptable. > >> > >> Thanks again! > >> > >> > >> > >> Michael McCandless-2 wrote: > >>> > >>> First, you need to limit the size of segments initially created by > >>> IndexWriter due to newly added documents. Probably the simplest way > >>> is to call IndexWriter.commit() frequently enough. You might want to > >>> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently > >>> consumed by IndexWriter's buffer to determine when to commit. But it > >>> won't be an exact science, ie, the segment size will be different from > >>> the RAM buffer size. So, experiment w/ it... > >>> > >>> Second, you need to prevent merging from creating a segment that's too > >>> large. For this I would use the setMaxMergeMB method of the > >>> LogByteSizeMergePolicy (which is IndexWriter's default merge policy). > >>> But note that this max size applies to the *input* segments, so you'd > >>> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge > >>> factor = 10), but probably make it smaller to be sure things stay > >>> small enough. > >>> > >>> Note that with this approach, if your index is large enough, you'll > >>> wind up with many segments and search performance will suffer when > >>> compared to an index that doesn't have this max 10.0 MB file size > >>> restriction. > >>> > >>> Mike > >>> > >>> On Thu, Sep 10, 2009 at 2:32 AM, Dvora wrote: > > Hello again, > > Can someone please comment on that, whether what I'm looking is > possible > or > not? > > > Dvora wrote: > > > > Hello, > > > > I'm using Lucene2.4. I'm developing a web application that using > > Lucene > > (via compass) to do the searches. > > I'm intending to deploy the application in Google App Engine > > (http://code.google.com/appengine/), which limits files length to be > > smaller than 10MB. 
I've read about the various policies supported by > > Lucene to limit the file sizes, but on matter which policy I used > and > > which parameters, the index files still grew to be lot more the > 10MB. > > Looking at the code, I've managed to limit the cfs files (predicting > > the > > file size in CompoundFileWriter before closing the file) - I guess > > that > > will degrade performance, but it's OK for now. But now the FDT files > > are > > becoming huge (about 60MB) and I cant identifiy a way to limit those > > files. > > > > Is there some built-in and correct way to limit these files length? > If > > no, > > can someone direct me please how should I tweak the source code to > > achieve > > that? > > > > Thanks for any help. > > > > -- > View this message in context: > http://www.nabble.com/How-to-avoid-huge-index-files- > tp25347505p25378056.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > >>> > >>> - > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >>> For additional commands, e-mail: java
September 2009 Hadoop/Lucene/Solr/UIMA/katta/Mahout Get Together Berlin
Hi, I cross-post this here, Isabel Drost is managing the meetup. This time it is more about Hadoop, but there is also a talk about the new Lucene 2.9 release (presented by me). As far as I know, Simon Willnauer will also be there: --- I would like to announce the September-2009 Hadoop Get Together in newthinking store Berlin. When: 29. September 2009 at 5:00pm Where: newthinking store, Tucholskystr. 48, Berlin, Germany As always there will be slots of 20min each for talks on your Hadoop topic. After each talk there will be a lot time to discuss. You can order drinks directly at the bar in the newthinking store. If you like, you can order pizza. There are quite a few good restaurants nearby, so we can go there after the official part. Talks scheduled so far: Thorsten Schuett, Solving Puzzles with MapReduce: MapReduce is most often used for data mining and filtering large datasets. In this talk we will show that it also useful for a completely different problem domain: solving puzzles. Based on MapReduce, we can implement massively parallel breadth-first and heuristic search. MapReduce will take care of the hard problems, like parallelization, disk and error handling, while we can concentrate on the puzzle. Throughout the talk we will use the sliding puzzle (http://en.wikipedia.org/wiki/Sliding_puzzle) as our example. Thilo Götz, Text analytics on jaql: Jaql (JSON query language) is a query language for Javascript Object Notation that runs on top of Apache Hadoop. It was primarily designed for large scale analysis of semi-structured data. I will give an introduction to jaql and describe our experiences using it for text analytics tasks. Jaql is open source and available from http://code.google.com/p/jaql. Uwe Schindler, Lucene 2.9 Developments: Numeric Search, Per-Segment- and Near-Real-Time Search, new TokenStream API: Uwe Schindler presents some new additions to Lucene 2.9. In the first half he will talk about fast numerical and date range queries (NumericRangeQuery, formerly TrieRangeQuery) and their usage in geospatial search applications like the Publishing Network for Geoscientific & Environmental Data (PANGAEA). In the second half of his talk, Uwe will highlight various improvements to the internal search implementation for near-real-time search. Finally, he will present the new TokenStream API, based on AttributeSource/Attributes that make indexing more pluggable. Future developments in the Flexible Indexing Area will make use of it. Uwe will show a Tokenizer that uses custom attributes to index XML files into various document fields based on XML element names as a possible use-case. We would like to invite you, the visitor to also tell your Hadoop story, if you like, you can bring slides - there will be a beamer. A big Thanks goes to the newthinking store for providing a room in the center of Berlin for us. See the Upcoming page: http://upcoming.yahoo.com/event/4314020/ - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: September 2009 Hadoop/Lucene/Solr/UIMA/katta/Mahout Get Together Berlin
Hi again, By the way, if somebody of the other involved developers want to provide me some PPT Slides about the other new features in Lucene 2.9 (NRT, future Flexible Indexing), I would be happy! Uwe > Uwe Schindler, Lucene 2.9 Developments: Numeric Search, Per-Segment- and > Near-Real-Time Search, new TokenStream API: Uwe Schindler presents some > new > additions to Lucene 2.9. In the first half he will talk about fast > numerical > and date range queries (NumericRangeQuery, formerly TrieRangeQuery) and > their usage in geospatial search applications like the Publishing Network > for Geoscientific & Environmental Data (PANGAEA). In the second half of > his > talk, Uwe will highlight various improvements to the internal search > implementation for near-real-time search. Finally, he will present the new > TokenStream API, based on AttributeSource/Attributes that make indexing > more > pluggable. Future > developments in the Flexible Indexing Area will make use of it. Uwe will > show a Tokenizer that uses custom attributes to index XML files into > various > document fields based on XML element names as a possible use-case. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: TooManyClauses by wildcard queries
Hi Uwe But if I don't use Lucene 2.9, is this procedure (items 1-4) the right way to avoid the TooManyClauses exception? or is there a more efficients procedure to do that? Thanks Patricio Uwe Schindler schrieb: Or use Lucene 2.9, it automatically uses constant score mode in wild card queries, if needed. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Patricio Galeas [mailto:gal...@prometa.de] Sent: Thursday, September 10, 2009 10:41 AM To: java-user@lucene.apache.org Subject: TooManyClauses by wildcard queries Hi all, I get the TooManyClauses exception by some wildcard queries like : (a) de* (b) country AND de* (c) ma?s* AND de* I'm not sure how to apply the solution proposed in LuceneFAQ for the case of WildcardQueries like the examples above. Can you confirm if it is the right procedure? 1. Override QueryParser.getWildcardQuery() to return a ConstantScoreQuery. 2. Break up the query to identify the wildcard query part. 3. Create a custom Filter for the wildcard query 4. Create the final query using the custom filter. If the item 2. is right, can you suggest me an optimal way to do that? Thank you Patricio - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- P a t r i c i o G a l e a s ProMeta Team -- I n f o t r a X G m b H Fon +49 (0)271 30 30 888 Fax +49 (0)271 74124-77 Mob +49 (0)177 2962611 Adresse: Friedrichstraße 81 D-57072 Siegen Geschäftsführerin Dipl.-Wi.-Inf. Stephanie Sarach Handelsregister HRB 8877 Amtsgericht Siegen http://www.prometa.de http://www.infotrax.de -- - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: How to avoid huge index files
Me again :-) I'm looking at the code of FSDirectory and MMapDirectory, and found that its somewhat difficult for to understand how should subclass the FSDirectory and adjust it to my needs. If I understand correct, MMapDirectory overrides the openInput() method and returns MultiMMapIndexInput if the file size exceeds the threshold. What I'm not understand is that how the new impl should keep track on the generated files (or shouldn't it?..) so when searhcing Lucene will know in which file to search - I'm confused :-) Can I bother you so you supply some kind of psuedo code illustrating how the implementation should look like? Thanks again for your huge help! Uwe Schindler wrote: > > The idea is just to put a layer on top of the abstract file system > function > supplied by directory. Whenever somebody wants to create a file and write > data to it, the methods create more than one file and switch e.g. after 10 > Megabytes to another file. E.g. look into MMapDirectory that uses MMap to > map files into address space. Because MappedByteBuffer only supports 32 > bit > offsets, there will be created different mappings for the same file (the > file is splitted up into parts of 2 Gigabytes). You could use similar code > here and just use another file, if somebody seeks or writes above the 10 > MiB > limit. Just "virtualize" the files. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > >> From: Dvora [mailto:barak.ya...@gmail.com] >> Sent: Thursday, September 10, 2009 1:23 PM >> To: java-user@lucene.apache.org >> Subject: Re: How to avoid huge index files >> >> >> Hi again, >> >> Can you add some details and guidelines how to implement that? Different >> files types have different structure, is such spliting doable without >> knowing Lucene internals? >> >> >> Michael McCandless-2 wrote: >> > >> > You're welcome! >> > >> > Another, bottoms-up option would be to make a custom Directory impl >> > that simply splits up files above a certain size. That'd be more >> > generic and more reliable... >> > >> > Mike >> > >> > On Thu, Sep 10, 2009 at 5:26 AM, Dvora wrote: >> >> >> >> Hi, >> >> >> >> Thanks a lot for that, will peforms the experiments and publish the >> >> results. >> >> I'm aware to the risk of peformance degredation, but for the pilot I'm >> >> trying to run I think it's acceptable. >> >> >> >> Thanks again! >> >> >> >> >> >> >> >> Michael McCandless-2 wrote: >> >>> >> >>> First, you need to limit the size of segments initially created by >> >>> IndexWriter due to newly added documents. Probably the simplest way >> >>> is to call IndexWriter.commit() frequently enough. You might want to >> >>> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently >> >>> consumed by IndexWriter's buffer to determine when to commit. But it >> >>> won't be an exact science, ie, the segment size will be different >> from >> >>> the RAM buffer size. So, experiment w/ it... >> >>> >> >>> Second, you need to prevent merging from creating a segment that's >> too >> >>> large. For this I would use the setMaxMergeMB method of the >> >>> LogByteSizeMergePolicy (which is IndexWriter's default merge policy). >> >>> But note that this max size applies to the *input* segments, so you'd >> >>> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge >> >>> factor = 10), but probably make it smaller to be sure things stay >> >>> small enough. 
>> >>> >> >>> Note that with this approach, if your index is large enough, you'll >> >>> wind up with many segments and search performance will suffer when >> >>> compared to an index that doesn't have this max 10.0 MB file size >> >>> restriction. >> >>> >> >>> Mike >> >>> >> >>> On Thu, Sep 10, 2009 at 2:32 AM, Dvora wrote: >> >> Hello again, >> >> Can someone please comment on that, whether what I'm looking is >> possible >> or >> not? >> >> >> Dvora wrote: >> > >> > Hello, >> > >> > I'm using Lucene2.4. I'm developing a web application that using >> > Lucene >> > (via compass) to do the searches. >> > I'm intending to deploy the application in Google App Engine >> > (http://code.google.com/appengine/), which limits files length to >> be >> > smaller than 10MB. I've read about the various policies supported >> by >> > Lucene to limit the file sizes, but on matter which policy I used >> and >> > which parameters, the index files still grew to be lot more the >> 10MB. >> > Looking at the code, I've managed to limit the cfs files >> (predicting >> > the >> > file size in CompoundFileWriter before closing the file) - I guess >> > that >> > will degrade performance, but it's OK for now. But now the FDT >> files >> > are >> > becoming huge (about 60MB) and I cant identifiy a way to limit >> those >> > files. >> > >> > Is there some built-in and correct way to limit
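A rough, untested sketch of the write side of the "virtualized files" idea (all class names and the ".N" file-suffix convention are invented here): one logical file is spread over physical parts of at most maxSize bytes. A matching IndexInput, plus a Directory wrapper that hides the suffixes from list(), fileLength(), and deleteFile(), would still be needed, and that is where most of the real work lies.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexOutput;

class SplitIndexOutput extends IndexOutput {

  private final Directory dir;
  private final String name;
  private final long maxSize;
  private final List<IndexOutput> parts = new ArrayList<IndexOutput>();
  private int current;                      // index of the part currently being written

  SplitIndexOutput(Directory dir, String name, long maxSize) throws IOException {
    this.dir = dir;
    this.name = name;
    this.maxSize = maxSize;
    parts.add(dir.createOutput(name + ".0"));
  }

  // Logical offset L lives in part L / maxSize at offset L % maxSize.
  private IndexOutput part(int i) throws IOException {
    while (i >= parts.size()) {             // lazily create later parts
      parts.add(dir.createOutput(name + "." + parts.size()));
    }
    return parts.get(i);
  }

  public void writeByte(byte b) throws IOException {
    IndexOutput out = parts.get(current);
    if (out.getFilePointer() >= maxSize) {  // current part is full: roll over
      current++;
      out = part(current);
      out.seek(0);
    }
    out.writeByte(b);
  }

  public void writeBytes(byte[] b, int offset, int length) throws IOException {
    for (int i = 0; i < length; i++) {      // byte-at-a-time keeps the rollover logic simple
      writeByte(b[offset + i]);
    }
  }

  public long getFilePointer() {
    return current * maxSize + parts.get(current).getFilePointer();
  }

  public void seek(long pos) throws IOException {
    current = (int) (pos / maxSize);
    part(current).seek(pos % maxSize);
  }

  public long length() throws IOException {
    long total = 0;
    for (int i = 0; i < parts.size(); i++) {
      total += parts.get(i).length();
    }
    return total;
  }

  public void flush() throws IOException {
    for (int i = 0; i < parts.size(); i++) {
      parts.get(i).flush();
    }
  }

  public void close() throws IOException {
    for (int i = 0; i < parts.size(); i++) {
      parts.get(i).close();
    }
  }
}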
Re: Problem in lucene query
> I am new to Lucene and facing a problem while performing > searches. I am using lucene 2.2.0. > > My application indexes documents on "keyword" field which > contains integer values. Which analyzer/tokenizer are you using on that field? I am assuming it is a tokenized field. >If the value is negative the query does not return > correct results. Is it returning 1's as well as -1's? '-' is a special character, so you have to escape it when querying. So keyword:\-1 is correct. But the problem is StandardTokenizer tokenizes -1 to 1. If you use it, all -1's and 1's are treated the same. Use WhitespaceAnalyzer instead. Hope this helps. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
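A sketch of both options against the 2.2-era API the original post mentions. The field name and values follow that post; whether the field should also be stored is an assumption.

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class NegativeKeywords {

  // Index side: keep the value as a single untouched token. (Lucene 2.2 calls
  // this Field.Index.UN_TOKENIZED; later releases renamed it NOT_ANALYZED.)
  Document buildDoc(String value) {
    Document doc = new Document();
    doc.add(new Field("keyword", value, Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }

  // Search side, option 1: a TermQuery bypasses the query parser, so '-' needs
  // no escaping and no analyzer can strip it.
  Query exactMatch(String value) {
    return new TermQuery(new Term("keyword", value));
  }

  // Search side, option 2: if the query parser must be used, pair
  // WhitespaceAnalyzer with the escaped form.
  Query parsedMatch() throws ParseException {
    QueryParser parser = new QueryParser("keyword", new WhitespaceAnalyzer());
    return parser.parse("keyword:\\-1");
  }
}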
Re: Problem in lucene query
Hi Vibhuti, Not directly related to your query, but I'd advise you to upgrade to a more recent Lucene release, something like 2.4.1 or at least 2.3.1 [considering it's already time for 2.9]. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw On Thu, Sep 10, 2009 at 4:17 PM, vibhuti wrote: > Hello > > > > I am new to Lucene and facing a problem while performing searches. I am > using lucene 2.2.0. > > My application indexes documents on "keyword" field which contains integer > values. If the value is negative the query does not return correct results. > > > > Following is my lucene query: > > > > (keyword: \-1) > > > > I also tried: > > (keyword: "-1") > > > > > > But none of them returns correct results. It seems that Lucene ignores '-'. > My purpose is to search documents with index value "-1". > > > > Any ideas?? > > > > Thanks > >
MultiSearcherThread.hits(ParallelMultiSearcher.java:280) nullPointerException
Hello everyone. I have a problem with MultiSearcherThread.hits() in ParallelMultiSearcher.java. Sometimes when I search via ParallelMultiSearcher, the method MultiSearcherThread.hits() throws a NullPointerException. This is because docs has somehow become null, but why is this field null? I've checked the Lucene code: this field never becomes null except in ParallelMultiSearcher, when Lucene wants to aggregate all results (in line 79), if the call to msta[i].join() throws InterruptedException; then ioe will be null, and because msta[i] hasn't finished its work yet, docs will be null. Is this right? Or is it possible that msta[i] gets interrupted in this part of the code? The exception is: java.lang.NullPointerException at org.apache.lucene.search.MultiSearcherThread.hits(ParallelMultiSearcher.java:280) at org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:83) Best Regards
Re: Problem in lucene query
Also, get a copy of Luke and examine your index, that'll tell you what is actually in there *and* it will let you see how queries parse under various analyzers. Best Erick On Thu, Sep 10, 2009 at 6:47 AM, vibhuti wrote: > Hello > > > > I am new to Lucene and facing a problem while performing searches. I am > using lucene 2.2.0. > > My application indexes documents on "keyword" field which contains integer > values. If the value is negative the query does not return correct results. > > > > Following is my lucene query: > > > > (keyword: \-1) > > > > I also tried: > > (keyword: "-1") > > > > > > But none of them returns correct results. It seems that Lucene ignores '-'. > My purpose is to search documents with index value "-1". > > > > Any ideas?? > > > > Thanks > >
Re: IndexReader.isCurrent for cached indexes
Our commit code will close the IndexWriter after adding the documents and before we see the log message indicating the documents have been added and deleted, so I don't believe that is the problem. Thanks for the tip about reopen. I actually noticed that when researching this problem but didn't think it was related. We are running 2.4.1 -Original Message- From: "Ian Lea" Sent: Thursday, September 10, 2009 5:05am To: java-user@lucene.apache.org Subject: Re: IndexReader.isCurrent for cached indexes isCurrent() will only return true if there have been committed changes to the index. Maybe for some reason your index update job hasn't committed or closed the index. Probably not relevant to this problem, but your reopen code snippet doesn't close the old reader. It should. See the javadocs. What version of lucene are you running? -- Ian. On Wed, Sep 9, 2009 at 10:33 PM, Nick Bailey wrote: > Looking for some help figuring out a problem with the IndexReader.isCurrent() > method and cached indexes. > > We have a number of lucene indexes that we attempt to keep in memory after an > initial query is performed. In order to prevent the indexes from becoming > stale, we check for changes about every minute by calling isCurrent(). If > the index has changed, we will then reopen it. > > From our logs it appears that in some cases isCurrent() will return true even > though the index has changed since the last time the reader was opened. > > The code to refresh the index is basically this: > > // Checked every minute > if(!reader.isCurrent()){ > // reopen the existing reader > reader = this.searcher.getIndexReader(); > reader = reader.reopen(); > } > > This is an example of the problem from the logs: > > 2009-08-29 17:50:51,387 Indexed 0 documents and deleted 1 documents from > index 'example' in 0 ms > 2009-08-30 03:11:58,410 Indexed 0 documents and deleted 5 documents from > index 'example' in 0 ms > 2009-08-30 16:30:03,466 Using cached reader lastRefresh=81415526> > // numbers indicate milliseconds since opened or refreshed aka age = 24.6hrs, > lastRefresh = 22.6hrs > > The logs indicate we deleted documents from the index at about 5:50 on August > 29th, and then again on the 30th at 3:11. Then at 4:30 on we attempted to > query the index. We found the cached reader and used it, however, the last > time the cache was refreshed was about 22 hours previously, coinciding with > the first delete. The index should have been reopened after the second > delete. > > I have checked, and the code to refresh the indexes is definitely being run > every 60 seconds. All I can see is that the problem might be with the > isCurrent() method. > > Could it be due to holding the reader open for so long? Any other ideas? > > Thanks a lot, > Nick Bailey > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to avoid huge index files
Another alternative is storing the indexes in the Google Datastore, I think Compass already supports that (though I have not used it). Also, I have successfully run Lucene on GAE using GaeVFS (http://code.google.com/p/gaevfs/) to store the index in the Datastore. (I developed a Lucene Directory implementation on top of GaeVFS that's available at http://sf.net/contrail). > Dvora wrote: > > > > Hello, > > > > I'm using Lucene2.4. I'm developing a web application that using Lucene > > (via compass) to do the searches. > > I'm intending to deploy the application in Google App Engine > > (http://code.google.com/appengine/), which limits files length to be > > smaller than 10MB. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Extending Sort/FieldCache
I think CSF hasn't been implemented because it's only marginally useful yet requires fairly significant rewrites of core code (i.e. SegmentMerger) so no one's picked it up including myself. An interim solution that fulfills the same function (quickly loading field cache values) using what works reliably today (i.e. payloads) is a good simple forward moving step. Shai, feel free to open an issue and post your code. I'd will check it out and help where possible. On Tue, Sep 8, 2009 at 8:46 PM, Shai Erera wrote: > I didn't say we won't need CSF, but that at least conceptually, CSF and my > sort-by-payload are the same. If however it turns out that CSF performs > better, then I'll definitely switch my sort-by-payload package to use it. I > thought that CSF is going to be implemented using payloads, but perhaps I'm > wrong. > > Shai > > On Wed, Sep 9, 2009 at 1:39 AM, Yonik Seeley > wrote: > >> On Sun, Sep 6, 2009 at 4:42 AM, Shai Erera wrote: >> >> I've resisted using payloads for this purpose in Solr because it felt >> >> like an interim hack until CSF is implemented. >> > >> > I don't see it as a hack, but as a proper use of a great feature in >> Lucene. >> >> It's proper use for an application perhaps, but not for core Lucene. >> Applications are pretty much required to work with what's given in >> Lucene... but Lucene developers can make better choices. Hence if at >> all possible, work should be put into implementing CSF rather than >> sorting by payloads. >> >> > CSF and this are essentially the same. >> >> In which case we wouldn't need CSF? >> >> -Yonik >> http://www.lucidimagination.com >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Chinese Japanese Korean Indexing issue Version 2.4
Hi, We are trying to index HTML files which have Japanese / Korean / Chinese content using the CJKAnalyzer. But while indexing we are getting a lexical parse error ("Encountered unknown character"). We tried setting the string encoding to UTF-8 but it does not help. Can anyone please help. Any pointers will be highly appreciated. Thanks -- View this message in context: http://www.nabble.com/Chinese-Japanese-Korean-Indexing-issue-Version-2.4-tp25388003p25388003.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Chinese Japanese Korean Indexing issue Version 2.4
To add some more context: I am able to index English and Western European languages. asitag wrote: > Hi, > > We are trying to index html files which have japanese / korean / chinese > content using the CJK analyser. But while indexing we are getting Lexical > parse error. Encountered unkown character. We tried setting the string > encoding to UTF 8 but it does not help. > > Can anyone please help. Any pointers will be highly appreciated. > > Thanks > -- View this message in context: http://www.nabble.com/Chinese-Japanese-Korean-Indexing-issue-Version-2.4-tp25388003p25388078.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
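No resolution appears in the thread, but the symptom usually means the bytes are decoded with the platform default charset before they reach the analyzer (the error wording also suggests it may come from the HTML parsing step rather than Lucene itself, which is worth checking). A minimal check, assuming the writer was opened with new CJKAnalyzer(), the HTML tags are already stripped, the file really is UTF-8, and the field name is a placeholder:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

class CjkIndexing {
  void addPage(IndexWriter writer, String path) throws IOException {
    // Decode explicitly as UTF-8. new FileReader(path) would use the platform
    // default charset and can mangle CJK text before the analyzer ever sees it.
    Reader content = new BufferedReader(
        new InputStreamReader(new FileInputStream(path), "UTF-8"));
    Document doc = new Document();
    doc.add(new Field("contents", content));
    writer.addDocument(doc);
    content.close();
  }
}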
Re: How to avoid huge index files
Is it possible to upload to GAE an already existing index? My index is data I've been collecting for a long time, and I prefer not to give it up. ted stockwell wrote: > > Another alternative is storing the indexes in the Google Datastore, I > think Compass already supports that (though I have not used it). > > Also, I have successfully run Lucene on GAE using GaeVFS > (http://code.google.com/p/gaevfs/) to store the index in the Datastore. > (I developed a Lucene Directory implementation on top of GaeVFS that's > available at http://sf.net/contrail). > > > >> Dvora wrote: >> > >> > Hello, >> > >> > I'm using Lucene2.4. I'm developing a web application that using Lucene >> > (via compass) to do the searches. >> > I'm intending to deploy the application in Google App Engine >> > (http://code.google.com/appengine/), which limits files length to be >> > smaller than 10MB. > > > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > -- View this message in context: http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25389394.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to avoid huge index files
Not at the moment. Actually, I'm already working on a remote copy utility for gaevfs that will upload large files and folders but the first cut is about a week away. - Original Message > From: Dvora > To: java-user@lucene.apache.org > Sent: Thursday, September 10, 2009 2:18:35 PM > Subject: Re: How to avoid huge index files > > > Is it possible to upload to GAE an already exist index? My index is data I'm > collecting for long time, and I prefer not to give it up. > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Index docstore flush problem
I'm seeing a strange exception when indexing using the latest Solr rev on EC2. org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs vs 298404 length in bytes of _0.fdx at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153) at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:268) at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86) at org.apache.solr.hadoop.SolrRecordWriter$1.run(SolrRecordWriter.java:239) Caused by: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs vs 298404 length in bytes of _0.fdx at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141) ... 3 more Caused by: java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs vs 298404 length in bytes of _0.fdx at org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:95) at org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50) at org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:380) at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:574) at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4212) at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4110) at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4101) at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2108) at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2071) at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2035) at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:215) at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:180) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:404) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85) at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:105) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:48) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299) at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139) ... 3 more - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Index docstore flush problem
That's an odd exception. It means IndexWriter thinks 468 docs have been written to the stored fields file, which should mean the fdx file size is 3748 (= 4 + 468*8), yet the file size is far larger than that (298404). How repeatable is it? Can you turn on infoStream, get the exception to happen, then post the resulting output? Mike On Thu, Sep 10, 2009 at 7:19 PM, Jason Rutherglen wrote: > I'm seeing a strange exception when indexing using the latest Solr rev on EC2. > > org.apache.solr.client.solrj.SolrServerException: > org.apache.solr.client.solrj.SolrServerException: > java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs > vs 298404 length in bytes of _0.fdx > at > org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153) > at > org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:268) > at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86) > at > org.apache.solr.hadoop.SolrRecordWriter$1.run(SolrRecordWriter.java:239) > Caused by: org.apache.solr.client.solrj.SolrServerException: > java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs > vs 298404 length in bytes of _0.fdx > at > org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141) > ... 3 more > Caused by: java.lang.RuntimeException: after flush: fdx size mismatch: > 468 docs vs 298404 length in bytes of _0.fdx > at > org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:95) > at > org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50) > at > org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:380) > at > org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:574) > at > org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4212) > at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4110) > at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4101) > at > org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2108) > at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2071) > at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2035) > at > org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:215) > at > org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:180) > at > org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:404) > at > org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85) > at > org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:105) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:48) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299) > at > org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139) > ... 3 more > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Index docstore flush problem
Indexing locking was off, there was a bug higher up clobbering the index. Sorry and thanks! On Thu, Sep 10, 2009 at 4:49 PM, Michael McCandless wrote: > That's an odd exception. It means IndexWriter thinks 468 docs have > been written to the stored fields file, which should mean the fdx file > size is 3748 (= 4 + 468*8), yet the file size is far larger than that > (298404). > > How repeatable is it? Can you turn on infoStream, get the exception > to happen, then post the resulting output? > > Mike > > On Thu, Sep 10, 2009 at 7:19 PM, Jason Rutherglen > wrote: >> I'm seeing a strange exception when indexing using the latest Solr rev on >> EC2. >> >> org.apache.solr.client.solrj.SolrServerException: >> org.apache.solr.client.solrj.SolrServerException: >> java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs >> vs 298404 length in bytes of _0.fdx >> at >> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153) >> at >> org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:268) >> at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86) >> at >> org.apache.solr.hadoop.SolrRecordWriter$1.run(SolrRecordWriter.java:239) >> Caused by: org.apache.solr.client.solrj.SolrServerException: >> java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs >> vs 298404 length in bytes of _0.fdx >> at >> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141) >> ... 3 more >> Caused by: java.lang.RuntimeException: after flush: fdx size mismatch: >> 468 docs vs 298404 length in bytes of _0.fdx >> at >> org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:95) >> at >> org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50) >> at >> org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:380) >> at >> org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:574) >> at >> org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4212) >> at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4110) >> at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4101) >> at >> org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2108) >> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2071) >> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2035) >> at >> org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:215) >> at >> org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:180) >> at >> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:404) >> at >> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85) >> at >> org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:105) >> at >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:48) >> at >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299) >> at >> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139) >> ... 
3 more >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
quick survey on schema less database usage
I am an MIT student doing a project on schema-less database usage and would greatly appreciate it if you could fill out a quick survey on this (it should take < 5 mins): http://bit.ly/nosqldb -- View this message in context: http://www.nabble.com/quick-survey-on-schema-less-database-usage-tp25394429p25394429.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org