QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load
The following test case runs endlessly and produces VERY heavy load.

    ...
    String text = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut "
        + "labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et "
        + "ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. "
        + "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt "
        + "ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores "
        + "et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet";
    // Replace every run of whitespace with a single wildcard.
    String query = text.replaceAll( "\\s+", "*" );
    try {
        QueryParserUtil.parse( query, new String[] { "test" }, new Occur[] { Occur.MUST }, new KeywordAnalyzer() );
    } catch ( Exception e ) {
        Assert.fail( e.getMessage() );
    }
    ...

I'm not claiming this test case makes sense; nevertheless, the question remains: is this a bug or a feature?

Context: Lucene 4.7.2, Java 6

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
SortedDocValuesField
Hi,

I was comparing the sort performance of a SortedDocValuesField with that of a regular field such as a StringField. I put the same string/BytesRef value in both fields and ran the two sorts in separate JVM processes. I used a RAMDirectory and created a million items.

The SortedDocValuesField sort took 12-13 seconds and consumed approximately 500-550 MB of RAM, whereas the StringField sort took 10-11 seconds and consumed 350-400 MB of RAM.

Is this normal behavior? I was expecting the SortedDocValuesField to perform better, since it is indexed for sorting and not stored for any other purpose.

---
Thanks n Regards,
Sandeep Ramesh Khanzode
Re: SortedDocValuesField
Don't use RAMDirectory: it's not very performant and is really intended for things like testing.

Also, using a RAMDirectory here defeats the purpose: the idea behind using a doc values field in most cases is to keep (most of) such data structures out of heap memory. The data structures, and even the compression used, are optimized for mmap and NIO access...
Re: QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load
I'll defer to the hard-core Lucene committers for the technical details, but I would suggest that a very large term with dozens of wildcards is a known limitation (albeit not a well-documented one). In other words, to use wildcards in Lucene in a performant manner, they need to be brief.

-- Jack Krupansky
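One way to act on the "keep wildcards brief" advice is to reject oversized wildcard terms before they ever reach the parser. The following is only a minimal sketch under stated assumptions: `WildcardGuard`, its method names, and the limit of 5 are hypothetical and are not part of Lucene; the only Lucene-specific fact relied on is that `*` and `?` are wildcards and that a backslash escapes the following character in the query syntax.

```java
// Hypothetical pre-parse guard; not a Lucene class.
final class WildcardGuard {

    // Counts unescaped '*' and '?' characters in a raw query string.
    static int countWildcards(String query) {
        int count = 0;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character that follows the backslash
            } else if (c == '*' || c == '?') {
                count++;
            }
        }
        return count;
    }

    // Returns true when the query stays within the (arbitrary) wildcard budget.
    static boolean isAcceptable(String query, int maxWildcards) {
        return countWildcards(query) <= maxWildcards;
    }

    public static void main(String[] args) {
        String query = "Lorem ipsum dolor sit amet".replaceAll("\\s+", "*");
        int maxWildcards = 5; // assumed limit, tune for your own application
        System.out.println(query + " -> wildcards: " + countWildcards(query)
                + ", acceptable: " + isAcceptable(query, maxWildcards));
    }
}
```

A caller would run the check first and only hand the string to the query parser when `isAcceptable` returns true. The right threshold depends on the index and the application; the value above is purely illustrative.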
Batch wise Indexing Structured Documents
Hi,

I have to index millions of files, which is why I think batch-wise indexing would be a good fit. Is batch indexing possible with Lucene? If so, could you provide a sample snippet? I would appreciate any suggestions.

Thanks,
Venkata krishna tolusuri.

-- View this message in context: http://lucene.472066.n3.nabble.com/Batch-wise-Indexing-Structured-Documents-tp4144264.html
Re: QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load
I suspect you're getting leading wildcard searches as well, which must do entire term scans unless you're doing the reverse trick.

Replacing all successive whitespace gives you:

Lorem*ipsum*dolor*sit*amet,*consetetur*sadipscing*elitr,*sed*diam*nonumy*eirmod*tempor*invidunt*ut*labore*et*dolore*magna*aliquyam*erat,*sed*diam*voluptua.*At*vero*eos*et*accusam*et*justo*duo*dolores*et*ea*rebum.*Stet*clita*kasd*gubergren,*no*sea*takimata*sanctus*est*Lorem*ipsum*dolor*sit*amet.*Lorem*ipsum*dolor*sit*amet,*consetetur*sadipscing*elitr,*sed*diam*nonumy*eirmod*tempor*invidunt*ut*labore*et*dolore*magna*aliquyam*erat,*sed*diam*voluptua.*At*vero*eos*et*accusam*et*justo*duo*dolores*et*ea*rebum.*Stet*clita*kasd*gubergren,*no*sea*takimata*sanctus*est*Lorem*ipsum*dolor*sit*amet

Note, no spaces. Then you're pushing it through the KeywordTokenizer, which does essentially nothing. What a term!

Your point is valid, however; why this is taking so long I don't quite know. But I tend to agree that it's such an edge case that the hard-core FST guys would look at it for curiosity's sake only.

Best,
Erick
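The effect of the whitespace replacement described above can be checked with plain Java, without Lucene on the classpath. This is a small sketch using a shortened sample text; the class and method names are made up for illustration.

```java
// Illustrative only: shows how replaceAll("\\s+", "*") turns a sentence
// into one long term with a wildcard between every pair of words.
final class WildcardTermDemo {

    // Collapses each run of whitespace into a single '*'.
    static String collapse(String text) {
        return text.replaceAll("\\s+", "*");
    }

    // Counts the '*' characters in the resulting term.
    static int countStars(String term) {
        int count = 0;
        for (int i = 0; i < term.length(); i++) {
            if (term.charAt(i) == '*') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr";
        String term = collapse(text);
        System.out.println(term);
        System.out.println("wildcards: " + countStars(term));
    }
}
```

Because no whitespace survives, an analyzer that does not split its input sees the whole string as a single token, so every word boundary in the original text becomes one more wildcard inside the same term.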
Re: Batch wise Indexing Structured Documents
Download the Lucene source code and check the demo source files that ship with it; you should find a sample indexing file there.
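Independently of the demo sources, the batching itself can be sketched in plain Java. The example below is an assumption-laden illustration, not Lucene code: `BatchRunner` and `BatchHandler` are hypothetical names, and the handler is only a placeholder. In a real indexer the handler would typically call `IndexWriter.addDocument` for each item and `IndexWriter.commit` once per batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that feeds items to a handler in fixed-size batches.
final class BatchRunner {

    // Callback invoked once per batch; placeholder for the actual indexing work.
    interface BatchHandler<T> {
        void handle(List<T> batch);
    }

    // Splits the items into batches of at most batchSize and returns the batch count.
    static <T> int forEachBatch(Iterable<T> items, int batchSize, BatchHandler<T> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<T>(batchSize);
        int batches = 0;
        for (T item : items) {
            batch.add(item);
            if (batch.size() == batchSize) {
                handler.handle(batch);
                batches++;
                batch = new ArrayList<T>(batchSize); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) { // flush the final, partially filled batch
            handler.handle(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt", "e.txt");
        int batches = forEachBatch(files, 2, new BatchHandler<String>() {
            public void handle(List<String> batch) {
                // Placeholder: index the files here, then commit once per batch.
                System.out.println("indexing batch: " + batch);
            }
        });
        System.out.println("batches: " + batches);
    }
}
```

Committing once per batch rather than once per document keeps the number of commits small; the batch size is an application-level choice.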
Re: QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load
The test case is only parsing this query, not trying to run it, right? So it doesn't involve automaton/FST ... just the flexible query parser code?

It seems bad that the flexible QP would take so long, even if the query is strange.

Can you open an issue, and maybe attach a thread dump so we can see where it's spending its time?

Thanks.

Mike McCandless
http://blog.mikemccandless.com
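For a call that never returns, a thread dump can be captured with the JDK's `jstack` tool or programmatically. The following is a rough sketch using only the standard library: it runs a task on a worker thread and, if the task exceeds a timeout, collects the stack traces of all threads via `ThreadMXBean.dumpAllThreads`. The class name, the thread name `parse-worker`, and the timeout values are assumptions; the sleeping task in `main` is merely a stand-in for the `QueryParserUtil.parse` call from the test case.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task and returns a thread dump if it times out.
final class ThreadDumpOnTimeout {

    // Formats the name, state and stack frames of every live thread.
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // Returns null if the task finishes in time, otherwise a dump taken at the timeout.
    static String runWithTimeout(Runnable task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "parse-worker");
                t.setDaemon(true); // a stuck task must not keep the JVM alive
                return t;
            }
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return null;
        } catch (TimeoutException e) {
            String dump = dumpAllThreads(); // capture while the task is still running
            future.cancel(true);
            return dump;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String dump = runWithTimeout(new Runnable() {
            public void run() {
                // Stand-in for the slow call, e.g. the QueryParserUtil.parse(...) invocation.
                try {
                    Thread.sleep(10000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, 500);
        System.out.println(dump == null ? "finished in time" : dump);
    }
}
```

Note that cancelling the future only interrupts the worker; code that never checks the interrupt flag will keep running, which is why the worker is created as a daemon thread here. The frames listed under `parse-worker` are the part worth attaching to an issue.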