QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load

2014-06-26 Thread Clemens Wyss DEV
The following test case runs endlessly and produces VERY heavy load.
...
String query = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut "
    + "labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et "
    + "ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. "
    + "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt "
    + "ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores "
    + "et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet";
// Collapse every run of whitespace into a single '*' wildcard.
query = query.replaceAll( "\\s+", "*" );
try
{
    QueryParserUtil.parse( query, new String[] { "test" }, new Occur[] { Occur.MUST }, new KeywordAnalyzer() );
}
catch ( Exception e )
{
    Assert.fail( e.getMessage() );
}
...
I'm not saying this test case makes sense; nevertheless, the question remains 
whether this is a bug or a feature.

Context: Lucene 4.7.2, Java 6

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



SortedDocValuesField

2014-06-26 Thread Sandeep Khanzode
Hi,
 
I was comparing the sort performance of SortedDocValuesField with that of a 
plain StringField. I put the same string/BytesRef value in both fields and ran 
the two sorts in separate JVM processes.

I used a RAMDirectory and created a million items. The SortedDocValuesField 
sort took 12-13 seconds and consumed approximately 500-550 MB of RAM, whereas 
the StringField sort took 10-11 seconds and consumed 350-400 MB of RAM. 
Is this normal behavior? I was expecting the SortedDocValuesField to perform 
better, since it is indexed for sorting and not stored for any other purpose.

---

Thanks n Regards,
Sandeep Ramesh Khanzode

Re: SortedDocValuesField

2014-06-26 Thread Robert Muir
Don't use RAMDirectory: it's not very performant and is really intended
for e.g. testing and so on.

Also, using a RAMDirectory here defeats the purpose: the idea behind
using a doc values field in most cases is to keep (most of) such
data structures out of heap memory. The data structures and even the
compression used are optimized for mmap and NIO access...






Re: QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load

2014-06-26 Thread Jack Krupansky
I'll defer to the hard-core Lucene committers for the technical details, 
but I would suggest that a very large term with dozens of wildcards is a 
known limitation (albeit not well documented). In other words, to use 
wildcards in Lucene in a performant manner, they need to be brief.


-- Jack Krupansky




Batch wise Indexing Structured Documents

2014-06-26 Thread Venkata krishna
Hi,

I have to index millions of files, so I think indexing them in batches would
be a good approach.

Is it possible to do batch indexing with Lucene?

If so, could you provide a sample snippet?

I would appreciate any suggestions.


Thanks

Venkata krishna tolusuri.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Batch-wise-Indexing-Structured-Documents-tp4144264.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load

2014-06-26 Thread Erick Erickson
I suspect you're getting leading wildcard searches as well, which must
do entire term scans unless you're doing the reverse trick.

Replacing all successive whitespace gives you:
Lorem*ipsum*dolor*sit*amet,*consetetur*sadipscing*elitr,*sed*diam*nonumy*eirmod*tempor*invidunt*ut*labore*et*dolore*magna*aliquyam*erat,*sed*diam*voluptua.*At*vero*eos*et*accusam*et*justo*duo*dolores*et*ea*rebum.*Stet*clita*kasd*gubergren,*no*sea*takimata*sanctus*est*Lorem*ipsum*dolor*sit*amet.*Lorem*ipsum*dolor*sit*amet,*consetetur*sadipscing*elitr,*sed*diam*nonumy*eirmod*tempor*invidunt*ut*labore*et*dolore*magna*aliquyam*erat,*sed*diam*voluptua.*At*vero*eos*et*accusam*et*justo*duo*dolores*et*ea*rebum.*Stet*clita*kasd*gubergren,*no*sea*takimata*sanctus*est*Lorem*ipsum*dolor*sit*amet

Note, no spaces. Then you're pushing it through the KeywordTokenizer
which does essentially nothing. What a term!
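
The effect of the whitespace replacement can be checked in isolation, without
Lucene. The following is only a minimal sketch: the class and method names
(WildcardTermCheck, toWildcardTerm, countWildcards) are made up for
illustration, and the short input stands in for the full Lorem ipsum text.

```java
final class WildcardTermCheck {

    /** Replaces each run of whitespace with a single '*', as the test case does. */
    static String toWildcardTerm(String text) {
        return text.replaceAll("\\s+", "*");
    }

    /** Counts the '*' characters in the given term. */
    static int countWildcards(String term) {
        int count = 0;
        for (int i = 0; i < term.length(); i++) {
            if (term.charAt(i) == '*') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the original text; note the double space and the tab.
        String term = toWildcardTerm("Lorem ipsum  dolor\tsit amet");
        System.out.println(term);                   // one token, no spaces left
        System.out.println(countWildcards(term));   // number of '*' in that token
        System.out.println(term.startsWith("*"));   // whether the term has a leading wildcard
    }
}
```

Because the input text begins with a word rather than with whitespace, the
resulting term starts with "Lorem" and every '*' sits inside the term.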

Your point is valid, however: why this is taking so long I don't quite
know. But I tend to agree that it's such an edge case that the
hard-core FST guys would look at it for curiosity's sake only.

Best,
Erick





Re: Batch wise Indexing Structured Documents

2014-06-26 Thread parnab kumar
Download the Lucene source code and check the demo source files that
ship with it; you should find a sample indexing file there.
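
Independent of the demo sources, the batching loop itself can be sketched in
plain Java. This is only an illustration under stated assumptions: BatchRunner
and BatchHandler are hypothetical names, not Lucene classes. In a real indexer
the handler would presumably call IndexWriter.addDocument for each item of the
batch and IndexWriter.commit once per batch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class BatchRunner {

    /** Hypothetical callback; an indexing handler would add each item as a document, then commit. */
    interface BatchHandler<T> {
        void handle(List<T> batch);
    }

    /** Feeds the items to the handler in chunks of at most batchSize; returns the number of batches. */
    static <T> int forEachBatch(Iterator<T> items, int batchSize, BatchHandler<T> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<T>(batchSize);
        int batches = 0;
        while (items.hasNext()) {
            batch.add(items.next());
            if (batch.size() == batchSize) {
                handler.handle(batch);
                batches++;
                batch = new ArrayList<T>(batchSize);  // fresh list, so handlers may keep the old one
            }
        }
        if (!batch.isEmpty()) {
            handler.handle(batch);  // final, possibly smaller batch
            batches++;
        }
        return batches;
    }
}
```

Note that IndexWriter already buffers added documents in memory before
flushing them, so explicit batching mainly controls how often a commit
happens, not whether documents are written one at a time.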






Re: QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load

2014-06-26 Thread Michael McCandless
The test case is only parsing this query, not trying to run it,
right?  So it doesn't involve automaton/FST ... just the flexible
query parser code?

It seems bad that flexible QP would take so long, even if the query is
strange.

Can you open an issue, and maybe attach a thread dump so we can see
where it's spending its time?  Thanks.

Mike McCandless

http://blog.mikemccandless.com
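
A thread dump is normally taken from outside the JVM, for example with the
JDK's jstack tool. For a test case that never returns, the stack of the stuck
thread can also be captured from within the test. The following is a minimal
sketch, not part of Lucene: ParseWatchdog and its methods are hypothetical
names, and the task passed in would wrap the QueryParserUtil.parse call.

```java
final class ParseWatchdog {

    /** Formats the current stack of the given thread, one frame per line. */
    static String stackOf(Thread thread) {
        StringBuilder sb = new StringBuilder();
        sb.append('"').append(thread.getName()).append("\" state=")
          .append(thread.getState()).append('\n');
        for (StackTraceElement frame : thread.getStackTrace()) {
            sb.append("    at ").append(frame).append('\n');
        }
        return sb.toString();
    }

    /**
     * Runs the task on a daemon thread named "parse-worker".
     * Returns null if the task finishes within timeoutMillis (must be > 0);
     * otherwise returns the worker's stack at that moment and interrupts it.
     */
    static String runWithWatchdog(Runnable task, long timeoutMillis) {
        Thread worker = new Thread(task, "parse-worker");
        worker.setDaemon(true);  // do not keep the JVM alive if the task never ends
        worker.start();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the caller's interrupt status
        }
        if (!worker.isAlive()) {
            return null;
        }
        String dump = stackOf(worker);
        worker.interrupt();  // only helps if the task checks for interruption
        return dump;
    }
}
```

If the returned string is non-null, the topmost frames show where the worker
was spending its time when the timeout expired, which is the information
asked for above.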

