Re: Can Solr handle large text files?
Has the performance of highlighting large text documents been improved in Solr 4? Thanks! Pete

On Nov 5, 2011, at 9:03 AM, Erick Erickson erickerick...@gmail.com wrote: Sure, if you write a custom update handler. But I'm not at all sure this is ideal. You're requiring all that data to be transmitted across the wire and processed by Solr. Assuming you have more than one input source, the Solr server in the background will be handling up to N documents simultaneously, plus the effort to index. I think I'd recommend splitting them up on the client side. Best, Erick

On Fri, Nov 4, 2011 at 3:23 AM, Peter Spam ps...@mac.com wrote: [...]
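A minimal sketch of the client-side splitting Erick recommends, using SolrJ. The 256k chunk size and the filename-plus-chunk-number id scheme come from the thread; the core URL, the field names, and the client class (HttpSolrServer, the SolrJ 4.x name; older releases called it CommonsHttpSolrServer) are assumptions, and the code is untested:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChunkedIndexer {
    static final int CHUNK = 256 * 1024; // 256k per Solr document (by character count here)

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        Path file = Paths.get(args[0]);
        String text = new String(Files.readAllBytes(file), "UTF-8");

        // One Solr document per 256k slice; the shared "filename" field lets
        // grouping reassemble the slices into one logical file at query time.
        for (int i = 0, n = 0; i < text.length(); i += CHUNK, n++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getFileName() + "-" + n);
            doc.addField("filename", file.getFileName().toString());
            doc.addField("body", text.substring(i, Math.min(i + CHUNK, text.length())));
            solr.add(doc);
        }
        solr.commit();
    }
}

This keeps the splitting cost on the client, so Solr only ever sees small documents. Note that fixed-offset splitting can cut a line (or a query phrase) in half at a chunk boundary, which is the spanning problem discussed later in the thread.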
Proper analyzer / tokenizer for syslog data?
Example data:

01/23/2011 05:12:34 [Test] a=1; hello_there=50; data=[1,5,30%];

I would love to be able to just grep the data - i.e. if I search for ello, it finds and returns ello, and if I search for hello_there=5, it would match too. Here's what I'm using now:

<fieldType name="text_sy" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

The problem with this is that if I search for a substring, I don't get anything back. For example, searching for ello or *ello* returns nothing. Any ideas?

http://localhost:8983/solr/select?q=*ello*&start=0&rows=50&hl.maxAnalyzedChars=2147483647&hl.useFastVectorHighlighter=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400

Thanks! Pete
Re: Can Solr handle large text files?
Solr 4.0 (11/1 snapshot)
Data: 80k files, average size 2.5MB, largest is 750MB
Solr: each document is max 256k; total docs = 800k
Machine: Early 2009 Mac Pro, 6GB RAM, 1GB min / 2GB max given to Solr's Java; Admin shows 30% mem usage

I originally tried injecting the entire file into a single Solr document, and this had disastrous results when trying to highlight. I've now tried splitting each file into 256k segments per Solr document, and the results are better, but still not what I was hoping for. Queries are around 2-8 seconds, with some reaching into 30+ second territory. Ideally, I'd like to feed Solr the metadata and the entire file at once, and have the back-end split the file into thousands of pieces. Is this possible? Thanks! Pete

On Nov 1, 2011, at 5:15 PM, Peter Spam wrote: [...]
Re: Proper analyzer / tokenizer for syslog data?
Wow, I tried with minGramSize=1 and maxGramSize=1000 (I want someone to be able to search on any substring, just like grep), and the index is multiple orders of magnitude larger than my data! There's got to be a better way to support full grep-like searching? Thanks! Pete

On Nov 4, 2011, at 1:20 AM, Ahmet Arslan wrote: [...] For sub-string matching, NGramFilterFactory is required at index time:

<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>

Plus you may want to use WhitespaceTokenizer instead of StandardTokenizerFactory. The Analysis admin page displays the behavior of each tokenizer.
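Assembled from Ahmet's two suggestions, a complete field type might look like the sketch below (the name text_grep is hypothetical). Grams are generated only on the index side; the bounded maxGramSize is deliberate, since the unbounded maxGramSize=1000 attempt above is exactly what inflated the index:

<fieldType name="text_grep" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- every 1..15-character substring of each token becomes an indexed term -->
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With this, a plain query for ello matches any token containing that substring (up to the gram length), with no leading-wildcard query needed.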
Re: Can Solr handle large text files?
Wow, 50 lines is tiny! Is that how small you need to go, to get good highlighting performance? I'm looking at documents that can be up to 800MB in size, so I've decided to split them down into 256k chunks. I'm still indexing right now - I'm curious to see how performance is when the injection is finished. Has anyone done analysis on where the knee in the curve is, wrt document size vs. # of documents? Thanks! Pete

On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote: Hi, basically I need to index very large log files. I have modified the ExtractingDocumentLoader to create a new document for every 50 lines of the log file being indexed (this is made configurable via a system property). The 'filename' field for documents created from one log file is kept the same, and a unique id is generated by appending the line numbers to the file name, e.g. 'log.txt (line no. 100-150)'. Each doc is given a custom score, stored in a field called 'custom_score', which is directly proportional to its distance from the beginning of the file. I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 lines per document, the default max chunk size works for me, but it can easily be adjusted depending upon the number of lines you read per doc. I then group on the 'filename' field and show the results from the docs with the highest score; as a result, I am able to show the last matching results from the log file. The query parameters I am using for search are:

http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&bf=sub(1000,caprice_score)&group=true&group.field=FileName

Results are amazing - I am able to index and search very large log files (a few 100 MBs) with very low memory requirements. Highlighting is also working fine. Thanks & Regards, Anand

Anand Nigam, RBS Global Banking & Markets, Office: +91 124 492 5506

-----Original Message----- From: Peter Spam [mailto:ps...@mac.com] Sent: 21 October 2011 23:04 Subject: Re: Can Solr handle large text files? [...]
Re: Can Solr handle large text files?
Oh, by the way - what analyzer are you using for your log files? Here's what I'm trying:

<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

Thanks! Pete

On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote: [...]
Re: Can Solr handle large text files?
Thanks for the reminder - I had that set to 214xxx... (the max), but perf was terrible when I injected large files. So what's the max recommended field size in kb? I can try chopping up the syslogs into arbitrarily small pieces, but would love to know where to start. Thanks! Sent from my iPhone

On Oct 23, 2011, at 2:01 PM, Erick Erickson erickerick...@gmail.com wrote: Also be aware that by default Solr is configured to only index the first 10,000 tokens of text. See maxFieldLength in solrconfig.xml. Best, Erick

On Fri, Oct 21, 2011 at 7:34 PM, Peter Spam ps...@mac.com wrote: [...]
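For reference, the setting Erick mentions is a per-field token cap in solrconfig.xml; the "214xxx..." value Pete refers to is presumably Integer.MAX_VALUE. A sketch, valid for the Solr 1.x/3.x config format (the setting was later deprecated in favor of LimitTokenCountFilterFactory):

<!-- solrconfig.xml: index at most this many tokens per field (default 10000) -->
<maxFieldLength>2147483647</maxFieldLength>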
Re: Sorting fields with letters?
Tried using the ord() function, but it was the same as the standard sort. Do I just need to bite the bullet and reindex everything? Thanks! Pete

On Oct 21, 2011, at 5:26 PM, Tomás Fernández Löbbe wrote: I don't know if you'll find exactly what you need, but you can sort by any field or FunctionQuery. See http://wiki.apache.org/solr/FunctionQuery

On Fri, Oct 21, 2011 at 7:03 PM, Peter Spam ps...@mac.com wrote: [...]
Re: Can Solr handle large text files?
Thanks for the response, Karsten. 1) What's the recommended maximum chunk size? 2) Does my tokenizer look reasonable? Thanks! Pete

On Oct 21, 2011, at 2:28 AM, karsten-s...@gmx.de wrote: Hi Peter, highlighting in large text files cannot be fast without dividing the original text into small pieces. So take a look at http://xtf.cdlib.org/documentation/under-the-hood/#Chunking and at http://www.lucidimagination.com/blog/2010/09/16/2446/ Which means that you should divide your files and use Result Grouping / Field Collapsing to list only one hit per original document. (xtf would also solve your problem out of the box, but xtf does not use solr.) Best regards, Karsten

-------- Original message -------- Date: Thu, 20 Oct 2011 17:59:04 -0700 From: Peter Spam ps...@mac.com To: solr-user@lucene.apache.org Subject: Can Solr handle large text files? [...]
Re: Can Solr handle large text files?
Thanks for your note, Anand. What was the maximum chunk size for you? Could you post the relevant portions of your configuration file? Thanks! Pete

On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote: Hi, I was also facing the issue of highlighting large text files. I applied the solution proposed here and it worked, but I am getting the following error: basically 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I get this file from? Its reference is present in browse.vm:

<div class="results">
  #if($response.response.get('grouped'))
    #foreach($grouping in $response.response.get('grouped'))
      #parse("hitGrouped.vm")
    #end
  #else
    #foreach($doc in $response.results)
      #parse("hit.vm")
    #end
  #end
</div>

HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config

java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
  at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
  at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
  at org.apache.velocity.Template.process(Template.java:98)
  at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
  at [...]

Thanks & Regards, Anand

Anand Nigam, RBS Global Banking & Markets, Office: +91 124 492 5506

-----Original Message----- From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de] Sent: 21 October 2011 14:58 Subject: Re: Can Solr handle large text files? [...]
Sorting fields with letters?
Hi everyone, I have a field that has a letter in it (for example, 1A1, 2A1, 11C15, etc.). Sorting it seems to work most of the time, except for a few things, like 10A1 sorting lower than 8A100, and 10A100 sorting lower than 10A99. Any ideas? I bet if my data had leading zeros (i.e. 10A099), it would behave better? (But I can't really change my data now, as it would take a few days to re-inject - which is possible, but a hassle.) Thanks! Pete
Re: Sorting fields with letters?
Is there a way to use a custom sorter, to avoid re-indexing? Thanks! Pete

On Oct 21, 2011, at 2:13 PM, Tomás Fernández Löbbe wrote: Well, yes. You probably have a string field for that content, right? So the content is being compared as strings, not as numbers; that's why something like 1000 sorts lower than 2. Leading zeros would be an option. Another option is to separate the field into numeric fields and sort by those (this last option is only recommended if your data always look similar). Something like 11C15 to field1: 11, field2: C, field3: 15, then use sort=field1,field2,field3. Anyway, both of these options require reindexing. Regards, Tomás

On Fri, Oct 21, 2011 at 4:57 PM, Peter Spam ps...@mac.com wrote: [...]
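To make Tomás's leading-zeros option concrete, here is a small hypothetical helper (not from the thread) that pads every digit run to a fixed width, so that plain string order matches numeric order; the padded value would go into a separate string field used only for sorting:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SortKey {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // "10A99" -> "00010A00099": lexicographic order now equals numeric order
    static String padded(String label) {
        Matcher m = DIGITS.matcher(label);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, String.format("%05d", Integer.parseInt(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(padded("8A100")); // 00008A00100
        System.out.println(padded("10A1"));  // 00010A00001, which now sorts after 8A100
    }
}

As Tomás notes, anything along these lines still requires reindexing, since the sort key has to exist in the index.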
Can Solr handle large text files?
I have about 20k text files, some very small, but some up to 300MB, and would like to do text searching with highlighting. Imagine the text is the contents of your syslog. I would like to type in some terms, such as error and mail, and have Solr return the syslog lines with those terms PLUS two lines of context. Pretty much just like Google's highlighting.

1) Can Solr handle this? I had extremely long query times when I tried this with Solr 1.4.1 (yes, I was using TermVectors, etc.). I tried breaking the files into 1MB pieces, but searching would be wonky => returns the wrong number of documents (i.e. if one file had a term 5 times, and that was the only file that had the term, I want 1 result, not 5 results).

2) What sort of tokenizer would be best? Here's what I'm using:

<field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>

<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

Thanks! Pete
Re: Dismax Request handler and Solrconfig.xml
I'm having the same problem - the standard query returns all my documents, but the dismax one returns 0. Any ideas?

http://server:8983/solr/select?qt=standard&indent=on&q=*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3592</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="qt">standard</str>
      <str name="q">*</str>
    </lst>
  </lst>
  <result name="response" numFound="9108" start="0">
    <doc>
    [...]

http://server:8983/solr/select?qt=dismax&indent=on&q=*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">10</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="qt">dismax</str>
      <str name="q">*</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
</response>

On Sep 29, 2010, at 2:31 PM, Chris Hostetter wrote:

: In Solrconfig.xml, default request handler is set to standard. I am
: planning to change that to use dismax as the request handler but when I
: set default=true for dismax - Solr does not return any results - I get
: results only when I comment out <str name="defType">dismax</str>.

you need to elaborate on what you mean by "does not return any results" ... doesn't return results for what exactly? what do your requests look like? (ie full URLs with all params) what do you expect to get back? what URLs are you using when you don't use defType=dismax? what do you get back then? not setting defType means you are getting the standard LuceneQParser instead of the DismaxQParser, which means the qf param is being ignored and the defaultSearchField is being used instead. are the terms you are searching for in your default search field but not in your title or pagedescription field? Please note these guidelines: http://wiki.apache.org/solr/UsingMailingLists#Information_useful_for_searching_problems -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
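For what it's worth, one likely explanation for the numFound=0 above (not confirmed in the thread): dismax has no wildcard support, so q=* is treated as a literal term and matches nothing. The conventional match-all form under dismax is the q.alt parameter, which is parsed with standard-query syntax when q is absent, along these lines:

http://server:8983/solr/select?qt=dismax&q.alt=*:*&indent=on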
Re: How to Update Value of One Field of a Document in Index?
My schema: id, name, checksum, body, notes, date. I'd like a user to be able to add notes to the notes field, without re-indexing the whole document (since the body field may contain 100MB of text). Some ideas:

1) How about creating another core which only contains id, checksum, and notes? Then updating (a delete followed by an add) wouldn't be that painful?
2) What about using a multiValued field? Could you just keep adding values as the user enters more notes?

Pete

On Sep 9, 2010, at 11:06 PM, Liam O'Boyle wrote: Hi Savannah, you can only reindex the entire document; if you only have the ID, then do a search to retrieve the rest of the data, then reindex. This assumes that all of the fields you need to index are stored (so that you can retrieve them) and not just indexed. Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett savannah_becket...@yahoo.com wrote: I use nutch to crawl and index to Solr. My code is working. Now, I want to update the value of one of the fields of a document in the solr index after the document was already indexed, and I have only the document id. How do I do that? Thanks.
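A sketch of the fetch-and-reindex cycle Liam describes, in SolrJ (untested; the id "doc42", the core URL, and the client class HttpSolrServer, named CommonsHttpSolrServer in SolrJ of that era, are placeholders, and every field is assumed stored):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AddNote {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // 1) Fetch the stored copy of the document (all fields must be stored).
        SolrDocument old = solr.query(new SolrQuery("id:doc42")).getResults().get(0);

        // 2) Copy every stored field into a fresh input document.
        SolrInputDocument doc = new SolrInputDocument();
        for (String field : old.getFieldNames()) {
            doc.addField(field, old.getFieldValues(field));
        }

        // 3) Append the new note; with a multiValued notes field (idea #2 above),
        //    each user note simply becomes one more value.
        doc.addField("notes", "user-entered note text");

        // 4) Re-adding a document with an existing uniqueKey replaces the old one.
        solr.add(doc);
        solr.commit();
    }
}

The cost of step 1 and step 4 is dominated by the 100MB body field, which is exactly why the separate notes core in idea #1 may be more attractive.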
Re: Tips for getting unique results?
Thanks for the note, Shaun, but the documentation indicates that the sorting is only in ascending order :-(

facet.sort - This param determines the ordering of the facet field constraints.
• count - sort the constraints by count (highest count first)
• index - return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, index otherwise. Prior to Solr 1.4, one needed to use true instead of count and false instead of index. This parameter can be specified on a per-field basis.

-Pete

On Apr 8, 2011, at 2:49 AM, Shaun Campbell wrote: Pete, surely the default sort order for facets is by descending count order - see http://wiki.apache.org/solr/SimpleFacetParameters. If your results are really sorted in ascending order, can't you sort them externally, e.g. in Java? Hope that helps. Shaun
Re: Tips for getting unique results?
The data are fine and not duplicated - however, I want to analyze the data and summarize one field (kind of like faceting), to understand what the largest value is. For example:

Document 1: label=1A1A1; body=adfasdfadsfasf
Document 2: label=5A1B1; body=adfaasdfasdfsdfadsfasf
Document 3: label=1A1A1; body=adasdfasdfasdffaasdfasdfsdfadsfasf
Document 4: label=7A1A1; body=azxzxcvdfaasdfasdfsdfadsfasf
Document 5: label=7A1A1; body=azxzxcvdfaasdfasdfsdasdafadsfasf
Document 6: label=5A1B1; body=adfaasdfasdfsdfadsfasfzzz

How do I get back just ONE of the largest label items? In other words, what query will return the 7A1A1 label just once? If I search for q=* and sort the results, it works, except I get back multiple hits for each label. If I do a facet, I can only sort in increasing order, when what I want is decreasing order. -Pete

On Apr 6, 2011, at 10:22 PM, Otis Gospodnetic wrote: Hi, I think you are saying dupes are the main problem? If so, http://wiki.apache.org/solr/Deduplication ? Otis / Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch / Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----- From: Peter Spam ps...@mac.com To: solr-user@lucene.apache.org Sent: Thu, April 7, 2011 1:13:44 AM Subject: Tips for getting unique results? [...]
Re: Tips for getting unique results?
Would grouping solve this? I'd rather not move to a pre-release solr ... To clarify the problem: [...]

On Apr 7, 2011, at 10:02 AM, Erick Erickson wrote: What version of Solr are you using? And, assuming the version that has it in, have you seen grouping? Which is another way of asking why you want to do this, perhaps it's an XY problem. Best, Erick

On Thu, Apr 7, 2011 at 1:13 AM, Peter Spam ps...@mac.com wrote: [...]
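Assuming a version with result grouping (trunk/pre-release at the time of this thread), a sketch of the query Erick is hinting at: group on label, order the groups by label descending, and take the first five rows, so each row is one group, i.e. one unique label (field names taken from the example above, everything else assumed):

http://localhost:8983/solr/select?q=*:*&group=true&group.field=label&group.limit=1&sort=label+desc&rows=5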
Tips for getting unique results?
Hi, I have documents with a field that has 1A2B3C alphanumeric characters. I can query for * and sort results based on this field, however I'd like to uniq these results (remove duplicates) so that I can get the 5 largest unique values. I can't use the StatsComponent because my values have letters in them too. Faceting (and ignoring the counts) gets me half of the way there, but I can only sort ascending. If I could also sort facet results descending, I'd be done. I'd rather not return all documents and just parse the last few results to work around this. Any ideas? -Pete
Re: Solr searching performance issues, using large documents (now 1MB documents)
This is a very small number of documents (7000), so I am surprised Solr is having such a hard time with it!! I do facet on 3 terms. Subsequent "hello" searches are faster, but still well over a second. This is a very fast Mac Pro, with 6GB of RAM. Thanks, Peter

On Aug 25, 2010, at 9:52 AM, Yonik Seeley wrote:

On Wed, Aug 25, 2010 at 11:29 AM, Peter Spam ps...@mac.com wrote: So, I went through all the effort to break my documents into max 1 MB chunks, and searching for "hello" still takes over 40 seconds (searching across 7433 documents): 8 results (41980 ms). What is going on??? (scroll down for my config).

Are you still faceting on that query also? Breaking your docs into many chunks means inflating the doc count, and will make faceting slower. Also, first-time faceting (as with sorting) is slow... did you try another query after "hello" (and without a commit happening in between) to see if it was faster? -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Solr searching performance issues, using large documents
Still stuck on this - any hints on how to write the JavaScript to split a document? Thanks! -Pete

On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote: You may have to write your own javascript to read in the giant field and split it up.

On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam ps...@mac.com wrote: [...]
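Since the thread never shows a concrete example, here is an untested sketch of what Lance's DIH-plus-JavaScript suggestion might look like. Everything specific is illustrative, not from the thread: the paths, the field names, the splitDoc function, and in particular the assumption that a DIH script transformer may return a list of rows to emit multiple documents from one input:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <script><![CDATA[
    function splitDoc(row) {
      var text  = row.get('plainText');
      var chunk = 262144; /* 256k pieces, as discussed up-thread */
      var rows  = new java.util.ArrayList();
      for (var i = 0, n = 0; i < text.length(); i += chunk, n++) {
        var piece = new java.util.HashMap();
        piece.put('id', row.get('fileAbsolutePath') + '-' + n);
        piece.put('filename', row.get('fileAbsolutePath')); /* common group field */
        piece.put('body', text.substring(i, Math.min(i + chunk, text.length())));
        rows.add(piece);
      }
      return rows; /* assumed: a List of Maps is emitted as multiple documents */
    }
  ]]></script>
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/var/log/myapp" fileName=".*\.log" rootEntity="false">
      <entity name="log" processor="PlainTextEntityProcessor"
              url="${f.fileAbsolutePath}" transformer="script:splitDoc"/>
    </entity>
  </document>
</dataConfig>

The outer FileListEntityProcessor walks the log directory, PlainTextEntityProcessor reads each file into the implicit plainText field, and the script slices it into 256k documents sharing a common filename field for grouping.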
Re: Solr searching performance issues, using large documents
I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents. Any hints? :-) Thanks! -Peter

On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote: Spanning won't work - you would have to make overlapping mini-documents if you want to support this. I don't know how big the chunks should be - you'll have to experiment. Lance

On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam ps...@mac.com wrote: [...]
Re: Solr searching performance issues, using large documents
What would happen if a search query phrase spanned separate document chunks? Also, what would the optimal size of chunks be? Thanks! -Peter

On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote: Not that I know of. The DataImportHandler has the ability to create multiple documents from one input stream. It is possible to create a DIH file that reads large log files and splits each one into N documents, with the file name as a common field. The DIH wiki page tells you in general how to make a DIH file: http://wiki.apache.org/solr/DataImportHandler From this, you should be able to make a DIH file that puts log files in as separate documents. As for splitting files up into mini-documents, you might have to write a bit of Javascript to achieve this. There is no data structure or software that implements structured documents.

On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam ps...@mac.com wrote: [...]
-Peter - 4GB RAM server % java -Xms2048M -Xmx3072M -jar start.jar - schema.xml changes: fieldType name=text_pl class=solr.TextField analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.WordDelimiterFilterFactory generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=0/ /analyzer /fieldType ... field name=body type=text_pl indexed=true stored=true multiValued=false termVectors=true termPositions=true termOffsets=true / field name=timestamp type=date indexed=true stored=true default=NOW multiValued=false/ field name=version type=string indexed=true stored=true multiValued=false/ field name=device type=string indexed=true stored=true multiValued=false/ field name=filename type=string indexed=true stored=true multiValued=false/ field name=filesize type=long indexed=true stored=true multiValued=false/ field name=pversion type=int indexed=true stored=true multiValued=false/ field name=first2md5 type=string indexed=false stored=true multiValued=false/ field name=ckey type=string indexed=true stored=true multiValued=false/ ... dynamicField name=* type=ignored multiValued=true / defaultSearchFieldbody/defaultSearchField solrQueryParser defaultOperator
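A minimal sketch of the DIH approach Lance describes, for anyone looking for a starting point. This assumes the FileListEntityProcessor and LineEntityProcessor that ship with the DataImportHandler; the directory, file pattern, and field names here are illustrative, not from the thread:

    <!-- data-config.xml - a sketch, not a tested configuration -->
    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8" />
      <document>
        <!-- Outer entity walks the log directory; rootEntity="false" makes the
             inner, per-line entity produce the actual Solr documents. -->
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="/var/logs" fileName=".*\.txt" recursive="true"
                rootEntity="false">
          <!-- The file name becomes the common field shared by all chunks. -->
          <field column="file" name="filename" />
          <!-- LineEntityProcessor emits one row per line of the file,
               exposing the text in the implicit rawLine column. -->
          <entity name="lines" processor="LineEntityProcessor"
                  url="${files.fileAbsolutePath}">
            <field column="rawLine" name="body" />
          </entity>
        </entity>
      </document>
    </dataConfig>

As written this produces one document per line; grouping N lines per document, and generating a unique id per chunk, would still need a custom transformer - a ScriptTransformer is the "bit of JavaScript" route Lance alludes to.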
Re: Solr searching performance issues, using large documents
Thanks for the pointer, Lance! Is there an example of this somewhere? -Peter

On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:

Ah! You're not just highlighting, you're snippetizing. That makes it easier. Highlighting does not stream - it pulls the entire stored contents into one string and then pulls out the snippet. If you want this to be fast, you have to split the text up into small pieces and only snippetize from the most relevant text. So: separate documents, with a common group id for the document they came from. You might have to do two queries to achieve what you want, but the second query for the same terms will be blindingly fast - often 1ms. Good luck! Lance

[older quoted messages and the original post's configuration snipped; they appear in the entries below]
Re: Solr searching performance issues, using large documents
On Jul 30, 2010, at 7:04 PM, Lance Norskog wrote: Wait - how much text are you highlighting? You say these logfiles are X big - how big are the actual documents you are storing?

I want it to be like Google - I put the entire (sometimes 60MB) document in a field, and then just highlight 2-4 lines of it. Thanks, Peter

[older quoted messages and the original post's configuration snipped; they appear in the entries below]
Re: Solr searching performance issues, using large documents
On Jul 30, 2010, at 1:16 PM, Peter Karich wrote: did you already try other values for hl.maxAnalyzedChars=2147483647?

Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for Java).

Also, regular-expression highlighting is more expensive, I think. What does the 'fuzzy' variable mean? If you use it to query via ~someTerm instead of someTerm, then you should try the trunk of Solr, which is a lot faster for fuzzy and other wildcard searches.

fuzzy could be set to * but isn't right now. Thanks for the tips, Peter - this has been very frustrating! - Peter

[quoted problem statement and configuration snipped; see the original post below]
Re: Solr searching performance issues, using large documents
Correction - it went from 17 seconds to 10 seconds - I was changing hl.regex.maxAnalyzedChars the first time. Thanks! -Peter

[older quoted messages and configuration snipped; see the entries above and the original post below]
Re: Solr searching performance issues, using large documents
However, I do need to search the entire document, or else the highlighting will sometimes be blank :-( Thanks! - Peter

ps. sorry for the many responses - I'm rushing around trying to get this working.

[older quoted messages and configuration snipped; see the entries above and the original post below]
Re: Solr searching performance issues, using large documents
I do store term vectors:

<field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />

-Pete

On Jul 30, 2010, at 7:30 AM, Li Li wrote:

Highlighting's time is mainly spent on getting the field you want to highlight and tokenizing that field (if you don't store the term vector). You can check what's wrong.

2010/7/30 Peter Spam ps...@mac.com:

If I don't do highlighting, it's really fast. Optimize has no effect. -Peter

On Jul 29, 2010, at 11:54 AM, dc tech wrote:

Are you storing the entire log file text in SOLR? That's almost 3GB of text that you are storing. Try to: 1) Is this first-time performance, or on repeat queries with the same fields? 2) Optimize the index and test performance again. 3) Index without storing the text and see what the performance looks like.

On 7/29/10, Peter Spam ps...@mac.com wrote:

Any ideas? I've got 5000 documents with an average size of 850k each, and it sometimes takes 2 minutes for a query to come back when highlighting is turned on! Help! -Pete

On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:

From the mailing list archive, Koji wrote:

1. Provide another field for highlighting and use copyField to copy plainText to the highlighting field.

and Lance wrote: http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html

If you want to highlight field X, doing the termOffsets/termPositions/termVectors will make highlighting that field faster. You should make a separate field and apply these options to that field. Now: doing a copyField adds a value to a multiValued field. For a text field, you get a multi-valued text field. You should only copy one value to the highlighted field, so just copyField the document to your special field. To enforce this, I would add multiValued="false" to that field, just to avoid mistakes. So, all_text should be indexed without the term* attributes, and should not be stored. Then your document is stored in a separate field that you use for highlighting and that has the term* attributes.

I've been experimenting with this, and here's what I've tried:

<field name="body" type="text_pl" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="body_all" type="text_pl" indexed="false" stored="true" multiValued="true" />
<copyField source="body" dest="body_all"/>

... but it's still very slow (10+ seconds). Why is it better to have two fields (one indexed but not stored, and the other not indexed but stored) rather than just one field that's both indexed and stored?

From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors : "If you aren't always using all the stored fields, then enabling lazy field loading can be a huge boon, especially if compressed fields are used." What does this mean? How do you load a field lazily?

Thanks for your time, guys - this has started to become frustrating, since it works so well, but is very slow! -Pete

[original problem statement and configuration snipped; see the original post below]
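To make the direction of that advice concrete, here is a sketch - not from the thread; the field names all_text/body_hl and the copyField direction are my reading of Koji's and Lance's descriptions. Note that the term* options only take effect on indexed fields, which may be why Peter's body_all variant above (stored, but indexed="false" and without termVectors) still forces the highlighter to re-analyze the stored text - exactly the cost Li Li describes:

    <!-- Search field: indexed, not stored, no term vectors. -->
    <field name="all_text" type="text_pl" indexed="true" stored="false" multiValued="false" />

    <!-- Highlight field: stored AND indexed, with the term* options, and
         multiValued="false" as Lance suggests. The client sends the
         document text to this field once. -->
    <field name="body_hl" type="text_pl" indexed="true" stored="true" multiValued="false"
           termVectors="true" termPositions="true" termOffsets="true" />

    <copyField source="body_hl" dest="all_text" />

Queries would then search all_text (e.g. q=all_text:foo) and highlight with hl.fl=body_hl. On the lazy-loading question: it is a solrconfig.xml switch, <enableLazyFieldLoading>true</enableLazyFieldLoading> in the <query> section; with it enabled, stored fields not requested via fl are only read from disk if actually accessed.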
Re: Solr searching performance issues, using large documents
Any ideas? I've got 5000 documents with an average size of 850k each, and it sometimes takes 2 minutes for a query to come back when highlighting is turned on! Help! -Pete

[The remainder of this message is quoted in full in the entry above; snipped.]
Re: Solr searching performance issues, using large documents
If I don't do highlighting, it's really fast. Optimize has no effect. -Peter

[The remainder quotes dc tech's questions and the copyField discussion, both quoted in full in the entries above; snipped.]
Re: Solr searching performance issues, using large documents
From the mailing list archive, Koji wrote: 1. Provide another field for highlighting and use copyField to copy plainText to the highlighting field. [...]

[The rest of this message - Peter's Jul 21 copyField experiment - is quoted in full in the entries above; snipped.]
Count hits per document?
If I search for foo, I get back a list of documents. Any way to get a per-document hit count? Thanks! -Pete
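One avenue worth exploring here (a suggestion, not something tried in the thread): the TermVectorComponent that these queries already invoke via qt=tvrh can return per-document term frequencies, which for a single-term query is effectively a per-document hit count. Something like:

    http://localhost:8983/solr/select?q=body:foo&qt=tvrh&tv=true&tv.tf=true&fl=id

tv.tf asks the component to include the term frequency of each term in each matching document's term vector; this relies on the body field carrying termVectors="true", as it does in the schema quoted elsewhere in this thread.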
Re: Using hl.regex.pattern to print complete lines
Still not working ... any ideas? -Pete

[quoted messages snipped; the exchange appears in the entries below]
Solr searching performance issues, using large documents
Data set: About 4,000 log files (will eventually grow to millions). Average log file is 850k. Largest log file (so far) is about 70MB.

Problem: When I search for common terms, the query time goes from under 2-3 seconds to about 60 seconds. TermVectors etc. are enabled. When I disable highlighting, performance improves a lot, but is still slow for some queries (7 seconds). Thanks in advance for any ideas!

-Peter

- 4GB RAM server
% java -Xms2048M -Xmx3072M -jar start.jar

- schema.xml changes:

<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>
...
<field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="version" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="filename" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="filesize" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="pversion" type="int" indexed="true" stored="true" multiValued="false"/>
<field name="first2md5" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="ckey" type="string" indexed="true" stored="true" multiValued="false"/>
...
<dynamicField name="*" type="ignored" multiValued="true" />
<defaultSearchField>body</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>

- solrconfig.xml changes:

<maxFieldLength>2147483647</maxFieldLength>
<ramBufferSizeMB>128</ramBufferSizeMB>

- The query:

rowStr = "&rows=10"
facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
termvectors = "&tv=true&qt=tvrh&tv.all=true"
hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
regexv = '(?m)^.*\n.*\n.*$'
hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!=])/, '\1') + fuzzy + minLogSizeStr)
thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) + justq + rowStr + facet + fields + termvectors + hl + hl_regex
baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
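For orientation, here is roughly what that Ruby renders to for a plain query of foo (fuzzy and minLogSizeStr empty, no fq; assembled by hand, so treat the exact escaping as illustrative):

    /solr/select?timeAllowed=5000&wt=ruby&q=body%3Afoo&rows=10&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version&fl=id,score,filename,version,device,first2md5,filesize,ckey&tv=true&qt=tvrh&tv.all=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400&hl.regex.pattern=%28%3Fm%29%5E.%2A%5Cn.%2A%5Cn.%2A%24&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647

The two maxAnalyzedChars=2147483647 settings ask the highlighter to analyze the entire stored body of every candidate document, which is consistent with query time growing with document size as described throughout this thread.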
Re: Using hl.regex.pattern to print complete lines
Any other thoughts, Chris? I've been messing with this a bit, and can't seem to get (?m)^.*$ to do what I want. 1) I don't care how many characters it returns; I'd like entire lines, all the time. 2) I just want it to always return 3 lines: the line before, the actual line, and the line after. 3) This should be like grep -C1. Thanks for your time! -Pete

[quoted messages snipped; the exchange appears in the entries below]
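(The regexv = '(?m)^.*\n.*\n.*$' pattern in Peter's later configuration - see the original post above - is exactly this grep -C1 shape: in multiline mode it spans the line before, the matching line, and the line after.)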
Re: Using hl.regex.pattern to print complete lines
Ah, this makes sense. I've changed my regex to (?m)^.*$, and it works better, but I still get fragments before and after some returns. Thanks for the hint! -Pete

On Jul 8, 2010, at 6:27 PM, Chris Hostetter wrote:

: If you can use the latest branch_3x or trunk, hl.fragListBuilder=single
: is available that is for getting entire field contents with search terms
: highlighted. To use it, set hl.useFastVectorHighlighter to true.

He doesn't want the entire field -- his stored field values contain multi-line strings (using newline characters) and he wants to make fragments per line (ie: bounded by newline characters, or the start/end of the entire field value).

Peter: i haven't looked at the code, but i expect that the problem is that the java regex engine isn't being used in a way that makes ^ and $ match any line boundary -- they are probably only matching the start/end of the field (and . is probably only matching non-newline characters). java regexes support embedded flags (ie: (?xyz)your regex) so you might try that (i don't remember what the correct modifier flag is for the multiline mode off the top of my head)

-Hoss
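To fill in the flag Hoss couldn't recall: in Java regexes the embedded flag for multiline mode is (?m) (Pattern.MULTILINE), which makes ^ and $ match at line boundaries; (?s) (Pattern.DOTALL) is the separate flag that lets . match newlines. A small standalone check, independent of Solr:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MultilineDemo {
        public static void main(String[] args) {
            String text = "line one\nline two\nline three";

            // Without (?m), ^ and $ only anchor to the start/end of the
            // whole input, so ^.*$ finds nothing in a multi-line string.
            System.out.println(Pattern.compile("^.*$").matcher(text).find());   // false

            // With (?m), each line matches separately - the behavior the
            // regex fragmenter needs to produce line-bounded fragments.
            Matcher m = Pattern.compile("(?m)^.*$").matcher(text);
            while (m.find()) {
                System.out.println("fragment: [" + m.group() + "]");
            }
        }
    }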
Re: Using hl.regex.pattern to print complete lines
To clarify, I never want a snippet; I always want a whole line returned. Is this possible? Thanks! -Pete

[original question quoted below in "Using hl.regex.pattern to print complete lines"; snipped]
Re: Using hl.regex.pattern to print complete lines
Thanks for the note, Koji. However, hl.fragsize=0 seems to return the entire document, rather than just one single line. Here's what I tried (what I previously had was commented out):

regexv = '^.*$'
thequery = '/solr/select?facet=true&facet.limit=10&fl=id,score,filename&tv=true&timeAllowed=3000&facet.field=filename&qt=tvrh&wt=ruby' + (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) + '&q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + "&hl=true&hl.snippets=1&hl.fragsize=0" # + "&hl.regex.slop=.8&hl.fragsize=200&hl.fragmenter=regex&hl.regex.pattern=" + CGI::escape(regexv)

Thanks for your help. -Peter

On Jul 8, 2010, at 3:47 PM, Koji Sekiguchi wrote:

(10/07/09 2:44), Peter Spam wrote:
To clarify, I never want a snippet, I always want a whole line returned. Is this possible? Thanks! -Pete

Hello Pete, Use NullFragmenter. It can be used via GapFragmenter with hl.fragsize=0. Koji -- http://www.rondhuit.com/en/
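Pulling the two configurations in this exchange side by side may help later readers:

    hl.fragmenter=gap + hl.fragsize=0
        -> NullFragmenter behavior, per Koji: the whole stored field comes back as a single fragment
    hl.fragmenter=regex + hl.regex.pattern=(?m)^.*$ + a nonzero hl.fragsize
        -> line-bounded fragments, once the (?m) multiline flag is in place

So hl.fragsize=0 means "never fragment", not "one line": the whole-document result Peter sees is the documented behavior, and the per-line output has to come from the regex fragmenter with a multiline pattern.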
Using hl.regex.pattern to print complete lines
Hi, I have a text file broken apart by carriage returns, and I'd like to only return entire lines. So, I'm trying to use this:

hl.fragmenter=regex
hl.regex.pattern=^.*$

... but I still get fragments, even if I crank up the hl.regex.slop to 3 or so. I also tried a pattern of \n.*\n, which seems to work better, but still isn't right. Any ideas? -Pete
Re: Very basic questions: Faceted front-end?
Wow, thanks Lance - it's really fast now! The last piece of the puzzle is setting up a nice front-end. Are there any pre-built front-ends available that mimic Google (for example), with facets? -Peter

On Jun 29, 2010, at 9:04 PM, Lance Norskog wrote:

To highlight a field, Solr needs some extra Lucene values. If these are not configured for the field in the schema, Solr has to re-analyze the field to highlight it. If you want faster highlighting, you have to add term vectors to the schema. Here is the grand map of such things: http://wiki.apache.org/solr/FieldOptionsByUseCase

On Tue, Jun 29, 2010 at 6:29 PM, Erick Erickson erickerick...@gmail.com wrote:

What are your actual highlighting requirements? You could try things like maxAnalyzedChars, requireFieldMatch, etc. http://wiki.apache.org/solr/HighlightingParameters has a good list, but you've probably already seen that page. Best, Erick

[older quoted messages snipped; they appear in the entries below]
Re: Very basic questions: Faceted front-end?
Ah, I found this: https://issues.apache.org/jira/browse/SOLR-634 ... aka solr-ui. Is there anything else along these lines? Thanks! -Peter

[The remainder quotes the previous message and its thread verbatim; snipped.]
Re: Very basic questions: Indexing text - working, but slow!
Thanks for everyone's help - I have this working now, but sometimes the queries are incredibly slow!! For example, <int name="QTime">461360</int>. Also, I had to bump up the min/max RAM size to 1GB/3.5GB for things to inject without throwing heap memory errors. However, my data set is very small! 36 text files, for a total of 113MB. (It will grow to many TB, but for now, this is a test.) The largest file is 34MB. Therefore, I'm sure I'm doing something wrong :-) Here's my config:

--- For the schema.xml, types is all default. For fields, here are the only lines that aren't commented out:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="build" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="*" type="ignored" multiValued="true" />

... then, for the rest:

<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>body</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="AND"/>

--- Invoking:

java -Xmx3584M -Xms1024M -jar start.jar

--- Injecting:

#!/bin/sh
J=0
for i in `find . -name \*.txt`; do
  (( J++ ))
  curl "http://localhost:8983/solr/update/extract?literal.id=doc$J&fmap.content=body" -F myfi...@$i
done
echo - Committing
curl "http://localhost:8983/solr/update/extract?commit=true"

--- Searching:

http://localhost:8983/solr/select?q=testing&hl=true&fl=id,score&hl.snippets=5&hl.mergeContiguous=true

-Pete

On Jun 28, 2010, at 5:22 PM, Erick Erickson wrote:

Try adding hl.fl=text to specify your highlight field. I don't understand why you're only getting the ID field back, though. Do note that the highlighting is after the docs, related by the ID. Try a (non-highlighting) query of just * to verify that you're pointing at the index you think you are. It's possible that you've modified a different index with SolrJ than the one your web server is pointing at. Also, SOLR has no way of knowing you've modified your index with SolrJ, so it may not be automatically reopening an IndexReader, and your recent changes may not be visible until you force the SOLR reader to reopen. HTH, Erick

On Mon, Jun 28, 2010 at 6:49 PM, Peter Spam ps...@mac.com wrote:

On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term.

Solr can generate Google-like snippets as you describe: http://wiki.apache.org/solr/HighlightingParameters

Here's how I commit my documents:

J=0
for i in `find . -name \*.txt`; do
  (( J++ ))
  curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F myfi...@$i
done
echo - Committing
curl "http://localhost:8983/solr/update/extract?commit=true"

Then, I try to query using http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing but I only get back the document ID rather than the snippet:

<doc>
  <float name="score">0.05030759</float>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc16</str>
</doc>

I'm using the schema.xml from the Lucid Imagination "Indexing text and html files" tutorial. -Pete
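A note on Erick's hl.fl suggestion: in this particular schema the extracted content is mapped into body (via fmap.content=body in the injection script), so the highlight field to name is presumably body rather than text. A corrected search URL, assembled from the parameters already used in this message:

    http://localhost:8983/solr/select?q=testing&hl=true&hl.fl=body&fl=id,score&hl.snippets=5&hl.mergeContiguous=true

hl.fl must point at a stored field, which body is here.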
Re: Very basic questions: Indexing text - working, but slow!
To follow up, I've found that my queries are very fast (even with fq=), until I add hl=true. What can I do to speed up highlighting? Should I consider injecting a line at a time, rather than the entire file as a single field?

-Pete

On Jun 29, 2010, at 11:07 AM, Peter Spam wrote:

Thanks for everyone's help - I have this working now, but sometimes the queries are incredibly slow!! For example, <int name="QTime">461360</int>. Also, I had to bump up the min/max RAM size to 1GB/3.5GB for things to inject without throwing heap memory errors. However, my data set is very small! 36 text files, for a total of 113MB. (It will grow to many TB, but for now, this is a test.) The largest file is 34MB. Therefore, I'm sure I'm doing something wrong :-)

Here's my config:

--- For the schema.xml, types is all default. For fields, here are the only lines that aren't commented out:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="build" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="*" type="ignored" multiValued="true" />

... then, for the rest:

<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>body</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="AND"/>

--- Invoking:

java -Xmx3584M -Xms1024M -jar start.jar

--- Injecting:

#!/bin/sh
J=0
for i in `find . -name \*.txt`; do
  (( J++ ))
  curl "http://localhost:8983/solr/update/extract?literal.id=doc$J&fmap.content=body" -F myfi...@$i
done
echo "- Committing"
curl "http://localhost:8983/solr/update/extract?commit=true"

--- Searching:

http://localhost:8983/solr/select?q=testing&hl=true&fl=id,score&hl.snippets=5&hl.mergeContiguous=true

-Pete

On Jun 28, 2010, at 5:22 PM, Erick Erickson wrote:

try adding hl.fl=text to specify your highlight field. I don't understand why you're only getting the ID field back, though. Do note that the highlighting section comes after the docs, related by the ID. Try a (non-highlighting) query of just * to verify that you're pointing at the index you think you are. It's possible that you've modified a different index with SolrJ than your web server is pointing at. Also, SOLR has no way of knowing you've modified your index with SolrJ, so it may not be automatically reopening an IndexReader, and your recent changes may not be visible until you force the SOLR reader to reopen.

HTH
Erick

On Mon, Jun 28, 2010 at 6:49 PM, Peter Spam ps...@mac.com wrote:

On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term.

Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters

Here's how I commit my documents:

J=0; for i in `find . -name \*.txt`; do (( J++ )); curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F myfi...@$i; done; echo - Committing; curl "http://localhost:8983/solr/update/extract?commit=true"

Then, I try to query using

http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing

but I only get back the document ID rather than the snippet:

<doc>
  <float name="score">0.05030759</float>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc16</str>
</doc>

I'm using the schema.xml from the Lucid Imagination "Indexing text and html files" tutorial.

-Pete
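One knob bears directly on this question: the standard highlighter only inspects a bounded prefix of each field, controlled by hl.maxAnalyzedChars (51200 characters by default in the example solrconfig). Raising it to cover a 34MB field makes highlighting proportionally slower, which is consistent with the QTime above. A hedged example query, with the body field name carried over from the schema in this thread:

http://localhost:8983/solr/select?q=testing&hl=true&hl.fl=body&hl.snippets=5&hl.maxAnalyzedChars=51200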
Very basic questions: Indexing text
Hi everyone, I'm looking for a way to index a bunch of (potentially large) text files. I would love to see results like Google, so I went through a few tutorials, but I've still got questions: 1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term. 2) There are one or two fields at the beginning of the file that I would like to search on, so these should be indexed differently, right? 3) Is there a nice front-end example anywhere? Something that would return results kind of like Google? Thanks for your time - Solr / Lucene seem to be very powerful. -Pete
Re: Very basic questions: Indexing text
Great, thanks for the pointers.

Thanks,
Peter

On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term.

Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters

2) There are one or two fields at the beginning of the file that I would like to search on, so these should be indexed differently, right?

Probably yes.

3) Is there a nice front-end example anywhere? Something that would return results kind of like Google?

http://wiki.apache.org/solr/PublicServers
http://search-lucene.com/
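To make the first pointer concrete, a minimal sketch of a snippet-returning query; hl.snippets and hl.fragsize are standard highlighting parameters, and the body field name is an assumption carried over from the schema discussed elsewhere in this thread:

http://localhost:8983/solr/select?q=error&hl=true&hl.fl=body&hl.snippets=2&hl.fragsize=120&fl=id,score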
Re: Very basic questions: Indexing text
On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term.

Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters

Here's how I commit my documents:

J=0; for i in `find . -name \*.txt`; do (( J++ )); curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F myfi...@$i; done; echo - Committing; curl "http://localhost:8983/solr/update/extract?commit=true"

Then, I try to query using

http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing

but I only get back the document ID rather than the snippet:

<doc>
  <float name="score">0.05030759</float>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc16</str>
</doc>

I'm using the schema.xml from the Lucid Imagination "Indexing text and html files" tutorial.

-Pete
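A note for readers who hit the same symptom: with hl=true but no hl.fl, Solr highlights only the default search field, and snippets come back in a separate <lst name="highlighting"> section keyed by the unique id, not inside each <doc>. A hedged sketch of a query that should surface snippets, assuming the extracted text was mapped to a stored body field (as the fmap.content=body variant of the injection script elsewhere in this thread does):

http://localhost:8983/solr/select?q=testing&rows=10&fl=id,score&hl=true&hl.fl=body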