Re: Is there any stress test tool for testing Solr?

2010-08-29 Thread 朱炎詹
Thanks to both Gora & Amit. A little information for people following this
discussion: I found there's a SolrMeter open-source project on Google Code
(http://code.google.com/p/solrmeter/), built specifically for load testing
Solr.


I'll evaluate the following tools & pick one for my testing:

WebStress
Apache Bench
JMeter
SolrMeter

Oh, and a correction to some wrong information in my post: we're building a
12-million-document newspaper index, rather than 1.2 million.


Scott
- Original Message - 
From: "Gora Mohanty" 

To: 
Sent: Friday, August 27, 2010 2:22 AM
Subject: Re: Is there any stress test tool for testing Solr?



On Wed, 25 Aug 2010 19:58:36 -0700
Amit Nithian  wrote:


I recommend JMeter. We use that to do load testing on a search
server.

[...]

JMeter is certainly good, but we have also found Apache bench
to be of much use. Maybe it is just us, and what we are
familiar with, but Apache bench seemed easier to automate, and
much easier to get up and running with, at least IMHO.
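For reference, a typical invocation (URL and query here are just
placeholders) looks something like:

  ab -n 1000 -c 10 "http://localhost:8983/solr/select?q=keyword"

i.e. 1000 requests at a concurrency of 10, with latency percentiles
reported at the end.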


Be careful though.. as silly as this may sound.. do NOT just
issue random queries because that won't exercise your caches...

[...]

Conversely, we are still trying to figure out how to make real-life
measurements, without having the Solr cache coming into the picture.
For querying on a known keyword, every hit after the first, with
Apache bench, is strongly affected by the Solr cache. We tried using
random strings, but at least with Apache bench, the query string is
fixed for each invocation of Apache bench. Have to investigate
whether one can do otherwise with JMeter plugins. Also, a query
that returns no result (as a random query string typically would)
seems to be significantly faster than a real query. So, I think that
in the long run, the best way is to build information about
*typical* queries that your users run, using the Solr logs, and
then use a set of such queries for benchmarking.
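As a rough sketch of that log mining (log location and format assumed,
and untested):

  grep -o 'q=[^& ]*' /var/log/solr/access.log | \
    sort | uniq -c | sort -rn | head -100

which lists the hundred most frequent q= parameters, ready to replay.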

Regards,
Gora












Replication in 1.4: "Replicate Now" works, but scheduled replication does not

2010-08-29 Thread Leanid

Hello,
I am upgrading from 1.3 to 1.4 and setting up the new replication method.

On the master I added this section:


<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On the slave:



<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8085/solr/replication</str>
    <str name="pollInterval">00:15:00</str>
  </lst>
</requestHandler>

They are on the same server on different ports; this is a QA environment.
The master is on 8085, the slave on 8086.
The master is started with -Denable.master=true -Denable.slave=false,
and the slave with -Denable.master=false -Denable.slave=true.
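The handler sections are gated by enable attributes tied to those flags,
roughly like this (simplified sketch, not the exact config):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master" enable="${enable.master:false}">
      <str name="replicateAfter">commit</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
    <lst name="slave" enable="${enable.slave:false}">
      <str name="masterUrl">http://localhost:8085/solr/replication</str>
      <str name="pollInterval">00:15:00</str>
    </lst>
  </requestHandler>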


The master admin page has a Replication link, and the master's details
response looks fine. The slave does not have a Replication link on its
admin page, but when I go there directly and click Replicate Now, it does
work. However, it fails when it tries to run on schedule (every 15
minutes). Also, the slave replication admin page does not show the status
of the master.
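As far as I can tell, Replicate Now just issues a fetchindex command
against the slave, i.e. something like:

  http://devslave:8086/solr/replication?command=fetchindex

and that one-shot path works; only the scheduled polling fails.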


This is the detailed info from the slave:

If you check the schedule, I have only one successful replication, which I
ran manually at Sun Aug 29 17:52:17 EDT 2010. The others run on the
15-minute schedule and are all failing.


http://devslave:8086/solr/replication?command=details

(The XML tags were stripped by the list archive; the recoverable values
from the response are summarized below.)

responseHeader: status 0, QTime 11

Slave index: 504.02 MB at /usr/local/solr/solr_home_qa_slave/data/index
  isMaster false, isSlave true, indexVersion 1283100903442, generation 2

Master details: 504.02 MB at /usr/local/solr/solr_home_qa/data/index
  indexVersion 1283100903442, generation 2
  files: _0.nrm, _0.tis, _0.fnm, _0.tii, _0.frq, segments_2, _0.fdx,
  _0.prx, _0.fdt

masterUrl: http://SOLRMASTERDEV:8085/solr/replication
pollInterval: 00:15:00
Next execution at: Sun Aug 29 18:30:00 EDT 2010

Replication attempts (17:52:17 is the one manual success):
  Sun Aug 29 18:30:00, 17:52:17, 17:45:00, 17:30:00, 17:15:00, 17:00:00,
  16:51:26, 16:45:00, 16:41:34, 16:30:00 (all EDT 2010)

Failed attempts:
  Sun Aug 29 18:30:00, 17:45:00, 17:30:00, 17:15:00, 17:00:00, 16:51:26,
  16:45:00, 16:41:34, 16:30:00, 16:15:00 (all EDT 2010)

Counters as reported: 13 / 0 / 12; last failure at Sun Aug 29 18:30:00
EDT 2010; trailing values: 0, false, false

This response format is experimental. It is likely to change in the future.







Re: Updating document without removing fields

2010-08-29 Thread Lance Norskog
No. Document creation is all-or-nothing, fields are not updateable.
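To make that concrete, here is roughly what every update has to look
like, with *all* fields re-sent each time (field names are made up):

  <add>
    <doc>
      <field name="id">doc-1</field>
      <!-- existing fields must be re-sent or they are lost -->
      <field name="title">original title</field>
      <!-- plus whatever new fields the updater is adding -->
      <field name="score_from_slave_a">0.87</field>
    </doc>
  </add>

Posting this replaces the stored document wholesale.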

I think you have to filter all of your field changes through a "join"
server. That is, all field updates could go to a database, and the master
would read document updates from that database. Or, you could have one
updater feed updates to the other, which then sends all updates to the
master.

Lance

On Sun, Aug 29, 2010 at 6:19 PM, Max Lynch  wrote:
> Hi,
> I have a master solr server and two slaves.  On each of the slaves I have
> programs running that read the slave index, do some processing on each
> document, add a few new fields, and commit the changes back to the master.
>
> The problem I'm running into right now is one slave will update one document
> and the other slave will eventually update the same document, but the
> changes will overwrite each other.  For example, one slave will add a field
> and commit the document, but the other slave won't have that field yet so it
> won't duplicate the document when it updates the doc with its own new
> field.  This causes the document to miss one set of fields from one of the
> slaves.
>
> Can I update a document without having to recreate it?  Is there a way to
> update the slave and then have the slave commit the changes to the master
> (adding new fields in the process?)
>
> Thanks.
>



-- 
Lance Norskog
goks...@gmail.com


Updating document without removing fields

2010-08-29 Thread Max Lynch
Hi,
I have a master solr server and two slaves.  On each of the slaves I have
programs running that read the slave index, do some processing on each
document, add a few new fields, and commit the changes back to the master.

The problem I'm running into right now is one slave will update one document
and the other slave will eventually update the same document, but the
changes will overwrite each other.  For example, one slave will add a field
and commit the document, but the other slave won't have that field yet so it
won't duplicate the document when it updates the doc with its own new
field.  This causes the document to miss one set of fields from one of the
slaves.

Can I update a document without having to recreate it?  Is there a way to
update the slave and then have the slave commit the changes to the master
(adding new fields in the process?)

Thanks.


Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Erick Erickson
There's nothing built into SOLR that I know of that'll deal with
auto-detecting multiple languages and "doing the right thing". I know
there's been discussion of that; searching the users' list might help.
You may have to write your own analyzer that tries to do this, but I
have no clue how you'd go about it.

<<It seems that charFilters are applied even before the tokenizer>>
Try putting this after any instances of, say, WhitespaceTokenizerFactory
in your analyzer definition, and I believe you'll see that this is not
true. At least, looking at this in the analysis page from the SOLR admin
sure doesn't seem to support that assertion.

This last doesn't help much with the different character sets, though.

I'll have to leave any other insights to wiser heads than mine.

Best
Erick

On Sun, Aug 29, 2010 at 12:44 PM, Shawn Heisey  wrote:

>  Thank you for taking the time to help.  The way I've got the word
> delimiter index filter set up with only one pass, "wolf-biederman" will
> result in wolf, biederman, wolfbiederman, and wolf-biederman.  With two
> passes, the last one is not present.  One pass changes "gremlin's" to
> gremlin and gremlin's.  Two passes results in gremlin and gremlins.
>
> I was trying to use the PatternReplaceCharFilterFactory to strip leading
> and trailing punctuation, but it didn't work.  It seems that charFilters are
> applied even before the tokenizer, which will not produce the results I
> want, and the filter I'd come up with was eating everything, producing no
> results.  I later realized that it would not work with radically different
> character sets like Arabic and Cyrillic, even if I solved those problems.
>  Is there a regular filter that could strip leading/trailing punctuation?
>
> As for stemming, we have no effective way to separate the languages.  Most
> of the content is English, but we also have Spanish, Arabic, Russian,
> German, French, and possibly a few others.  For that reason, I'm not using
> stemming.  I've been thinking that I might want to use an English stemmer
> anyway to improve results on most of the content, but I haven't done any
> testing yet.
>
> Thanks,
> Shawn
>
>
>
> On 8/29/2010 12:28 PM, Erick Erickson wrote:
>
>> Look at the tokenizer/filter chain that makes up your analyzers, and see:
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>> for other tokenizer/analyzer/filter options.
>>
>> You're on the right track looking at the various choices provided, and
>> I suspect you'll find what you need...
>>
>> Be a little cautious about preserving things. Your users will often be
>> more
>> confused than helped if you require hyphens for a match. Ditto with
>> possessives, plurals, etc. You might want to look at stemmers
>>
>
>


Re: Search Results optimization

2010-08-29 Thread Hasnain

Also, my request handler looks like this:



<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">name^2.4</str>
    <float name="tie">0.1</float>
  </lst>
</requestHandler>


I really need some help on this. Again, what I want is: if I search for
"swingline red stapler", docs that have all three keywords should come on
top, then docs that have any two keywords, and then docs with one keyword;
that is the sort order I mean.
Thanks


Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Shawn Heisey
 Thank you for taking the time to help.  The way I've got the word 
delimiter index filter set up with only one pass, "wolf-biederman" will 
result in wolf, biederman, wolfbiederman, and wolf-biederman.  With two 
passes, the last one is not present.  One pass changes "gremlin's" to 
gremlin and gremlin's.  Two passes results in gremlin and gremlins.


I was trying to use the PatternReplaceCharFilterFactory to strip leading 
and trailing punctuation, but it didn't work.  It seems that charFilters 
are applied even before the tokenizer, which will not produce the 
results I want, and the filter I'd come up with was eating everything, 
producing no results.  I later realized that it would not work with 
radically different character sets like Arabic and Cyrillic, even if I 
solved those problems.  Is there a regular filter that could strip 
leading/trailing punctuation?


As for stemming, we have no effective way to separate the languages.  
Most of the content is English, but we also have Spanish, Arabic, 
Russian, German, French, and possibly a few others.  For that reason, 
I'm not using stemming.  I've been thinking that I might want to use an 
English stemmer anyway to improve results on most of the content, but I 
haven't done any testing yet.


Thanks,
Shawn


On 8/29/2010 12:28 PM, Erick Erickson wrote:

Look at the tokenizer/filter chain that makes up your analyzers, and see:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

for other tokenizer/analyzer/filter options.

You're on the right track looking at the various choices provided, and
I suspect you'll find what you need...

Be a little cautious about preserving things. Your users will often be more
confused than helped if you require hyphens for a match. Ditto with
possessives, plurals, etc. You might want to look at stemmers




Re: ExternalFileField best practices

2010-08-29 Thread simon
The extended dismax parser (see SOLR-1553) may do what you are looking for.

From its feature list:

'Supports the "boost" parameter.. like the dismax bf param, but multiplies
the function query instead of adding it in'
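In practice that means a query along these lines (defType name as in the
patch; "popularity" being the external field discussed below):

  http://localhost:8983/solr/select?defType=edismax&q=foo&boost=log(popularity)

which multiplies each document's score by log(popularity) instead of
adding it in.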

On Sun, Aug 29, 2010 at 12:27 AM, Andy wrote:

> But isn't it the case that bf adds the boost value while {!boost}
> multiplies the boost value? In my case I think a multiplication is more
> appropriate.
>
> So there's no way to use ExternalFileField in {!boost}?
>
> --- On Sat, 8/28/10, Lance Norskog wrote:
>
> > From: Lance Norskog
> > Subject: Re: ExternalFileField best practices
> > To: solr-user@lucene.apache.org
> > Date: Saturday, August 28, 2010, 11:55 PM
> > You want the boost function bf= parameter.
> >
> > On Sat, Aug 28, 2010 at 5:32 PM, Andy wrote:
> > > Lance,
> > >
> > > Thanks for the response.
> > >
> > > Can I use an ExternalFileField as an input to a boost query?
> > >
> > > For example, if I put the field "popularity" in an ExternalFileField,
> > > can I still use "popularity" in a boosted query such as:
> > >
> > > {!boost b=log(popularity)}foo
> > >
> > > The doc says ExternalFileField can only be used in FunctionQuery.
> > > Does that include a boost query like {!boost b=log(popularity)}?
> > >
> > > --- On Sat, 8/28/10, Lance Norskog wrote:
> > >
> > >> From: Lance Norskog
> > >> Subject: Re: ExternalFileField best practices
> > >> To: solr-user@lucene.apache.org
> > >> Date: Saturday, August 28, 2010, 5:16 PM
> > >> The file is completely reloaded when you commit or optimize. There
> > >> is no incremental update available. And, yes, this could be a
> > >> scaling problem.
> > >>
> > >> How you update it is completely external to Solr.
> > >>
> > >> On Sat, Aug 28, 2010 at 2:50 AM, Andy wrote:
> > >> > I'm interested in using ExternalFileField to store a field
> > >> > "popularity" that is being updated frequently.
> > >> >
> > >> > However ExternalFileField seems to be a pretty obscure feature.
> > >> > Have a few questions:
> > >> >
> > >> > 1) Can anyone share your experience using it?
> > >> >
> > >> > 2) What is the most efficient way to update the external file?
> > >> > For example, the file could look like:
> > >> >
> > >> > 1=12  // the document with uniqueKey 1 has a popularity of 12
> > >> > 2=4
> > >> > 3=45
> > >> > 5=78
> > >> >
> > >> > Now the popularity of document 1 is updated to 13:
> > >> >
> > >> > - What is the best way to update the file to reflect the change?
> > >> >   Isn't this an O(n) operation?
> > >> > - How to deal with concurrent updates to the file by multiple
> > >> >   threads?
> > >> >
> > >> > Would this method of using an external file scale?
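P.S. For reference, a minimal sketch of declaring such a field in
schema.xml (type and field names assumed, from memory of the
ExternalFileField docs):

  <fieldType name="file" class="solr.ExternalFileField"
             keyField="id" defVal="0" stored="false" indexed="false"
             valType="float"/>
  <field name="popularity" type="file"/>

The values then live in a file named external_popularity in the data
directory, one id=value pair per line as in the example quoted above.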


Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Erick Erickson
Look at the tokenizer/filter chain that makes up your analyzers, and see:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

for other tokenizer/analyzer/filter options.

You're on the right track looking at the various choices provided, and
I suspect you'll find what you need...

Be a little cautious about preserving things. Your users will often be more
confused than helped if you require hyphens for a match. Ditto with
possessives, plurals, etc. You might want to look at stemmers

Best
Erick

On Sat, Aug 28, 2010 at 6:20 PM, Shawn Heisey  wrote:

>  It's metadata for a collection of 45 million documents that is mostly
> photos, with some videos and text.  The data is imported from a MySQL
> database and split among six large shards (each nearly 13GB) and a small
> shard with data added in the last week.  That works out to between 300,000
> and 500,000 documents.
>
> I am mostly trying to think of ways to drastically reduce the index size
> without reducing the functionality.  Using copyField would just make it
> larger.
>
> I would like to make it so that I don't have two terms when there's a
> punctuation character at the beginning or end of a word.  For instance, one
> field value that I just analyzed ends up with terms like the following,
> which are unneeded duplicates:
>
>
> championship.
> championship
> '04
> 04
> wisconsin.
> wisconsin
>
> Since I was already toying around, I just tested the whole notion.  I ran
> it through once with just generateWordParts and catenateWords enabled, then
> again with all the options including preserveOriginal enabled.  A test
> analysis of input with 59 whitespace separated words showed 93 terms with
> the single filter and 77 with two.  The only drop in term quality that I
> noticed was that possessive words (apostrophe-s) no longer have the original
> preserved.  I haven't yet decided whether that's a problem.
>
>
> Shawn
>
>
> On 8/27/2010 11:00 AM, Erick Erickson wrote:
>
>> I agree with Marcus, the usefulness of passing through WDF twice
>> is suspect. You can always do a copyfield to a completely different
>> field and do whatever you want there, copyfield forks the raw input
>> to the second field, not the analyzed stream...
>>
>> What is it you're really trying to accomplish? Your use-case would
>> help us help you.
>>
>> About defining things differently in index and analysis. Sure, it can
>> make sense. But, especially with WDF it's tricky. Spend some
>> significant time in the admin analysis page looking at the effects
>> of various configurations before you decide.
>>
>> Best
>> Erick
>>
>
>


Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Shawn Heisey

 On 8/28/2010 7:59 PM, Shawn Heisey wrote:
The only drop in term quality that I noticed was that possessive words 
(apostrophe-s) no longer have the original preserved.  I haven't yet 
decided whether that's a problem.


I finally did notice another drop in term quality from the dual pass:
words with punctuation in the middle (like wolf-biederman) are not
preserved with that punctuation intact. I need a different filter to
strip non-alphanumerics from the beginning and end of terms, one that
runs after the tokenizer and the ASCII folding filter but before the
word delimiter filter. Does such a thing already exist, or do I just
need to use something that does regex? Are there any recommended regex
patterns out there for this?
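One candidate I haven't tried is the token-level
solr.PatternReplaceFilterFactory, placed after the ASCII folding filter
and before the word delimiter filter; a sketch (regex untested):

  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^\p{Punct}+|\p{Punct}+$"
          replacement="" replace="all"/>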


Thanks,
Shawn



anybody using solr with Cassandra?

2010-08-29 Thread Siju George
Hi,

Is anybody using Solr with Cassandra?
Are there any gotchas?

Thanks

--Siju


Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Shawn Heisey
 It's metadata for a collection of 45 million documents that is mostly 
photos, with some videos and text.  The data is imported from a MySQL 
database and split among six large shards (each nearly 13GB) and a small 
shard with data added in the last week.  That works out to between 
300,000 and 500,000 documents.


I am mostly trying to think of ways to drastically reduce the index size 
without reducing the functionality.  Using copyField would just make it 
larger.


I would like to make it so that I don't have two terms when there's a 
punctuation character at the beginning or end of a word.  For instance, 
one field value that I just analyzed ends up with terms like the 
following, which are unneeded duplicates:


championship.
championship
'04
04
wisconsin.
wisconsin

Since I was already toying around, I just tested the whole notion.  I 
ran it through once with just generateWordParts and catenateWords 
enabled, then again with all the options including preserveOriginal 
enabled.  A test analysis of input with 59 whitespace separated words 
showed 93 terms with the single filter and 77 with two.  The only drop 
in term quality that I noticed was that possessive words (apostrophe-s) 
no longer have the original preserved.  I haven't yet decided whether 
that's a problem.
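
The two-pass chain I tested looks roughly like this in the fieldType
(from memory, not the exact config):

  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1"
          splitOnCaseChange="1" preserveOriginal="1"/>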


Shawn


On 8/27/2010 11:00 AM, Erick Erickson wrote:

I agree with Marcus, the usefulness of passing through WDF twice
is suspect. You can always do a copyfield to a completely different
field and do whatever you want there, copyfield forks the raw input
to the second field, not the analyzed stream...

What is it you're really trying to accomplish? Your use-case would
help us help you.

About defining things differently in index and analysis. Sure, it can
make sense. But, especially with WDF it's tricky. Spend some
significant time in the admin analysis page looking at the effects
of various configurations before you decide.

Best
Erick