Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Dennis Gearon
BTW, what is a segment?

I've only heard about them in the last 2 weeks here on the list.
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Sun, 9/12/10, Jason Rutherglen jason.rutherg...@gmail.com wrote:

 From: Jason Rutherglen jason.rutherg...@gmail.com
 Subject: Re: Tuning Solr caches with high commit rates (NRT)
 To: solr-user@lucene.apache.org
 Date: Sunday, September 12, 2010, 7:52 PM
 Yeah there's no patch... I think
 Yonik can write it. :-)  Yah... The
 Lucene version shouldn't matter.  The distributed
 faceting
 theoretically can easily be applied to multiple segments,
 however the
 way it's written is, for me, a challenge to untangle and apply
 successfully to a working patch.  Also I don't have
 this as an itch to
 scratch at the moment.
 
 On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge peter.stu...@gmail.com
 wrote:
  Hi Jason,
 
  I've tried some limited testing with the 4.x trunk
 using fcs, and I
  must say, I really like the idea of per-segment
 faceting.
  I was hoping to see it in 3.x, but I don't see this
 option in the
  branch_3x trunk. Is your SOLR-1606 patch referred to
 in SOLR-1617 the
  one to use with 3.1?
  There seem to be a number of Solr issues tied to this
 - one of them
  being Lucene-1785. Can the per-segment faceting patch
 work with Lucene
  2.9/branch_3x?
 
  Thanks,
  Peter
 
 
 
  On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
  jason.rutherg...@gmail.com
 wrote:
  Peter,
 
  Are you using per-segment faceting, eg, SOLR-1617?
  That could help
  your situation.
 
   On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge peter.stu...@gmail.com wrote:
   [Peter's cache-tuning notes quoted in full - trimmed; the complete notes appear later in this digest.]

What are the differences between SolrCloud and Solr+Hadoop?

2010-09-13 Thread 郭芸
Dear All:
I need Solr for distributed search, and I found there are two choices:
SolrCloud and Solr+Hadoop.
So I want to know: what are the differences between them?
Also, we can download SolrCloud from svn, but how can we get Solr+Hadoop?
Please help me! Thank you!

2010-09-13 



郭芸 


Multiple sorting on text fields

2010-09-13 Thread Stanislaw
Hi all!

I found some strange behavior in Solr. If I sort by two text fields in a
chain, I receive some results duplicated.
Both text fields are not multivalued; one of them is a string, the other a
custom type based on a text field with a keyword analyzer.

I do this:

    CommonsHttpSolrServer server = SolrServer.getInstance().getServer();
    SolrQuery query = new SolrQuery();
    query.setQuery(suchstring);
    // string field - it's only one letter
    query.addSortField("type", SolrQuery.ORDER.asc);
    // text field, not tokenized
    query.addSortField("sortName", SolrQuery.ORDER.asc);

    QueryResponse rsp = server.query(query);

After that I extract the results as a list of Entity objects. Most of them
are unique, but some of them appear twice or even three times in this list.
(Each object has a unique id and exists only once in the index.)
If I sort by only one text field, I receive normal results without
problems.
Where could I have made a mistake, or is it a bug?

Best regards,
Stanislaw


Re: What are the differences between SolrCloud and Solr+Hadoop?

2010-09-13 Thread Marc Sturlese

Well, these are pretty different things. SolrCloud is meant to handle
distributed search in an easier way than raw Solr distributed search.
You have to build the shards in your own way.
Solr+Hadoop is a way to build these shards/indexes in parallel.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/what-differents-between-SolrCloud-and-Solr-Hadoop-tp1463809p1464106.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multiple sorting on text fields

2010-09-13 Thread Dennis Gearon
My guess is two things are happening:
  1/ Your combination of filters is in parallel, or an OR expression. This I 
think for sure & maybe, as seen next.
  2/ To get 3 duplicate results, your custom filter AND the OR expression above 
have to be working together, or it's possible that your custom filter is the 
WHOLE problem, supplying the duplicates and the triplicates.

A first guess & nothing more :-)
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/13/10, Stanislaw solrgeschic...@googlemail.com wrote:

 [Stanislaw's original message quoted - trimmed; see above.]



Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
The balanced segment merging is a really cool idea. I'll definitely
have a look at this, thanks!

One thing I forgot to mention in the original post is we use a
mergeFactor of 25. Somewhat on the high side, so that incoming commits
aren't trying to merge new data into large segments.
25 is a good balance for us between number of files and search
performance. This LinkedIn patch could come in very handy for handling
merges.
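
(For reference, that setting lives in the <indexDefaults>/<mainIndex> section of
solrconfig.xml - a minimal sketch of the value described above, nothing more:)

    <indexDefaults>
      <mergeFactor>25</mergeFactor>
    </indexDefaults>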


On Mon, Sep 13, 2010 at 2:20 AM, Lance Norskog goks...@gmail.com wrote:
 Bravo!

 Other tricks: here is a policy for deciding when to merge segments that
 attempts to balance merging with performance. It was contributed by
 LinkedIn - they also run indexing and search in the same instance (not Solr, a
 different Lucene app).

 lucene/contrib/misc/src/java/org/apache/lucene/index/BalancedSegmentMergePolicy.java

 The optimize command now includes a partial optimize option, so you can do
 larger controlled merges.

 Peter Sturge wrote:

 Hi,

 Below are some notes regarding Solr cache tuning that should prove
 useful for anyone who uses Solr with frequent commits (e.g. <5min).

 Environment:
 Solr 1.4.1 or branch_3x trunk.
 Note the 4.x trunk has lots of neat new features, so the notes here
 are likely less relevant to the 4.x environment.

 Overview:
 Our Solr environment makes extensive use of faceting, we perform
 commits every 30 secs, and the indexes tend to be on the large-ish side
 (20 million docs).
 Note: For our data, when we commit, we are always adding new data,
 never changing existing data.
 This type of environment can be tricky to tune, as Solr is more geared
 toward fast reads than frequent writes.

 Symptoms:
 If anyone has used faceting in searches where you are also performing
 frequent commits, you've likely encountered the dreaded OutOfMemory or
 GC Overhead Exceeded errors.
 In high commit rate environments, this is almost always due to
 multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
 finish autowarming their caches before the next commit()
 comes along and invalidates them.
 Once this starts happening on a regular basis, it is likely your
 Solr's JVM will run out of memory eventually, as the number of
 searchers (and their cache arrays) will keep growing until the JVM
 dies of thirst.
 To check if your Solr environment is suffering from this, turn on INFO
 level logging, and look for: 'PERFORMANCE WARNING: Overlapping
 onDeckSearchers=x'.

 In tests, we've only ever seen this problem when using faceting, and
 facet.method=fc.

 Some solutions to this are:
     Reduce the commit rate to allow searchers to fully warm before the
 next commit
     Reduce or eliminate the autowarming in caches
     Both of the above

 The trouble is, if you're doing NRT commits, you likely have a good
 reason for it, and reducing/eliminating autowarming will very
 significantly impact search performance in high commit rate
 environments.

 Solution:
 Here are some setup steps we've used that allow lots of faceting (we
 typically search with at least 20-35 different facet fields, and date
 faceting/sorting) on large indexes, and still keep decent search
 performance:

 1. Firstly, you should consider using the enum method for facet
 searches (facet.method=enum) unless you've got A LOT of memory on your
 machine. In our tests, this method uses a lot less memory and
 autowarms more quickly than fc. (Note, I've not tried the new
 segment-based 'fcs' option, as I can't find support for it in
 branch_3x - looks nice for 4.x though)
 Admittedly, for our data, enum is not quite as fast for searching as
 fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
 tradeoff.
 If you do have access to LOTS of memory, AND you can guarantee that
 the index won't grow beyond the memory capacity (i.e. you have some
 sort of deletion policy in place), fc can be a lot faster than enum
 when searching with lots of facets across many terms.

 2. Secondly, we've found that LRUCache is faster at autowarming than
 FastLRUCache - in our tests, about 20% faster. Maybe this is just our
 environment - your mileage may vary.

 So, our filterCache section in solrconfig.xml looks like this:
     <filterCache
       class="solr.LRUCache"
       size="3600"
       initialSize="1400"
       autowarmCount="3600"/>

 For a 28GB index, running in a quad-core x64 VMWare instance with 30
 warmed facet fields, Solr is running at ~4GB. The filterCache size on the
 Stats page is usually in the region of ~2400.

 3. It's also a good idea to have some sort of
 firstSearcher/newSearcher event listener queries to allow new data to
 populate the caches.
 Of course, what you put in these is dependent on the facets you need/use.
 We've found a good combination is a firstSearcher with as many facets
 in the search as your environment can handle, then a subset of the
 most common facets for the newSearcher.
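 For illustration, a minimal sketch of such listener entries in the <query>
 section of solrconfig.xml (the warming query and facet field names below are
 placeholders, not our actual ones):

     <listener event="firstSearcher" class="solr.QuerySenderListener">
       <arr name="queries">
         <lst>
           <str name="q">*:*</str>
           <str name="facet">true</str>
           <str name="facet.field">field1</str>
           <str name="facet.field">field2</str>
         </lst>
       </arr>
     </listener>
     <listener event="newSearcher" class="solr.QuerySenderListener">
       <arr name="queries">
         <lst>
           <str name="q">*:*</str>
           <str name="facet">true</str>
           <str name="facet.field">field1</str>
         </lst>
       </arr>
     </listener>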

 4. We also set:
    <useColdSearcher>true</useColdSearcher>
 just in case.

 5. Another key area for search performance with high commit rates is to use
 two Solr instances - one dedicated to writing and a read-only one for
 searching (see the follow-up below).

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
1. You can run multiple Solr instances in separate JVMs, with both
having their solr.xml configured to use the same index folder.
You need to be careful that one and only one of these instances will
ever update the index at a time. The best way to ensure this is to use
one for writing only,
and the other is read-only and never writes to the index. This
read-only instance is the one to use for tuning for high search
performance. Even though the RO instance doesn't write to the index,
it still needs periodic (albeit empty) commits to kick off
autowarming/cache refresh.

Depending on your needs, you might not need to have 2 separate
instances. We need it because the 'write' instance is also doing a lot
of metadata pre-write operations in the same jvm as Solr, and so has
its own memory requirements.

2. We use sharding all the time, and it works just fine with this
scenario, as the RO instance is simply another shard in the pack.


On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich peat...@yahoo.de wrote:
 Peter,

 thanks a lot for your in-depth explanations!
 Your findings will be definitely helpful for my next performance
 improvement tests :-)

 Two questions:

 1. How would I do that:

 or a local read-only instance that reads the same core as the indexing
 instance (for the latter, you'll need something that periodically refreshes 
 - i.e. runs commit()).


 2. Did you try sharding with your current setup (e.g. one big,
 nearly-static index and a tiny write+read index)?

 Regards,
 Peter.

 [Peter's original cache-tuning notes quoted - trimmed; see the full copy earlier in this digest.]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
Hi Erik,

I thought this would be good for the wiki, but I've not submitted to
the wiki before, so I thought I'd put this info out there first, then
add it if it was deemed useful.
If you could let me know the procedure for submitting, it probably
would be worth getting it into the wiki (couldn't do it straightaway,
as I have a lot of projects on at the moment). If you're able/willing
to put it on there for me, that would be very kind of you!

Thanks!
Peter


On Sun, Sep 12, 2010 at 5:43 PM, Erick Erickson erickerick...@gmail.com wrote:
 Peter:

 This kind of information is extremely useful to document, thanks! Do you
 have the time/energy to put it up on the Wiki? Anyone can edit it by
 creating
 a logon. If you don't, would it be OK if someone else did it (with
 attribution,
 of course)? I guess that by bringing it up I'm volunteering :)...

 Best
 Erick

 On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge peter.stu...@gmail.comwrote:

 [Peter's original cache-tuning notes quoted - trimmed; see the full copy earlier in this digest.]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
Hi Dennis,

These are the Lucene file segments that hold the index data on the file system.
Have a look at: http://wiki.apache.org/solr/SolrPerformanceFactors

Peter


On Mon, Sep 13, 2010 at 7:02 AM, Dennis Gearon gear...@sbcglobal.net wrote:
 BTW, what is a segment?

 I've only heard about them in the last 2 weeks here on the list.
 Dennis Gearon

 Signature Warning
 
 EARTH has a Right To Life,
  otherwise we all die.

 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php


 --- On Sun, 9/12/10, Jason Rutherglen jason.rutherg...@gmail.com wrote:
 [Jason's reply and the nested cache-tuning notes quoted - trimmed; see the full text earlier in this digest.]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Simon Willnauer
On Mon, Sep 13, 2010 at 8:02 AM, Dennis Gearon gear...@sbcglobal.net wrote:
 BTW, what is a segment?

On the Lucene level an index is composed of one or more index
segments. Each segment is an index by itself and consists of several
files like doc stores, proximity data, term dictionaries etc. During
indexing, Lucene / Solr creates those segments depending on RAM buffer
/ document buffer settings and flushes them to disk (if you index to
disk). Once a segment has been flushed, Lucene will never change it
(well, up to a certain level - let's keep this simple) but writes new
segments for newly added documents. Since segments have a
write-once policy, Lucene merges multiple segments into a new segment
(how and when this happens is a different story) from time to time to
get rid of deleted documents and to reduce the overall number of
segments in the index.
Generally a higher number of segments will also influence your search
performance, since Lucene performs almost all operations on a
per-segment level. If you want to reduce the number of segments to one,
you need to call optimize and Lucene will merge all existing ones into
one single segment.
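
For illustration, a minimal SolrJ sketch of triggering that optimize (the URL
is a placeholder); note that on a large index this rewrites the whole index and
can take a while:

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class OptimizeIndex {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Merges all existing segments down to a single segment
            server.optimize();
        }
    }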

hope that answers your question

simon

 I've only heard about them in the last 2 weeks here on the list.
 Dennis Gearon

 Signature Warning
 
 EARTH has a Right To Life,
  otherwise we all die.

 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php


 --- On Sun, 9/12/10, Jason Rutherglen jason.rutherg...@gmail.com wrote:
 [Jason's reply and the nested cache-tuning notes quoted - trimmed; see the full text earlier in this digest.]

Re: Sorting not working on a string field

2010-09-13 Thread Jan Høydahl / Cominvent
Hi,

Can you show us what result you actually get? Wouldn't it make more sense to 
choose a numeric fieldtype? To get a proper sort order of numbers in a string 
field, all numbers need to be exactly the same length, since the order will be 
lexicographic, i.e. 10 will come before 2, but after 02.
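
For illustration, assuming the standard 1.4 example schema types, a sortable
numeric declaration might look like this (hypothetical):

    <field name="time_code" type="sdouble" indexed="true" stored="true"/>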

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10. sep. 2010, at 19.14, n...@frameweld.com wrote:

 Hello, I seem to be having a problem with sorting. I have a string field 
 (time_code) that I want to order by. When the results come up, they are 
 displayed in a different order than by relevance, as I would expect, but the 
 results aren't correctly ordered. The data in time_code came from a numeric 
 decimal with six-digit precision, if that makes a difference (ex: 1.00).
 
 Here is the query I give it:
 
 q=ceremony+AND+presentation_id%3A296+AND+type%3Ablob&version=1.3&json.nl=map&rows=10&start=0&wt=json&hl=true&hl.fl=text&hl.simple.pre=span+class%3Dhl&hl.simple.post=%2Fspan&hl.fragsize=0&hl.mergeContiguous=false&sort=time_code+asc
 
 
 And here's the field schema:
 
 <field name="presentation_id" type="sint" indexed="true" stored="true"/>
 <field name="asset_id" type="sint" indexed="true" stored="true"/>
 <field name="type" type="text" indexed="true" stored="true"/>
 <field name="text" type="text" indexed="true" stored="true" multiValued="true"/>
 <field name="time_code" type="string" indexed="true" stored="true"/>
 <field name="unique_key" type="string" indexed="true" stored="true"/>
 <dynamicField name="*" type="text" indexed="true" stored="true"/>
 <field name="all_text" type="text" indexed="true" stored="true" allowDups="true" multiValued="true"/>
 <copyField source="*" dest="all_text"/>
 <field name="text_dup" type="textSpell" indexed="true" stored="true" allowDups="true"/>
 <copyField source="text" dest="text_dup"/>
 
 Thanks for any help.
 



Re: mm=0?

2010-09-13 Thread Jan Høydahl / Cominvent
As Erick points out, you don't want a random doc as response!
What you're looking at is how to avoid the 0 hits problem.
You could look into one of these:
* Introduce autosuggest to avoid many 0-hits cases
* Introduce spellchecking
* Re-run the failed query with fuzzy turned on (e.g. alpha~)
* Redirect user to some other, broader source (wikipedia, google...) if 
relevant to your domain.
No matter what you do, it is important to communicate it to the user in a very 
clear way.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 11. sep. 2010, at 19.10, Satish Kumar wrote:

 Hi,
 
 We have a requirement to show at least one result every time -- i.e., even
 if the user-entered term is not found in any of the documents. I was hoping
 that setting mm to 0 would return results in all cases, but it does not.
 
 For example, if the user entered the term alpha and it is *not* in any of
 the documents in the index, any document in the index can be returned. If the
 term alpha is in the document set, only documents having the term alpha must
 be returned.
 
 My idea so far is to perform a search using the user-entered term. If there are
 any results, return them. If there are no results, perform another search
 without the query term -- this means doing two searches. Any suggestions on
 implementing this requirement using only one search?
 
 
 Thanks,
 Satish



Re: Multiple sorting on text fields

2010-09-13 Thread Stanislaw
Hi Dennis,
thanks for the reply.
Please explain which filter you mean.

I'm searching on only one field, with names:
query.setQuery(suchstring);

then I'm adding two sorts on other fields:
query.addSortField("type", SolrQuery.ORDER.asc);
query.addSortField("sortName", SolrQuery.ORDER.asc);

the results should be sorted first by 'type' (only one letter, 'A'
or 'B')
and then they should be sorted by names.

How can I define 'OR' or 'AND' relations here?

Best regards,
Stanislaw


2010/9/13 Dennis Gearon gear...@sbcglobal.net

 [Dennis's reply and the original message quoted - trimmed; see above.]



Re: Solr CoreAdmin create ignores dataDir Parameter

2010-09-13 Thread Frank Wesemann

MitchK wrote:

Frank,

have a look at SOLR-646.

Do you think a workaround via the dataDir tag in solrconfig.xml can
help?
I'm thinking about something like <dataDir>${solr./data/corename}</dataDir> for
illustration.

Unfortunately I am not very skilled in working with solr's variables and
therefore I do not know what variables are available. 
  

No, variables are not available at this stage.

If we find a solution, we should provide it as a suggestion at the wiki's
CoreAdmin-page.

Kind regards,
- Mitch
  



--
mit freundlichem Gruß,

Frank Wesemann
Fotofinder GmbH USt-IdNr. DE812854514
Software EntwicklungWeb: http://www.fotofinder.com/
Potsdamer Str. 96   Tel: +49 30 25 79 28 90
10785 BerlinFax: +49 30 25 79 28 999

Sitz: Berlin
Amtsgericht Berlin Charlottenburg (HRB 73099)
Geschäftsführer: Ali Paczensky





Re: Multiple sorting on text fields

2010-09-13 Thread Erick Erickson
A couple of things come to mind:
1 what happens if you remove the sort clauses?
 Because I suspect they're irrelevant and your
 duplicate issue is something different.
2 SOLR admin should let you determine this.
3 Please show us the configurations that
 make you sure that the documents
 are unique (I'm assuming you've defined
 uniqueKey in your schema, but please
 show us. And show us the field TYPE
  definition).
4 Assuming the uniqueKey is defined, did you
 perhaps define it after you'd indexed some
 documents? SOLR doesn't apply uniqueness
 retroactively.
5 Your secondary sort looks like it's on a tokenized
 field (again guessing, you haven't provided your
 schema definitions). It should not be. NOTE: this
 is different than multivalued! Again, I doubt this
  has anything to do with your duplicate issue, but
 it'll make your sorting interesting.

Again, I think the sorting is unrelated to your underlying
duplication issue, so until you're sure your index is in the
state you think it's in, I'd ignore sorting..
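
For reference, a hypothetical schema.xml sketch illustrating 3 and 5 - a
declared uniqueKey plus an untokenized string copy of the name field used only
for sorting (all names here are made up for illustration):

    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="sortName" type="string" indexed="true" stored="true"/>
    <copyField source="name" dest="sortName"/>
    <uniqueKey>id</uniqueKey>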

Best
Erick

On Mon, Sep 13, 2010 at 5:56 AM, Stanislaw solrgeschic...@googlemail.com wrote:

 [Stanislaw's message and the earlier replies quoted - trimmed; see above.]



stopwords in AND clauses

2010-09-13 Thread Xavier Noria
Let's suppose we have a regular search field body_t, and an internal
boolean flag flag_t not exposed to the user.

I'd like

body_t:foo AND flag_t:true

to be an intersection, but if foo is a stopword I get all documents
for which flag_t is true, as if the first clause was dropped, or as if
technically all documents match an empty string.

Is there a way to get 0 results instead?


Re: stopwords in AND clauses

2010-09-13 Thread Simon Willnauer
On Mon, Sep 13, 2010 at 3:27 PM, Xavier Noria f...@hashref.com wrote:
 Let's suppose we have a regular search field body_t, and an internal
 boolean flag flag_t not exposed to the user.

 I'd like

    body_t:foo AND flag_t:true

this is Solr, right? Why don't you use a filter query for your unexposed
flag_t field: q=body_t:foo&fq=flag_t:true
this might help too: http://wiki.apache.org/solr/CommonQueryParameters#fq
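
In SolrJ, that might look roughly like this (using the field names from your
example; the server URL is a placeholder):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FlagFilterQuery {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery();
            query.setQuery("body_t:foo");        // the user-entered part only
            query.addFilterQuery("flag_t:true"); // internal flag as a filter query
            QueryResponse rsp = server.query(query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }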

simon

 to be an intersection, but if foo is a stopword I get all documents
 for which flag_t is true, as if the first class was dropped, or if
 technically all documents match an empty string.

 Is there a way to get 0 results instead?



Re: stopwords in AND clauses

2010-09-13 Thread Xavier Noria
On Mon, Sep 13, 2010 at 4:29 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:

 On Mon, Sep 13, 2010 at 3:27 PM, Xavier Noria f...@hashref.com wrote:
 Let's suppose we have a regular search field body_t, and an internal
 boolean flag flag_t not exposed to the user.

 I'd like

    body_t:foo AND flag_t:true

 this is Solr, right? Why don't you use a filter query for your unexposed
 flag_t field: q=body_t:foo&fq=flag_t:true
 this might help too: http://wiki.apache.org/solr/CommonQueryParameters#fq

Sounds good.


Re: mm=0?

2010-09-13 Thread Satish Kumar
Hi Erik,

I completely agree with you that showing a random document for user's query
would be very poor experience. I have raised this in our product review
meetings before. I was told that because of a contractual agreement some
sponsored content needs to be returned even if it means no match. And the
sponsored content drives the ads displayed on the page-- so it is more for
showing some ad on the page when there is no matching result from sponsored
content for user's query.

Note that some other content in addition to sponsored content is displayed
on the page, so the user is not seeing just one random result when there is not
a good match.

It looks like I have to do another search to get a random result when there
are no results. In this case I will use RandomSortField to generate a random
result (so that a different ad is displayed from the set of sponsored ads) for
each no-result case.
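
A rough SolrJ sketch of that two-step fallback (it assumes a dynamic field such
as random_* of type solr.RandomSortField exists in the schema; all names and
the URL are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocumentList;

    public class SponsoredFallback {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrDocumentList results =
                    server.query(new SolrQuery("alpha")).getResults();
            if (results.getNumFound() == 0) {
                // No real match: fall back to one random sponsored document
                SolrQuery fallback = new SolrQuery("*:*");
                fallback.setRows(1);
                // A different seed (random_<seed>) gives a different document
                fallback.addSortField("random_" + System.currentTimeMillis(),
                        SolrQuery.ORDER.asc);
                results = server.query(fallback).getResults();
            }
            System.out.println("documents to display: " + results.size());
        }
    }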

Thanks for the comments!


Satish



On Sun, Sep 12, 2010 at 10:25 AM, Erick Erickson erickerick...@gmail.comwrote:

 Could you explain the use-case a bit? Because the very
 first response I would have is why in the world did
 product management make this a requirement and try
 to get the requirement changed

 As a user, I'm having a hard time imagining being well
 served by getting a document in response to a search that
 had no relation to my search, it was just a random doc
 selected from the corpus.

 All that said, I don't think a single query would do the trick.
 You could include a very special document with a field
 that no other document had with very special text in it. Say
 field name bogusmatch, filled with the text bogustext
 then, at least the second query would match one and only
 one document and would take minimal time. Or you could
 tack on to each and every query OR bogusmatch:bogustext^0.001
 (which would really be inexpensive) and filter it out if there
 was more than one response. By boosting it really low, it should
 always appear at the end of the list which wouldn't be a bad thing.

 DisMax might help you here...

 But do ask if it is really a requirement or just something nobody's
 objected to before bothering IMO...

 Best
 Erick

 On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar 
 satish.kumar.just.d...@gmail.com wrote:

   [Satish's original message quoted - trimmed; see above.]
 



Re: Sorting not working on a string field

2010-09-13 Thread noel
You're right, it would be better to just give it a sortable numerical value. 
For now I gave time_code an sdouble type to see if it sorted, and it did. 
However, all the trailing 0's are trimmed, but that shouldn't be a problem unless it 
were to truncate values past the hundredths place.

Thanks.
- Noel

-Original Message-
From: Jan Høydahl / Cominvent jan@cominvent.com
Sent: Monday, September 13, 2010 5:31am
To: solr-user@lucene.apache.org
Subject: Re: Sorting not working on a string field

[Jan's reply and the original message quoted - trimmed; see above.]





Re: Multiple sorting on text fields

2010-09-13 Thread Dennis Gearon
I thought I saw 'custom analyzer', but you wrote 'custom field'.

My mistake.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/13/10, Stanislaw solrgeschic...@googlemail.com wrote:

 [Stanislaw's message and the earlier replies quoted - trimmed; see above.]



Re: mm=0?

2010-09-13 Thread Dennis Gearon
This issue is one I hope to head off in my application / on my site. Instead of 
an ad feed, I HOPE to be able to have an ad QUEUE on my site. If necessary, 
I'll convert the feed TO a queue.

The queue will get a first pass done on it by either an employee or a 
compensated user. Either one generates up to 4 keywords/tags for the 
advertisement. THEY determine when the ad gets shown based on relevancy.

Nice idea, hope it'll fly :-)

I actually detest the ads that say 'Lucene instance for sale, lowest prices!', 
or the industrial clearing houses that make you wade through 4-6 screens to 
find that you need a membership in order to look up the price of some stainless 
steel nuts. 

And usually, those ads must be paying top dollar, because they are the first 
three ads on Google's search (that is, until recently.) Anyone notice that 
there are hardly any more ads on Google search results?


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/13/10, Satish Kumar satish.kumar.just.d...@gmail.com wrote:

 From: Satish Kumar satish.kumar.just.d...@gmail.com
 Subject: Re: mm=0?
 To: solr-user@lucene.apache.org
 Date: Monday, September 13, 2010, 7:41 AM
 Hi Erik,
 
 I completely agree with you that showing a random document
 for user's query
 would be very poor experience. I have raised this in our
 product review
 meetings before. I was told that because of contractual
 agreement some
 sponsored content needs to be returned even if it meant no
 match. And the
 sponsored content drives the ads displayed on the page-- so
 it is more for
 showing some ad on the page when there is no matching
 result from sponsored
 content for user's query.
 
 Note that some other content in addition to sponsored
 content is displayed
 on the page, so user is not seeing just one random result
 when there is not
 a good match.
 
 It looks like I have to do another search to get a random
 result when there
 are no results. In this case I will use RandomSortField to
 generate random
 result (so that a different ad is displayed from set of
 sponsored ads) for
 each no result case.
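
 [For illustration, a minimal sketch of that fallback; the field names and the
 fq filter for the sponsored set are assumptions, not from this thread:

     In schema.xml:
       <fieldType name="random" class="solr.RandomSortField" indexed="true" />
       <dynamicField name="random_*" type="random" />

     Fallback query when the user's term matched nothing:
       http://localhost:8983/solr/select?q=*:*&fq=type:sponsored&sort=random_1234+asc&rows=1

 Using a different suffix per request (random_1234, random_5678, ...) changes
 the seed, so a different sponsored ad comes back each time.]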
 
 Thanks for the comments!
 
 
 Satish
 
 
 
 On Sun, Sep 12, 2010 at 10:25 AM, Erick Erickson 
  erickerick...@gmail.com wrote:
 
  Could you explain the use-case a bit? Because the
 very
  first response I would have is why in the world did
  product management make this a requirement and try
  to get the requirement changed
 
  As a user, I'm having a hard time imagining being
 well
  served by getting a document in response to a search
 that
  had no relation to my search, it was just a random
 doc
  selected from the corpus.
 
  All that said, I don't think a single query would do
 the trick.
  You could include a very special document with a
 field
  that no other document had with very special text in
 it. Say
  field name bogusmatch, filled with the text
 bogustext
  then, at least the second query would match one and
 only
  one document and would take minimal time. Or you
 could
  tack on to each and every query OR
 bogusmatch:bogustext^0.001
  (which would really be inexpensive) and filter it out
 if there
  was more than one response. By boosting it really low,
 it should
  always appear at the end of the list which wouldn't be
 a bad thing.
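
  [A rough sketch of that single-query trick; the document id and field name
  below are made up for illustration:

      One sentinel document indexed up front:
        <add><doc>
          <field name="id">sentinel-1</field>
          <field name="bogusmatch">bogustext</field>
        </doc></add>

      Every user query gets the cheap, low-boosted clause appended:
        q=(alpha) OR bogusmatch:bogustext^0.001

      If numFound > 1, filter the sentinel out of the rendered results; if
      numFound == 1, the sentinel is the only hit and stands in for the
      "no match" case.]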
 
  DisMax might help you here...
 
  But do ask if it is really a requirement or just
 something nobody's
  objected to before bothering IMO...
 
  Best
  Erick
 
  On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar 
  satish.kumar.just.d...@gmail.com
 wrote:
 
   Hi,
  
   We have a requirement to show at least one result
 every time -- i.e.,
  even
   if user entered term is not found in any of the
 documents. I was hoping
   setting mm to 0 will return results in all cases,
 but it is not.
  
   For example, if user entered term alpha and it
 is *not* in any of the
   documents in the index, any document in the index
 can be returned. If
  term
   alpha is in the document set, documents having
 the term alpha only
  must
   be returned.
  
   My idea so far is to perform a search using user
 entered term. If there
  are
   any results, return them. If there are no
 results, perform another search
   without the query term-- this means doing two
 searches. Any suggestions
  on
   implementing this requirement using only one
 search?
  
  
   Thanks,
   Satish
  
 
 


Re: mm=0?

2010-09-13 Thread Dennis Gearon
I just tried several searches again on google.

I think they've refined the ad placements so that certain kinds of searches 
return no ads, the kinds that I've been doing related to programming being one 
of them.

If OTOH I do some product related search, THEN lots of ads show up, but fairly 
accurate ones.

They've improved the ads placement a LOT!

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/13/10, Satish Kumar satish.kumar.just.d...@gmail.com wrote:

 From: Satish Kumar satish.kumar.just.d...@gmail.com
 Subject: Re: mm=0?
 To: solr-user@lucene.apache.org
 Date: Monday, September 13, 2010, 7:41 AM
 Hi Erik,
 
 I completely agree with you that showing a random document
 for user's query
 would be very poor experience. I have raised this in our
 product review
 meetings before. I was told that because of contractual
 agreement some
 sponsored content needs to be returned even if it meant no
 match. And the
 sponsored content drives the ads displayed on the page-- so
 it is more for
 showing some ad on the page when there is no matching
 result from sponsored
 content for user's query.
 
 Note that some other content in addition to sponsored
 content is displayed
 on the page, so user is not seeing just one random result
 when there is not
 a good match.
 
 It looks like I have to do another search to get a random
 result when there
 are no results. In this case I will use RandomSortField to
 generate random
 result (so that a different ad is displayed from set of
 sponsored ads) for
 each no result case.
 
 Thanks for the comments!
 
 
 Satish
 
 
 
 On Sun, Sep 12, 2010 at 10:25 AM, Erick Erickson 
  erickerick...@gmail.com wrote:
 
  Could you explain the use-case a bit? Because the
 very
  first response I would have is why in the world did
  product management make this a requirement and try
  to get the requirement changed
 
  As a user, I'm having a hard time imagining being
 well
  served by getting a document in response to a search
 that
  had no relation to my search, it was just a random
 doc
  selected from the corpus.
 
  All that said, I don't think a single query would do
 the trick.
  You could include a very special document with a
 field
  that no other document had with very special text in
 it. Say
  field name bogusmatch, filled with the text
 bogustext
  then, at least the second query would match one and
 only
  one document and would take minimal time. Or you
 could
  tack on to each and every query OR
 bogusmatch:bogustext^0.001
  (which would really be inexpensive) and filter it out
 if there
  was more than one response. By boosting it really low,
 it should
  always appear at the end of the list which wouldn't be
 a bad thing.
 
  DisMax might help you here...
 
  But do ask if it is really a requirement or just
 something nobody's
  objected to before bothering IMO...
 
  Best
  Erick
 
  On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar 
  satish.kumar.just.d...@gmail.com
 wrote:
 
   Hi,
  
   We have a requirement to show at least one result
 every time -- i.e.,
  even
   if user entered term is not found in any of the
 documents. I was hoping
   setting mm to 0 will return results in all cases,
 but it is not.
  
   For example, if user entered term alpha and it
 is *not* in any of the
   documents in the index, any document in the index
 can be returned. If
  term
   alpha is in the document set, documents having
 the term alpha only
  must
   be returned.
  
   My idea so far is to perform a search using user
 entered term. If there
  are
   any results, return them. If there are no
 results, perform another search
   without the query term-- this means doing two
 searches. Any suggestions
  on
   implementing this requirement using only one
 search?
  
  
   Thanks,
   Satish
  
 
 


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Dennis Gearon
Thanks guys for the explanation.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/13/10, Simon Willnauer simon.willna...@googlemail.com wrote:

 From: Simon Willnauer simon.willna...@googlemail.com
 Subject: Re: Tuning Solr caches with high commit rates (NRT)
 To: solr-user@lucene.apache.org
 Date: Monday, September 13, 2010, 1:33 AM
 On Mon, Sep 13, 2010 at 8:02 AM,
 Dennis Gearon gear...@sbcglobal.net
 wrote:
  BTW, what is a segment?
 
 On the Lucene level an index is composed of one or more
 index
 segments. Each segment is an index by itself and consists
 of several
 files like doc stores, proximity data, term dictionaries
 etc. During
 indexing Lucene / Solr creates those segments depending on
 ram buffer
 / document buffer settings and flushes them to disk (if you
 index to
 disk). Once a segment has been flushed Lucene will never
 change the
 segments (well, up to a certain level - let's keep this
 simple) but
 write new ones for new added documents. Since segments have
 a
 write-once policy Lucene merges multiple segments into a
 new segment
 (how and when this happens is a different story) from time to
 time to
 get rid of deleted documents and to reduce the number of
 overall
 segments in the index.
 Generally a higher number of segments will also influence
 your search
 performance since Lucene performs almost all operations on
 a
 per-segment level. If you want to reduce the number of
 segments to one
 you need to call optimize and lucene will merge all
 existing ones into
 one single segment.
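
 [As a concrete sketch of that last step, assuming a SolrJ client pointed at a
 default core (URL is a placeholder):

     // Merge all segments into a single one - can be I/O heavy on large indexes
     SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
     server.optimize();   // same effect as posting <optimize/> to /update]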
 
 hope that answers your question
 
 simon
 
  I've only heard about them in the last 2 weeks here on
 the list.
  Dennis Gearon
 
  Signature Warning
  
  EARTH has a Right To Life,
   otherwise we all die.
 
  Read 'Hot, Flat, and Crowded'
  Laugh at http://www.yert.com/film.php
 
 
  --- On Sun, 9/12/10, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:
 
  From: Jason Rutherglen jason.rutherg...@gmail.com
  Subject: Re: Tuning Solr caches with high commit
 rates (NRT)
  To: solr-user@lucene.apache.org
  Date: Sunday, September 12, 2010, 7:52 PM
  Yeah there's no patch... I think
  Yonik can write it. :-)  Yah... The
  Lucene version shouldn't matter.  The
 distributed
  faceting
  theoretically can easily be applied to multiple
 segments,
  however the
  way it's written for me is a challenge to untangle
 and
  apply
  successfully to a working patch.  Also I don't
 have
  this as an itch to
  scratch at the moment.
 
  On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge
 peter.stu...@gmail.com
  wrote:
   Hi Jason,
  
   I've tried some limited testing with the 4.x
 trunk
  using fcs, and I
   must say, I really like the idea of
 per-segment
  faceting.
   I was hoping to see it in 3.x, but I don't
 see this
  option in the
   branch_3x trunk. Is your SOLR-1606 patch
 referred to
  in SOLR-1617 the
   one to use with 3.1?
   There seems to be a number of Solr issues
 tied to this
  - one of them
   being Lucene-1785. Can the per-segment
 faceting patch
  work with Lucene
   2.9/branch_3x?
  
   Thanks,
   Peter
  
  
  
   On Mon, Sep 13, 2010 at 12:05 AM, Jason
 Rutherglen
   jason.rutherg...@gmail.com
  wrote:
   Peter,
  
   Are you using per-segment faceting, eg,
 SOLR-1617?
   That could help
   your situation.
  
   On Sun, Sep 12, 2010 at 12:26 PM, Peter
 Sturge
  peter.stu...@gmail.com
  wrote:
   Hi,
  
   Below are some notes regarding Solr
 cache
  tuning that should prove
   useful for anyone who uses Solr with
 frequent
  commits (e.g. 5min).
  
   Environment:
   Solr 1.4.1 or branch_3x trunk.
   Note the 4.x trunk has lots of neat
 new
  features, so the notes here
   are likely less relevant to the 4.x
  environment.
  
   Overview:
   Our Solr environment makes extensive
 use of
  faceting, we perform
   commits every 30secs, and the indexes
 tend be
  on the large-ish side
   (20million docs).
   Note: For our data, when we commit,
 we are
  always adding new data,
   never changing existing data.
   This type of environment can be
 tricky to
  tune, as Solr is more geared
   toward fast reads than frequent
 writes.
  
   Symptoms:
   If anyone has used faceting in
 searches where
  you are also performing
   frequent commits, you've likely
 encountered
  the dreaded OutOfMemory or
   GC Overhead Exeeded errors.
   In high commit rate environments,
 this is
  almost always due to
   multiple 'onDeck' searchers and
 autowarming -
  i.e. new searchers don't
   finish autowarming their caches
 before the
  next commit()
   comes along and invalidates them.
   Once this starts happening on a
 regular basis,
  it is likely your
   Solr's JVM will run out of memory
 eventually,
  as the number of
   searchers (and their cache arrays)
 will keep
  growing until the JVM
   dies of thirst.
   To check if your Solr environment is
 suffering
  from 

Re: what differents between SolrCloud and Solr+Hadoop

2010-09-13 Thread Lance Norskog
You do not need either addition if you just want to have multiple Solr
instances on different machines, and query them all at once. Look at
this for the simplest way:

http://wiki.apache.org/solr/DistributedSearch
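
Roughly, that page boils down to adding a shards parameter listing every
instance to query (host names below are placeholders):

    http://host1:8983/solr/select?shards=host1:8983/solr,host2:8983/solr&q=ipod&fl=id,name

Each listed shard is queried in parallel and the results are merged; the
documents themselves still have to be distributed across the shards by you.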

On Mon, Sep 13, 2010 at 12:52 AM, Marc Sturlese marc.sturl...@gmail.com wrote:

 Well these are pretty different things. SolrCloud is meant to handle
 distributed search in an easier way than raw Solr distributed search.
 You have to build the shards in your own way.
 Solr+hadoop is a way to build these shards/indexes in parallel.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/what-differents-between-SolrCloud-and-Solr-Hadoop-tp1463809p1464106.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: mm=0?

2010-09-13 Thread Lance Norskog
Java Swing no longer gives ads for swinger's clubs.

On Mon, Sep 13, 2010 at 9:37 AM, Dennis Gearon gear...@sbcglobal.net wrote:
 I just tried several searches again on google.

 I think they've refined the ads placements so that certain kind of searches 
 return no ads, the kinds that I've been doing relative to programming being 
 one of them.

 If OTOH I do some product related search, THEN lots of ads show up, but 
 fairly accurate ones.

 They've immproved the ads placement a LOT!

 Dennis Gearon

 Signature Warning
 
 EARTH has a Right To Life,
  otherwise we all die.

 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php




-- 
Lance Norskog
goks...@gmail.com


Re: mm=0?

2010-09-13 Thread Simon Willnauer
On Mon, Sep 13, 2010 at 8:07 PM, Lance Norskog goks...@gmail.com wrote:
 Java Swing no longer gives ads for swinger's clubs.
damned no i have to explicitly enter it?! - argh!

:)

simon

 On Mon, Sep 13, 2010 at 9:37 AM, Dennis Gearon gear...@sbcglobal.net wrote:
 I just tried several searches again on google.

 I think they've refined the ads placements so that certain kind of searches 
 return no ads, the kinds that I've been doing relative to programming being 
 one of them.

 If OTOH I do some product related search, THEN lots of ads show up, but 
 fairly accurate ones.

 They've immproved the ads placement a LOT!

 Dennis Gearon

 Signature Warning
 
 EARTH has a Right To Life,
  otherwise we all die.

 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php




 --
 Lance Norskog
 goks...@gmail.com



Re: How to Update Value of One Field of a Document in Index?

2010-09-13 Thread Zachary Chang

 Hi Savannah,

if you *only want to boost* documents based on the information you 
calculate from the MoreLikeThis results (i.e. a numeric measure), you 
might want to take a look at the ExternalFileField type. This field type 
reads its contents from a file which contains key-value pairs, e.g. the 
document ids and their corresponding measure values.
If some values change you still have to regenerate the whole file 
(instead of re-indexing the whole index). But of course, this file can be 
generated from a DB, which might be updated incrementally.


For setup and usage e.g. see: 
http://dev.tailsweep.com/solr-external-scoring/
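
A rough sketch of what that setup looks like (field and file names are
illustrative; the key field must be your unique id):

    In schema.xml:
      <fieldType name="mltScore" keyField="id" defVal="0" stored="false"
                 indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
      <field name="mlt_boost" type="mltScore"/>

    In the index data directory, a file named external_mlt_boost with one
    id=value pair per line:
      doc1=1.4
      doc2=0.2

The field can then only be used in function queries, e.g. boosting with
_val_:"mlt_boost"; new values are picked up when a new searcher is opened
(e.g. after a commit).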


Zachary

On 10.09.2010 19:57, Savannah Beckett wrote:

I want to do MoreLikeThis to find documents that are similar to the document
that I am indexing.  Then I want to calculate the average of one of the fields
of all those documents and input this average into a field of the document that
I am indexing.  From my research, it seems that MoreLikeThis can only be used to
find similarity of document that is already in the index.  So, I think I need to
index it first, and then use MoreLikeThis to find similar documents in the index
and then reindex that document.  Any better way?  I try not to reindex a
document because it's not efficient.  I don't have to use MoreLikeThis.
Thanks.




From: Jonathan Rochkind rochk...@jhu.edu
To: solr-user@lucene.apache.org
Sent: Fri, September 10, 2010 9:58:20 AM
Subject: RE: How to Update Value of One Field of a Document in Index?

More like this is intended to be run at query time. For what reasons are you
thinking you want to (re-)index each document based on the results of
MoreLikeThis?  You're right that that's not what the component is intended for.


Jonathan

From: Savannah Beckett [savannah_becket...@yahoo.com]
Sent: Friday, September 10, 2010 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: How to Update Value of One Field of a Document in Index?

Thanks.  I am trying to use MoreLikeThis in Solr to find similar documents in
the solr index and use the data from these similar documents to modify a field
in each document that I am indexing.  I found that MoreLikeThis in Solr only
works when the document is in the index, is it true?  If so, I may have to wait
til the indexing is finished, then run my own command to do MoreLikeThis to each
document in the index, and then reindex each document?  It sounds like it's not
efficient.  Is there a better way?
Thanks.





From: Liam O'Boyle liam.obo...@intelligencebank.com
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.

Liam
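
[Since Solr 1.4 has no partial update, the cycle Liam describes looks roughly
like this in SolrJ; the id, the field name, and the existing "server" instance
are illustrative assumptions:

    // Fetch the stored copy of the document, change one field, and re-add it.
    SolrQuery q = new SolrQuery("id:doc1");
    SolrDocument old = server.query(q).getResults().get(0);

    SolrInputDocument doc = new SolrInputDocument();
    for (String name : old.getFieldNames()) {
        doc.addField(name, old.getFieldValue(name));   // copy every stored field
    }
    doc.setField("avg_score", 42.0f);                  // the one field being changed

    server.add(doc);    // replaces the old document (same unique key)
    server.commit();

Fields that are filled in by copyField should be skipped in the copy loop, or
their values will be duplicated on re-add.]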

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
savannah_becket...@yahoo.com  wrote:

I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
update the value of one of the fields of a document in the solr index after the
document was already indexed, and I have only the document id.  How do I do
that?

Thanks.










RE: Solr memory use, jmap and TermInfos/tii

2010-09-13 Thread Burton-West, Tom
Thanks Robert and everyone!

I'm working on changing our JVM settings today, since putting Solr 1.4.1 into 
production will take a bit more work and testing.  Hopefully, I'll be able to 
test the setTermIndexDivisor on our test server tomorrow.

Mike, I've started the process to see if we can provide you with our tii/tis 
data.  I'll let you know as soon as I hear anything.  


Tom

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Sunday, September 12, 2010 10:48 AM
To: solr-user@lucene.apache.org; simon.willna...@gmail.com
Subject: Re: Solr memory use, jmap and TermInfos/tii

On Sun, Sep 12, 2010 at 9:57 AM, Simon Willnauer 
simon.willna...@googlemail.com wrote:

  To change the divisor in your solrconfig, for example to 4, it looks like
  you need to do this.
 
    <indexReaderFactory name="IndexReaderFactory"
                        class="org.apache.solr.core.StandardIndexReaderFactory">
      <int name="setTermIndexInterval">4</int>
    </indexReaderFactory> 

 Ah, thanks robert! I didn't know about that one either!

 simon


actually I'm wrong, for solr 1.4, use setTermIndexDivisor.

i was looking at 3.1/trunk and there is a bug in the name of this parameter:
https://issues.apache.org/jira/browse/SOLR-2118
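
So for 1.4.x the snippet above would presumably read (untested sketch, only the
parameter name changed per the correction):

    <indexReaderFactory name="IndexReaderFactory"
                        class="org.apache.solr.core.StandardIndexReaderFactory">
      <int name="setTermIndexDivisor">4</int>
    </indexReaderFactory>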

-- 
Robert Muir
rcm...@gmail.com


RE: Solr and jvm Garbage Collection tuning

2010-09-13 Thread Burton-West, Tom
Thanks Kent for your info.  

We are not doing any faceting, sorting, or much else.  My guess is that most of 
the memory increase is just the data structures created when parts of the frq 
and prx files get read into memory.  Our frq files are about 77GB  and the prx 
files are about 260GB per shard and we are running 3 shards per machine.   I 
suspect that the document cache and query result cache don't take up that much 
space, but will try a run with those caches set to 0, just to see.

We have dual 4 core processors and 74GB total memory.  We want to leave a 
significant amount of memory free for OS disk caching. 

We tried increasing the memory from 20GB to 28GB and adding the 
-XX:MaxGCPauseMillis=1000 flag, but that seemed to have no effect.  

Currently I'm testing using the ConcurrentMarkSweep and that's looking much 
better although I don't understand why it has sized the Eden space down into 
the 20MB range. However, I am very new to Java memory management.

Anyone know whether, when using ConcurrentMarkSweep, it's better to let the JVM size 
the Eden space or better to give it some hints?


Once we get some decent JVM settings we can put into production I'll be testing 
using termIndexInterval with Solr 1.4.1 on our test server.

Tom
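
[For the Eden-space question above, a typical starting point is to pin the
young generation explicitly rather than letting CMS shrink it; the sizes below
are placeholders, not recommendations for this index:

    java -Xms28g -Xmx28g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:NewSize=2g -XX:MaxNewSize=2g \
         -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log ...]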

-Original Message-
From: Grant Ingersoll [mailto:gsing...@apache.org] 

What are your current GC settings?  Also, I guess I'd look at ways you can 
reduce the heap size needed. 
 Caching, field type choices, faceting choices.  
Also could try playing with the termIndexInterval which will load fewer terms 
into memory at the cost of longer seeks. 

 At some point, though, you just may need more shards and the resulting 
 smaller indexes.  How many CPU cores do you have on each machine?


Re: How to extend IndexSchema and SchemaField

2010-09-13 Thread Chris Hostetter

: Yes, I have thought of that, or even extending field type. But this does not
: work for my use case, since I can have multiple fields of a same type
: (therefore with the same field type, and same analyzer), but each one of them
: needs specific information. Therefore, I think the only nice way to achieve
: this is to have the possibility to add attributes to any field definition.

Right, at the moment custom FieldType classes can specify whatever 
attributes they want to use in the <fieldType/> declaration -- but it's 
not possible to specify arbitrary attributes that can be used in the 
<field/> declaration.

By all means, please open an issue requesting this as a feature.

I don't know that anyone explicitly set out to impose this limitation, but 
one of the reasons it likely exists is because SchemaField is not 
something that is intended to be customized -- while FieldType 
objects are constructed once at startup, SchemaField objects are 
frequently created on the fly when dealing with dynamicFields, so 
initialization complexity is kept to a minimum.  

That said -- this definitely seems like the type of use case that we 
should try to find *some* solution for -- even if it just means having 
Solr automatically create hidden FieldType instances for you on startup 
based on attributes specified in the <field/> that the corresponding 
FieldType class understands.


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Need Advice for Finding Freelance Solr Expert

2010-09-13 Thread Chris Hostetter

: References: c75bde7840b0f74f853020b773ce450302b35...@nyserver19.nextjump.com
:  4c881061.60...@jhu.edu
: c75bde7840b0f74f853020b773ce450302b35...@nyserver19.nextjump.com
: In-Reply-To:
: c75bde7840b0f74f853020b773ce450302b35...@nyserver19.nextjump.com
: Subject: Need Advice for Finding Freelance Solr Expert

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



RE: Field names

2010-09-13 Thread Peter A. Kirk
Fantastic - that is exactly what I was looking for!

But here is one thing I don't understand:

If I call the url:
http://localhost:8983/solr/admin/luke?numTerms=10&fl=name

Some of the result looks like:

<lst name="fields">
  <lst name="name">
    <lst name="topTerms">
      <int name="gb">18</int> 

Does this mean that the term gb occurs 18 times in the name field?

Because if I issue this search:
http://localhost:8983/solr/select/?q=name:gb

I get results like:
<result name="response" numFound="9" start="0">
  <doc>

So it only finds 9?

What do the above results actually tell me?

Thanks,
Peter


From: Ryan McKinley [ryan...@gmail.com]
Sent: Tuesday, 14 September 2010 11:30
To: solr-user@lucene.apache.org
Subject: Re: Field names

check:
http://wiki.apache.org/solr/LukeRequestHandler



On Mon, Sep 13, 2010 at 7:00 PM, Peter A. Kirk p...@alpha-solutions.dk wrote:
 Hi

 is it possible to issue a query to solr, to get a list which contains all the 
 field names in the index?

 What about getting a list of the frequency of individual words in each field?

 thanks,
 Peter


Re: Solr memory use, jmap and TermInfos/tii

2010-09-13 Thread Michael McCandless
On Mon, Sep 13, 2010 at 6:29 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Robert and everyone!

 I'm working on changing our JVM settings today, since putting Solr 1.4.1 into 
 production will take a bit more work and testing.  Hopefully, I'll be able to 
 test the setTermIndexDivisor on our test server tomorrow.

 Mike, I've started the process to see if we can provide you with our tii/tis 
 data.  I'll let you know as soon as I hear anything.

Super, thanks Tom!

Mike


Re: Distance sorting with spatial filtering

2010-09-13 Thread Scott K
I tracked down the problem and found a workaround. If there is a
wildcard entry in schema.xml such as the following.

   <!-- Ignore any fields that don't already match an existing field name -->
   <dynamicField name="*" type="ignored" multiValued="true" />

then sort by function fails and returns Error 400 can not sort on
unindexed field: function name

Removing the name="*" entry from schema.xml is a workaround. I noted
this in the Solr-1297 JIRA entry.

Scott

On Fri, Sep 10, 2010 at 01:40, Lance Norskog goks...@gmail.com wrote:
 Since no one has jumped in to give the right syntax- yeah, it's a bug.
 Please file a JIRA.

 On Thu, Sep 9, 2010 at 9:44 PM, Scott K s...@skister.com wrote:
 On Thu, Sep 9, 2010 at 21:00, Lance Norskog goks...@gmail.com wrote:
 I just checked out the trunk, and branch 3.x. This query is accepted on both,
 but gives no responses:
 http://localhost:8983/solr/select/?q=*:*&sort=dist(2,x_dt,y_dt,0,0)+asc

 So you are saying when you add the sort parameter you get no results
 back, but do not get the error I am seeing? Should I open a Jira
 ticket?

 x_dt and y_dt are wildcard fields with the tdouble type. tdouble
 explicitly says it is stored and indexed. Your 'longitude' and 'latitude'
 fields may not be stored?

 No, they are stored.
 http://localhost:8983/solr/select?q=*:*&rows=1&wt=xml&indent=true
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">9</int>
 </lst>
 <result name="response" numFound="365775" start="0">
  <doc>
 ...
    <double name="latitude">47.6636</double>
    <double name="longitude">-122.3054</double>


 Also, this is accepted on both branches:
 http://localhost:8983/solr/select/?q=*:*&sort=sum(1)+asc

 The documentation for sum() does not mention single-argument calls.

 This also fails
 http://localhost:8983/solr/select/?q=*:*&sort=sum(1,2)+asc
 http://localhost:8983/solr/select/?q=*:*&sort=sum(latitude,longitude)+asc


 Scott K wrote:

 According to the documentation, sorting by function has been a feature
 since Solr 1.5. It seems like a major regression if this no longer
 works.
 http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function

 The _val_ trick does not seem to work if used with a query term,
 although I can try some more things to give 0 value to the query term.

  On Wed, Sep 8, 2010 at 22:21, Lance Norskog goks...@gmail.com  wrote:


 It says that the field sum(1) is not indexed. You don't have a field
 called 'sum(1)'. I know there has been a lot of changes in query parsing,
 and sorting by functions may be on the list. But the _val_ trick is the
 older one and, and you noted, still works. The _val_ trick sets the
 ranking
 value to the output of the function, thus indirectly doing what sort=
 does.

 Lance

 Scott K wrote:


 I get the error on all functions.
  GET 'http://localhost:8983/solr/select?q=*:*&sort=sum(1)+asc'
 Error 400 can not sort on unindexed field: sum(1)

 I tried another nightly build from today, Sep 7th, with the same
 results. I attached the schema.xml

 Thanks for the help!
 Scott

  On Wed, Sep 1, 2010 at 18:43, Lance Norskog goks...@gmail.com    wrote:



 Post your schema.

  On Mon, Aug 30, 2010 at 2:04 PM, Scott K s...@skister.com    wrote:



 The new spatial filtering (SOLR-1586) works great and is much faster
 than fq={!frange. However, I am having problems sorting by distance.
 If I try
 GET

  'http://localhost:8983/solr/select/?q=*:*&sort=dist(2,latitude,longitude,0,0)+asc'
 I get an error:
 Error 400 can not sort on unindexed field:
 dist(2,latitude,longitude,0,0)

 I was able to work around this with
  GET 'http://localhost:8983/solr/select/?q=*:* AND _val_:recip(dist(2,
  latitude, longitude, 0,0),1,1,1)&fl=*,score'

 But why isn't sorting by functions working? I get this error with any
 function I try to sort on.This is a nightly trunk build from Aug 25th.
 I see SOLR-1297 was reopened, but that seems to be for edge cases.

 Second question: I am using the LatLonType from the Spatial Filtering
 wiki, http://wiki.apache.org/solr/SpatialSearch
 Are there any distance sorting functions that use this field, or do I
 need to have three indexed fields, store_lat_lon, latitude, and
 longitude, if I want both filtering and sorting by distance.

 Thanks, Scott




 --
 Lance Norskog
 goks...@gmail.com










 --
 Lance Norskog
 goks...@gmail.com



Re: Solr and jvm Garbage Collection tuning

2010-09-13 Thread Stephen Green
On Mon, Sep 13, 2010 at 6:45 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Kent for your info.

 We are not doing any faceting, sorting, or much else.  My guess is that most 
 of the memory increase is just the data structures created when parts of the 
 frq and prx files get read into memory.  Our frq files are about 77GB  and 
 the prx files are about 260GB per shard and we are running 3 shards per 
 machine.   I suspect that the document cache and query result cache don't 
 take up that much space, but will try a run with those caches set to 0, just 
 to see.

 We have dual 4 core processors and 74GB total memory.  We want to leave a 
 significant amount of memory free for OS disk caching.

  We tried increasing the memory from 20GB to 28GB and adding the 
  -XX:MaxGCPauseMillis=1000 flag, but that seemed to have no effect.

 Currently I'm testing using the ConcurrentMarkSweep and that's looking much 
 better although I don't understand why it has sized the Eden space down into 
 the 20MB range. However, I am very new to Java memory management.

 Anyone know if when using ConcurrentMarkSweep its better to let the JVM size 
 the Eden space or better to give it some hints?

Really the best thing to do is to run the system for a while with GC
logging on and then look at how often the young generation GC is
occurring.  A set of parameters like:

-verbose:gc -XX:+PrintGCTimeStamps  -XX:+PrintGCDetails

Should give you some indication how often the young gen GC is
occurring.  If it's often, you can try increasing the size of the
young generation.  The option:

-Xloggc:<some file>

will dump this information to the specified file rather than sending
it to the standard error.

I've done this a few times with a variety of systems:  some times you
want to make the young gen bigger and some times you don't.

Steve
-- 
Stephen Green
http://thesearchguy.wordpress.com


geographic sharding . . . or not

2010-09-13 Thread Dennis Gearon
Think about THE big one - google.

(First, China for this example is avoided because much Chinese data is
ILLEGAL to be
provided for search outside of China)

If there is data generated by people in Europe, in various languages:
  1/ Is it stored close to where it is generated?
  2/ Are sharding and replication also close to where it is
generated?
  3/ How accessible IS that data to someone from the US who speaks one
of those languages?
  4/ How much is sharding and replication done AWAY from where data is
geographically generated?
  5/ What if a set of linked documents, from a relational database, has half of 
its documents in one language AND related to people/places/things in one 
country, and half in another country and its language? There's a parent record 
for the two sets, stored in the country of the user originating the parent/dual sets. 
 A/ Is the parent record replicated in both countries, so that searches 
finding the child records can easily get to the parent record, vs 
transatlantic/pacific fetches?
 B/ Any thoughts about machine translation of said parent record?
 

What are people's thoughts on making sites that cater to people
interested in web pages, etc in other countries? Any examples out
there?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


Our SOLR instance seems to be single-threading and therefore not taking advantage of its multi-proc host

2010-09-13 Thread David Crane

We are running SOLR 1.4.1 (Lucene 2.9.3) on a 2-CPU Linux host, but it seems
that only 1 CPU is ever being used. It almost seems like something is
single-threading inside the SOLR application. The CPU utilization is very
seldom over 0.9 even under load.

We are running on virtual Linux hosts and our other apps in the same cluster
are multi-threading w/o issue. Some more info on our stack and versions:

  Linux 2.6.16.33-xenU 
  Apache 2.2.3 
  Tomcat 6.0.16 
  Java SE Runtime Environment (build 1.6.0_10-ea-b11)

Has anyone else noticed this problem?

Might there be some SOLR config aspect to enable multi-threading that we're
missing? Any suggestions for troubleshooting?

Judging by SOLR's logs, we do see that multiple requests are being processed
simultaneously inside SOLR, so we do not believe we're sequentially feeding
requests to SOLR, i.e. bottle-necking things outside of SOLR.

Thanks,
David Crane
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Our-SOLR-instance-seems-to-be-single-threading-and-therefore-not-taking-advantage-of-its-multi-proc-t-tp1470282p1470282.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Field names

2010-09-13 Thread Simon Willnauer
On Tue, Sep 14, 2010 at 1:39 AM, Peter A. Kirk p...@alpha-solutions.dk wrote:
 Fantastic - that is exactly what I was looking for!

 But here is one thing I don't undertstand:

 If I call the url:
 http://localhost:8983/solr/admin/luke?numTerms=10&fl=name

 Some of the result looks like:

 <lst name="fields">
  <lst name="name">
    <lst name="topTerms">
      <int name="gb">18</int>

 Does this mean that the term gb occurs 18 times in the name field?
Yes that is the Doc Frequency of the term gb. Remember that deleted
/ updated documents and their terms contribute to the doc frequency
until they are expunged from the index. That either happens through a
segment merge in the background or due to an explicit call to
optimize.

 Because if I issue this search:
 http://localhost:8983/solr/select/?q=name:gb

 I get results like:
 <result name="response" numFound="9" start="0">
  <doc>

 So it only finds 9?
Since the gb term shows 18 occurrences throughout the index, I suspect
you updated your docs once without optimizing, and haven't indexed enough docs
for segments to be merged. Try to call optimize if you can afford it
and see if the doc-freq count goes back to 9.

simon

 What do the above results actually tell me?

 Thanks,
 Peter

 
 From: Ryan McKinley [ryan...@gmail.com]
 Sent: Tuesday, 14 September 2010 11:30
 To: solr-user@lucene.apache.org
 Subject: Re: Field names

 check:
 http://wiki.apache.org/solr/LukeRequestHandler



 On Mon, Sep 13, 2010 at 7:00 PM, Peter A. Kirk p...@alpha-solutions.dk 
 wrote:
 Hi

 is it possible to issue a query to solr, to get a list which contains all 
 the field names in the index?

  What about getting a list of the frequency of individual words in each field?

 thanks,
 Peter