Re: Slow Commits

2009-11-10 Thread Jim Murphy

Just an update to the list.  It appears that memory was the culprit.  I
attached a JMX console to the running Tomcat instance and monitored memory
usage.  Total used memory stayed at ~900MB until a commit, then jumped to my
Xmx setting of 1.2GB, where the "peak" flatlined and then fell back, likely
after an OOM exception.  I upped Xmx to 2GB and commits are now much better
- in the 1 minute range.
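
For anyone who wants to watch the same numbers without attaching a JMX
console, here's a minimal sketch using the standard MemoryMXBean (plain Java
5+); it reads the same heap counters the console graphs:

  import java.lang.management.ManagementFactory;
  import java.lang.management.MemoryMXBean;
  import java.lang.management.MemoryUsage;

  public class HeapWatch {
    public static void main(String[] args) throws InterruptedException {
      MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
      while (true) {
        MemoryUsage heap = mem.getHeapMemoryUsage();
        System.out.printf("heap used=%dMB max=%dMB%n",
            heap.getUsed() >> 20, heap.getMax() >> 20);
        Thread.sleep(5000);  // sample every 5 seconds
      }
    }
  }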


Jim



Jim Murphy wrote:
> 
> Thanks Jerome,
> 
> 
> 1. I have shut off autowarming by setting params to 0.
> 2. My JVM Settings: -Xmx1200m -Xms1200m -XX:-UseGCOverheadLimit
> -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=50
> 3. I am using autocommits - every 6 ms.  But the commit blocks all the
> master request threadpool threads as it spends 2-3 minutes committing.
> 4. I'm reluctant to NOT waitFlush since I don't want commits stacking up.
> 
> 
> Any other thoughts?
> 
> Thanks
> 
> Jim
> 
> 
> 
> 
> Jérôme Etévé wrote:
>> 
>> Hi, here are two things that can slow down commits:
>> 
>> 1) Autowarming the caches.
>> 2) The Java old generation object garbage collection.
>> 
>> You can try:
>> - Turning autowarming off (set autowarmCount="0"  in the caches
>> configuration)
>> - If you use the Sun JVM, use -XX:+UseConcMarkSweepGC to get less
>> blocking garbage collection.
>> 
>> You may also try to:
>> - Not waiting for the new searcher when you commit. The commit will then
>> be instant from your posting application's point of view (option
>> waitSearcher=false).
>> - Leaving the commits to the server (by setting autocommits in
>> solrconfig.xml). This is the best strategy if you've got lots of
>> concurrent processes that post.
>> 
>> Cheers.
>> 
>> Jerome.
>> 
>> 2009/10/28 Jim Murphy :
>>>
>>> Hi All,
>>>
>>> We have 8 solr shards, index is ~ 90M documents 190GB.  :)
>>>
>>> 4 of the shards have acceptable commit times - 30-60 seconds.  The other 4
>>> have drifted over the last couple of months to be up around 2-3 minutes.
>>> This is killing our write throughput, as you can imagine.
>>>
>>> I've included a log dump of a typical commit.  Note the large time gap
>>> (3:40) between the start commit log message and the onCommit log message.
>>> So I think warming issues are not relevant.
>>>
>>> Any ideas what to debug at this point?
>>>
>>> I'm about to issue an optimize and see where that goes.  It's been a while
>>> since I did that.
>>>
>>> Cheers,
>>>
>>> Jim
>>>
>>>
>>>
>>>
>>> Oct 28, 2009 11:47:02 AM org.apache.solr.update.DirectUpdateHandler2
>>> commit
>>> INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
>>> Oct 28, 2009 11:50:43 AM org.apache.solr.core.SolrDeletionPolicy
>>> onCommit
>>> INFO: SolrDeletionPolicy.onCommit: commits:num=2
>>>
>>> commit{dir=/master/data/index,segFN=segments_8us4,version=1228872482131,generation=413140,filenames=[segments_8us4,
>>> _alae.fnm, _ai
>>> lk.tis, _ala9.fnm, _ala9.fdx, _alac.fnm, _al9w_h.del, _alab.prx,
>>> _ala9.fdt,
>>> _a61p_b76.del, _alab.fnm, _al8x.frq, _al7i_2f.del, _akh1.tis,
>>> _add1.frq, _alae.tis, _alad_1.del, _alaa.fnm, _alad.nrm, _al9w.frq,
>>> _alae.tii, _ailk.tii, _add1.tis, _alac.tii, _akuu.tis, _add1.tii, _ail
>>> k.frq, _alac.tis, _7zfh.tii, _962y.tis, _ala7.frq, _ah91.prx, _akuu.tii,
>>> _alab_3.del, _ah91.fnm, _7zfh.tis, _ala8.frq, _962y.tii, _alae.pr
>>> x, _a61p.fdt, _akuu.frq, _a61p.fdx, _al7i.fdx, _al2o.tis, _al9w.tis,
>>> _ala7.fnm, _a61p.frq, _akzu.fnm, _9wzn.fnm, _akh1.prx, _al7i.fdt, _al
>>> a9_2.del, _962y.prx, _al7i.prx, _al9w.tii, _alaa_4.del, _al7i.frq,
>>> _ah91.tii, _ala8.nrm, _962y.fdt, _add1_62u.del, _alae.nrm, _ah91.tis, _
>>> 962y.fdx, _akh1.fnm, _al8x.prx, _al2o.tii, _ala7.fdx, _ala9.prx,
>>> _ala7.fdt,
>>> _al9w.prx, _ala8.prx, _akh1.tii, _al2o.fdx, _7zfh.frq, _alac_3
>>> .del, _akzu.tii, _akzu.fdt, _alad.fnm, _akzu.tis, _alab.nrm, _akzu.fdx,
>>> _al2o.fnm, _al2o.fdt, _alaa.prx, _alaa.nrm, _962y.fnm, _ala7.prx,
>>> _alaa.tis, _ailk.fdt, _akzu_8d.del, _alac.frq, _akzu.prx, _ala9.nrm,
>>> _ailk.prx, _ala9.tis, _alaa.tii, _alae.frq, _add1.fnm, _7zfh.prx, _al
>>> 9w.fnm, _ala9.tii, _ala9.frq, _962y.nrm, _alab.frq, _ala8.fdx,
>>> _al8x.fnm,
>>> _a61p.prx, _7zfh.fnm, _ala8.fdt, _ailk.fdx, _

Re: Slow Commits

2009-10-28 Thread Jim Murphy

Thanks Jerome,


1. I have shut off autowarming by setting params to 0.
2. My JVM Settings: -Xmx1200m -Xms1200m -XX:-UseGCOverheadLimit
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=50
3. I am using autocommits - every 6 ms.  But the commit blocks all the
master request threadpool threads as it spends 2-3 minutes committing.
4. I'm reluctant to NOT waitFlush since I don't want commits stacking up
(see the sketch below).
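
For what it's worth, here's a minimal SolrJ sketch of issuing the commit with
waitFlush left on but waitSearcher turned off, which is what Jerome suggested
(untested; assumes SolrJ 1.3's CommonsHttpSolrServer and a placeholder master
URL):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class CommitNoWait {
    public static void main(String[] args) throws Exception {
      // placeholder URL for the master core
      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
      // waitFlush=true, waitSearcher=false: the call returns before the new
      // searcher finishes warming, so the posting thread isn't held for minutes
      solr.commit(true, false);
    }
  }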


Any other thoughts?

Thanks

Jim




Jérôme Etévé wrote:
> 
> Hi, here are two things that can slow down commits:
> 
> 1) Autowarming the caches.
> 2) The Java old generation object garbage collection.
> 
> You can try:
> - Turning autowarming off (set autowarmCount="0"  in the caches
> configuration)
> - If you use the Sun JVM, use -XX:+UseConcMarkSweepGC to get less
> blocking garbage collection.
> 
> You may also try to:
> - Not waiting for the new searcher when you commit. The commit will then
> be instant from your posting application's point of view (option
> waitSearcher=false).
> - Leaving the commits to the server (by setting autocommits in
> solrconfig.xml). This is the best strategy if you've got lots of
> concurrent processes that post.
> 
> Cheers.
> 
> Jerome.
> 
> 2009/10/28 Jim Murphy :
>>
>> Hi All,
>>
>> We have 8 solr shards, index is ~ 90M documents 190GB.  :)
>>
>> 4 of the shards have acceptable commit times - 30-60 seconds.  The other 4
>> have drifted over the last couple of months to be up around 2-3 minutes.
>> This is killing our write throughput, as you can imagine.
>>
>> I've included a log dump of a typical commit.  Note the large time gap
>> (3:40) between the start commit log message and the onCommit log message.
>> So I think warming issues are not relevant.
>>
>> Any ideas what to debug at this point?
>>
>> I'm about to issue an optimize and see where that goes.  It's been a while
>> since I did that.
>>
>> Cheers,
>>
>> Jim
>>
>>
>>
>>
>> Oct 28, 2009 11:47:02 AM org.apache.solr.update.DirectUpdateHandler2
>> commit
>> INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
>> Oct 28, 2009 11:50:43 AM org.apache.solr.core.SolrDeletionPolicy onCommit
>> INFO: SolrDeletionPolicy.onCommit: commits:num=2
>>
>> commit{dir=/master/data/index,segFN=segments_8us4,version=1228872482131,generation=413140,filenames=[segments_8us4,
>> _alae.fnm, _ai
>> lk.tis, _ala9.fnm, _ala9.fdx, _alac.fnm, _al9w_h.del, _alab.prx,
>> _ala9.fdt,
>> _a61p_b76.del, _alab.fnm, _al8x.frq, _al7i_2f.del, _akh1.tis,
>> _add1.frq, _alae.tis, _alad_1.del, _alaa.fnm, _alad.nrm, _al9w.frq,
>> _alae.tii, _ailk.tii, _add1.tis, _alac.tii, _akuu.tis, _add1.tii, _ail
>> k.frq, _alac.tis, _7zfh.tii, _962y.tis, _ala7.frq, _ah91.prx, _akuu.tii,
>> _alab_3.del, _ah91.fnm, _7zfh.tis, _ala8.frq, _962y.tii, _alae.pr
>> x, _a61p.fdt, _akuu.frq, _a61p.fdx, _al7i.fdx, _al2o.tis, _al9w.tis,
>> _ala7.fnm, _a61p.frq, _akzu.fnm, _9wzn.fnm, _akh1.prx, _al7i.fdt, _al
>> a9_2.del, _962y.prx, _al7i.prx, _al9w.tii, _alaa_4.del, _al7i.frq,
>> _ah91.tii, _ala8.nrm, _962y.fdt, _add1_62u.del, _alae.nrm, _ah91.tis, _
>> 962y.fdx, _akh1.fnm, _al8x.prx, _al2o.tii, _ala7.fdx, _ala9.prx,
>> _ala7.fdt,
>> _al9w.prx, _ala8.prx, _akh1.tii, _al2o.fdx, _7zfh.frq, _alac_3
>> .del, _akzu.tii, _akzu.fdt, _alad.fnm, _akzu.tis, _alab.nrm, _akzu.fdx,
>> _al2o.fnm, _al2o.fdt, _alaa.prx, _alaa.nrm, _962y.fnm, _ala7.prx,
>> _alaa.tis, _ailk.fdt, _akzu_8d.del, _alac.frq, _akzu.prx, _ala9.nrm,
>> _ailk.prx, _ala9.tis, _alaa.tii, _alae.frq, _add1.fnm, _7zfh.prx, _al
>> 9w.fnm, _ala9.tii, _ala9.frq, _962y.nrm, _alab.frq, _ala8.fdx, _al8x.fnm,
>> _a61p.prx, _7zfh.fnm, _ala8.fdt, _ailk.fdx, _alaa.frq, _7zfh.fdx
>> , _al7i.tis, _ah91.fdt, _ailk.fnm, _9wzn_i0m.del, _ah91.fdx, _al7i.tii,
>> _ailk_24j.del, _alad.fdx, _al8x.tii, _alae.fdx, _add1.prx, _akuu.f
>> nm, _al8x.tis, _ah91.frq, _ala8.fnm, _7zfh.fdt, _alad.fdt, _alae_1.del,
>> _alae.fdt, _akzu.frq, _a61p.fnm, _9wzn.frq, _ala8.tii, _7zfh_1gsd.
>> del, _7zfh.nrm, _ala7_6.del, _a61p.tis, _9wzn.tii, _alad.frq, _alad.tii,
>> _akuu.fdt, _alab.tii, _ala8.tis, _962y_xgg.del, _akh1.frq, _akuu.
>> fdx, _alab.tis, _al7i.fnm, _alad.tis, _alac.nrm, _alab.fdx, _ala8_5.del,
>> _add1.fdx, _ala7.tii, _akuu_cc.del, _alab.fdt, _9wzn.prx, _alaa.f
>> dx, _al9w.fdt, _al2o.frq, _akh1_nf.del, _alac.prx, _akh1.fdx, _alaa.fdt,
>> _al9w.fdx, _al8x_17.del, _add1.fdt, _al2o.prx, _akh1.fdt, _alad.p
>> rx, _akuu.prx, _962y.frq, _al2o_66.del, _alac.fdt, _ala7.tis, _a61p.tii,
>> _alac.fdx, _al8x.fdt, _9wzn.tis,

Slow Commits

2009-10-28 Thread Jim Murphy

Hi All,

We have 8 solr shards, index is ~ 90M documents 190GB.  :)

4 of the shards have acceptable commit times - 30-60 seconds.  The other 4
have drifted over the last couple of months to be up around 2-3 minutes.  This
is killing our write throughput, as you can imagine.

I've included a log dump of a typical commit.  Note the large time gap
(3:40) between the start commit log message and the onCommit log message.
So I think warming issues are not relevant.

Any ideas what to debug at this point? 

I'm about to issue an optimize and see where that goes.  It's been a while
since I did that.

Cheers,

Jim 




Oct 28, 2009 11:47:02 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
Oct 28, 2009 11:50:43 AM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2

commit{dir=/master/data/index,segFN=segments_8us4,version=1228872482131,generation=413140,filenames=[segments_8us4,
_alae.fnm, _ai
lk.tis, _ala9.fnm, _ala9.fdx, _alac.fnm, _al9w_h.del, _alab.prx, _ala9.fdt,
_a61p_b76.del, _alab.fnm, _al8x.frq, _al7i_2f.del, _akh1.tis, 
_add1.frq, _alae.tis, _alad_1.del, _alaa.fnm, _alad.nrm, _al9w.frq,
_alae.tii, _ailk.tii, _add1.tis, _alac.tii, _akuu.tis, _add1.tii, _ail
k.frq, _alac.tis, _7zfh.tii, _962y.tis, _ala7.frq, _ah91.prx, _akuu.tii,
_alab_3.del, _ah91.fnm, _7zfh.tis, _ala8.frq, _962y.tii, _alae.pr
x, _a61p.fdt, _akuu.frq, _a61p.fdx, _al7i.fdx, _al2o.tis, _al9w.tis,
_ala7.fnm, _a61p.frq, _akzu.fnm, _9wzn.fnm, _akh1.prx, _al7i.fdt, _al
a9_2.del, _962y.prx, _al7i.prx, _al9w.tii, _alaa_4.del, _al7i.frq,
_ah91.tii, _ala8.nrm, _962y.fdt, _add1_62u.del, _alae.nrm, _ah91.tis, _
962y.fdx, _akh1.fnm, _al8x.prx, _al2o.tii, _ala7.fdx, _ala9.prx, _ala7.fdt,
_al9w.prx, _ala8.prx, _akh1.tii, _al2o.fdx, _7zfh.frq, _alac_3
.del, _akzu.tii, _akzu.fdt, _alad.fnm, _akzu.tis, _alab.nrm, _akzu.fdx,
_al2o.fnm, _al2o.fdt, _alaa.prx, _alaa.nrm, _962y.fnm, _ala7.prx, 
_alaa.tis, _ailk.fdt, _akzu_8d.del, _alac.frq, _akzu.prx, _ala9.nrm,
_ailk.prx, _ala9.tis, _alaa.tii, _alae.frq, _add1.fnm, _7zfh.prx, _al
9w.fnm, _ala9.tii, _ala9.frq, _962y.nrm, _alab.frq, _ala8.fdx, _al8x.fnm,
_a61p.prx, _7zfh.fnm, _ala8.fdt, _ailk.fdx, _alaa.frq, _7zfh.fdx
, _al7i.tis, _ah91.fdt, _ailk.fnm, _9wzn_i0m.del, _ah91.fdx, _al7i.tii,
_ailk_24j.del, _alad.fdx, _al8x.tii, _alae.fdx, _add1.prx, _akuu.f
nm, _al8x.tis, _ah91.frq, _ala8.fnm, _7zfh.fdt, _alad.fdt, _alae_1.del,
_alae.fdt, _akzu.frq, _a61p.fnm, _9wzn.frq, _ala8.tii, _7zfh_1gsd.
del, _7zfh.nrm, _ala7_6.del, _a61p.tis, _9wzn.tii, _alad.frq, _alad.tii,
_akuu.fdt, _alab.tii, _ala8.tis, _962y_xgg.del, _akh1.frq, _akuu.
fdx, _alab.tis, _al7i.fnm, _alad.tis, _alac.nrm, _alab.fdx, _ala8_5.del,
_add1.fdx, _ala7.tii, _akuu_cc.del, _alab.fdt, _9wzn.prx, _alaa.f
dx, _al9w.fdt, _al2o.frq, _akh1_nf.del, _alac.prx, _akh1.fdx, _alaa.fdt,
_al9w.fdx, _al8x_17.del, _add1.fdt, _al2o.prx, _akh1.fdt, _alad.p
rx, _akuu.prx, _962y.frq, _al2o_66.del, _alac.fdt, _ala7.tis, _a61p.tii,
_alac.fdx, _al8x.fdt, _9wzn.tis, _9wzn.fdt, _al8x.fdx, _9wzn.fdx,
 _ah91_35l.del]

commit{dir=/master/data/index,segFN=segments_8us5,version=1228872482132,generation=413141,filenames=[_ala9.fnm,
_alaa_5.del, _alab
.fnm, _962y_xgh.del, _al8x.frq, _akh1.tis, _add1.frq, _alae.tis,
_7zfh_1gse.del, _alad.nrm, _alae.tii, _akuu.tis, _ah91_35m.del, _ailk.frq
, _7zfh.tii, _962y.tis, _akuu.tii, _ah91.prx, _7zfh.tis, _ala8.frq,
_962y.tii, _ala7.fnm, _akzu.fnm, _9wzn.fnm, _ala9_2.del, _ala8.nrm, _a
laf.fnm, _alae.nrm, _ala9.prx, _ailk_24k.del, _alaf.prx, _al9w.prx,
_ala8.prx, _akh1.tii, _akzu.tii, _akzu.tis, _alad.fnm, _al2o.fnm, _962
y.fnm, _al8x_18.del, _ala7_7.del, _alaa.tis, _ala9.nrm, _ala9.tis,
_alaa.tii, _962y.nrm, _ala9.tii, _a61p.prx, _add1_62v.del, _al8x.fnm, _
7zfh.fnm, _al7i_2g.del, _ailk.fnm, _al8x.tii, _al8x.tis, _ala8.fnm,
_akzu.frq, _9wzn.frq, _7zfh.nrm, _akuu.fdt, _alad.tii, _akuu.fdx, _aku
u_cd.del, _a61p_b77.del, _alad.tis, _al2o_67.del, _add1.fdx, _9wzn.prx,
_al9w.fdt, _add1.fdt, _al9w.fdx, _akuu.prx, _962y.frq, _9wzn.fdt, 
_alab_4.del, _9wzn.fdx, segments_8us5, _alac_4.del, _alae.fnm, _ailk.tis,
_ala9.fdx, _alac.fnm, _ala9.fdt, _alab.prx, _alae_2.del, _alaa.f
nm, _alad_1.del, _al9w.frq, _ailk.tii, _add1.tis, _alac.tii, _add1.tii,
_alac.tis, _ala7.frq, _ah91.fnm, _a61p.fdt, _alae.prx, _akuu.frq, 
_a61p.fdx, _akh1_ng.del, _al7i.fdx, _al2o.tis, _al9w.tis, _a61p.frq,
_akh1.prx, _9wzn_i0n.del, _al7i.fdt, _al7i.prx, _962y.prx, _al9w.tii,
 _al7i.frq, _ah91.tii, _962y.fdt, _akh1.fnm, _962y.fdx, _ah91.tis,
_al8x.prx, _al2o.tii, _ala7.fdx, _ala7.fdt, _alaf.fdx, _alaf.fdt, _al2o
.fdx, _7zfh.frq, _akzu.fdt, _alaf.nrm, _akzu.fdx, _alab.nrm, _al2o.fdt,
_alaa.prx, _alaa.nrm, _ala7.prx, _ailk.fdt, _akzu.prx, _alac.frq, 
_ailk.prx, _alaf.tii, _alaf_1.del, _alae.frq, _add1.fnm, _alaf.tis,
_7zfh.prx, _al9w.fnm, _ala9.frq, _alab.frq, _ala8.fdx, _akzu_8e.del, _
ala8.fdt, _ailk.fdx, _alaa.frq, _al7i.tis, _7zfh.fdx, _al9w_i.del,
_ah91.fdt, _a

Re: Autocommit blocking adds? AutoCommit Speedup?

2009-05-08 Thread Jim Murphy



Yonik Seeley-2 wrote:
> 
> ...your code snippit elided and edited below ...
> 



Don't take this code as correct (or even compiling), but is this the essence?
I moved shared access to the writer inside the read lock and kept the other,
non-commit bits under the write lock.  I'd need to rethink the locking in a
more fundamental way, but is this close to the idea?



  public void commit(CommitUpdateCommand cmd) throws IOException {

    if (cmd.optimize) {
      optimizeCommands.incrementAndGet();
    } else {
      commitCommands.incrementAndGet();
    }

    Future[] waitSearcher = null;
    if (cmd.waitSearcher) {
      waitSearcher = new Future[1];
    }

    boolean error = true;

    // exclusive (iwCommit) lock just for the optimize case
    iwCommit.lock();
    try {
      log.info("start " + cmd);

      if (cmd.optimize) {
        closeSearcher();
        openWriter();
        writer.optimize(cmd.maxOptimizeSegments);
      }
    } finally {
      iwCommit.unlock();
    }

    // shared (iwAccess) lock around the actual Lucene commit so adds can proceed
    iwAccess.lock();
    try {
      writer.commit();
    } finally {
      iwAccess.unlock();
    }

    // back to the exclusive lock for callbacks and opening the new searcher
    iwCommit.lock();
    try {
      callPostCommitCallbacks();
      if (cmd.optimize) {
        callPostOptimizeCallbacks();
      }
      // open a new searcher in the sync block to avoid opening it
      // after a deleteByQuery changed the index, or in between deletes
      // and adds of another commit being done.
      core.getSearcher(true, false, waitSearcher);

      // reset commit tracking
      tracker.didCommit();

      log.info("end_commit_flush");

      error = false;
    } finally {
      iwCommit.unlock();
      addCommands.set(0);
      deleteByIdCommands.set(0);
      deleteByQueryCommands.set(0);
      numErrors.set(error ? 1 : 0);
    }

    // if we are supposed to wait for the searcher to be registered, then we
    // should do it outside of the synchronized block so that other update
    // operations can proceed.
    if (waitSearcher != null && waitSearcher[0] != null) {
      try {
        waitSearcher[0].get();
      } catch (InterruptedException e) {
        SolrException.log(log, e);
      } catch (ExecutionException e) {
        SolrException.log(log, e);
      }
    }
  }



-- 
View this message in context: 
http://www.nabble.com/Autocommit-blocking-adds---AutoCommit-Speedup--tp23435224p23454419.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Autocommit blocking adds? AutoCommit Speedup?

2009-05-08 Thread Jim Murphy

Any pointers to this newer, more concurrent behavior in Lucene?  I can try an
experiment where I downgrade the iwCommit lock to the iwAccess lock to allow
updates to happen during a commit.

Would you expect that to work?

Thanks for bootstrapping me on this. 

Jim



Yonik Seeley-2 wrote:
> 
> On Thu, May 7, 2009 at 8:37 PM, Jim Murphy  wrote:
>> Interesting.  So is there a JIRA ticket open for this already? Any chance
>> of
>> getting it into 1.4?
> 
> No ticket currently open, but IMO it could make it for 1.4.
> 
>> It's seriously kicking our butts right now.  We write
>> into our masters with ~50ms response times until we hit the autocommit,
>> then add/update response time is 10-30 seconds.  Ouch.
> 
> It's probably been made a little worse lately since Lucene now does
> fsync on index files before writing the segments file that points to
> those files.  A necessary evil to prevent index corruption.
> 
>> I'd be willing to work on submitting a patch given a better understanding
>> of
>> the issue.
> 
> Great, go for it!
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Autocommit-blocking-adds---AutoCommit-Speedup--tp23435224p23452011.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Autocommit blocking adds? AutoCommit Speedup?

2009-05-07 Thread Jim Murphy

Interesting.  So is there a JIRA ticket open for this already?  Any chance of
getting it into 1.4?  It's seriously kicking our butts right now.  We write
into our masters with ~50ms response times until we hit the autocommit, then
add/update response time is 10-30 seconds.  Ouch.

I'd be willing to work on submitting a patch given a better understanding of
the issue. 

Jim
-- 
View this message in context: 
http://www.nabble.com/Autocommit-blocking-adds---AutoCommit-Speedup--tp23435224p23438134.html
Sent from the Solr - User mailing list archive at Nabble.com.



Autocommit blocking adds? AutoCommit Speedup?

2009-05-07 Thread Jim Murphy

Question 1: I see in DirectUpdateHandler2 that there is a read/write lock
used between addDoc and commit.

My mental model of the process was this: clients can add/update documents
until the autocommit threshold is hit.  At that point the commit tracker
schedules a background commit.  The commit would run and NOT BLOCK
subsequent adds.  Clearly that's not happening, because when the autocommit
background thread runs it takes the iwCommit lock, blocking anyone in addDoc
trying to take the iwAccess lock.
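
For reference, here is a stripped-down sketch of that locking pattern as I
read it out of DirectUpdateHandler2 - illustrative only, not the actual Solr
source:

  import java.util.concurrent.locks.Lock;
  import java.util.concurrent.locks.ReadWriteLock;
  import java.util.concurrent.locks.ReentrantReadWriteLock;

  public class LockSketch {
    private final ReadWriteLock rwl = new ReentrantReadWriteLock();
    private final Lock iwAccess = rwl.readLock();   // taken by addDoc - many at once
    private final Lock iwCommit = rwl.writeLock();  // taken by commit - exclusive

    void addDoc() {
      iwAccess.lock();  // blocks only while a commit holds iwCommit
      try {
        // write the document to the IndexWriter
      } finally {
        iwAccess.unlock();
      }
    }

    void commit() {
      iwCommit.lock();  // waits for all in-flight adds, then blocks new ones
      try {
        // flush segments, run commit callbacks, open the new searcher
      } finally {
        iwCommit.unlock();
      }
    }
  }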

Is this just the way it is, or is it possible to configure Solr to process
the pending documents in the background, queuing new documents in memory as
before?

Question 2: I ask this question because autocommits are taking a LONG time
to complete, like 10-25 seconds.  I have a 40M document index that is many
tens of GBs.  What can I do to speed this up?

Thanks

Jim
-- 
View this message in context: 
http://www.nabble.com/Autocommit-blocking-adds---AutoCommit-Speedup--tp23435224p23435224.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Tomcat 6 HTTP Connector Threads all blocked

2009-03-01 Thread Jim Murphy
pplicationFilterChain.doFilter(javax.servlet.ServletRequest,
javax.servlet.ServletResponse) @bci=101, line=206 (Interpreted frame)
 -
org.apache.catalina.core.StandardWrapperValve.invoke(org.apache.catalina.connector.Request,
org.apache.catalina.connector.Response) @bci=804, line=233 (Interpreted
frame)
 -
org.apache.catalina.core.StandardContextValve.invoke(org.apache.catalina.connector.Request,
org.apache.catalina.connector.Response) @bci=285, line=175 (Interpreted
frame)
 -
org.apache.catalina.core.StandardHostValve.invoke(org.apache.catalina.connector.Request,
org.apache.catalina.connector.Response) @bci=64, line=128 (Interpreted
frame)
 -
org.apache.catalina.valves.ErrorReportValve.invoke(org.apache.catalina.connector.Request,
org.apache.catalina.connector.Response) @bci=6, line=102 (Interpreted frame)
 -
org.apache.catalina.valves.AccessLogValve.invoke(org.apache.catalina.connector.Request,
org.apache.catalina.connector.Response) @bci=24, line=563 (Interpreted
frame)
 -
org.apache.catalina.core.StandardEngineValve.invoke(org.apache.catalina.connector.Request,
org.apache.catalina.connector.Response) @bci=42, line=109 (Interpreted
frame)
 -
org.apache.catalina.connector.CoyoteAdapter.service(org.apache.coyote.Request,
org.apache.coyote.Response) @bci=157, line=263 (Interpreted frame)
 - org.apache.coyote.http11.Http11Processor.process(java.net.Socket)
@bci=432, line=844 (Interpreted frame)
 -
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(java.net.Socket)
@bci=82, line=584 (Interpreted frame)
 - org.apache.tomcat.util.net.JIoEndpoint$Worker.run() @bci=41, line=447
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=619 (Interpreted frame)





Yonik Seeley-2 wrote:
> 
> On Sun, Mar 1, 2009 at 10:32 AM, Jim Murphy  wrote:
>> I should have said - Tomcat is hosting 2 webapps, a Solr 1.3 master and a
>> slave, as separate web apps.
> 
> Given that the socket writes are blocked, it appears that whatever is
> supposed to be reading the other endpoint isn't doing its job.
> 
> Are you using java-based replication?  Do you know if these sockets
> that are blocking are from client queries or from replication
> requests?  Splitting up the master and slave into separate JVMs might
> help shed some light on the situation.
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Tomcat-6-HTTP-Connector-Threads-all-blocked-tp22274107p22278035.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Tomcat 6 HTTP Connector Threads all blocked

2009-03-01 Thread Jim Murphy

I should have said - Tomcat is hosting 2 webapps, a Solr 1.3 master and a
slave, as separate web apps.

Looking for anything to try.

Jim



Jim Murphy wrote:
> 
> I have a 100 thread HTTP connector pool that for some reason ends up with
> all its threads blocked here:
> 
> java.net.SocketOutputStream.socketWrite0(Native Method)
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:737)
> org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
> org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:349)
> org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:761)
> org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:126)
> org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:570)
> org.apache.coyote.Response.doWrite(Response.java:560)
> org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:353)
> org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
> org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:309)
> org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:273)
> org.apache.catalina.connector.Response.finishResponse(Response.java:486)
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:287)
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
> java.lang.Thread.run(Thread.java:619)
> 
> 
> Any hints on what to try to diagnose?
> 
> Regards
> 
> Jim
> 

-- 
View this message in context: 
http://www.nabble.com/Tomcat-6-HTTP-Connector-Threads-all-blocked-tp22274107p22274129.html
Sent from the Solr - User mailing list archive at Nabble.com.



Tomcat 6 HTTP Connector Threads all blocked

2009-03-01 Thread Jim Murphy

I have a 100 thread HTTP connector pool that for some reason ends up with all
its threads blocked here:

java.net.SocketOutputStream.socketWrite0(Native Method)
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
java.net.SocketOutputStream.write(SocketOutputStream.java:136)
org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:737)
org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:349)
org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:761)
org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:126)
org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:570)
org.apache.coyote.Response.doWrite(Response.java:560)
org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:353)
org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:309)
org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:273)
org.apache.catalina.connector.Response.finishResponse(Response.java:486)
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:287)
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
java.lang.Thread.run(Thread.java:619)


Any hints on what to try to diagnose?

Regards

Jim
-- 
View this message in context: 
http://www.nabble.com/Tomcat-6-HTTP-Connector-Threads-all-blocked-tp22274107p22274107.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solrconfig clarification: ColdSearcher/MaxWarmingSearcher

2008-12-14 Thread Jim Murphy

Thanks for the clarification and for untangling my questions. :)

I'm in the process of finding out why our snapshot installs take so long to
commit and didn't feel so confident about my settings, thanks.

In terms of long snapshot commits - I've isolated it to long warming times.
But since the warming query I use has the same basic layout as "the" query
we do at runtime, I'm not sure what to do.  At the moment I'm trying to
isolate where the time is spent - whether it's just in pre-allocating large
arrays and other data structures like the FieldSortedHitQueue - because my
queries tend to be date sorted...

Index Facts:

1. 22M documents
2. Snapshot installs to slave: 5 minutes
3. Warmup times: ~200-400 seconds hence the backup on warmers

The problem with the backup searchers is that I run my JVM out of heap space
if more than one searcher is warming up in the background.

Continuing to profile commit/warming queries.  Any helpful hints would be
much appreciated.


Cheers,

Jim
-- 
View this message in context: 
http://www.nabble.com/Solrconfig-clarification%3A-ColdSearcher-MaxWarmingSearcher-tp20904462p21003725.html
Sent from the Solr - User mailing list archive at Nabble.com.



Solrconfig clarification: ColdSearcher/MaxWarmingSearcher

2008-12-08 Thread Jim Murphy

I have a cluster of Solr Master/Slaves.  We write to the master and replicate
to the slaves via rsync.

Master:

1. Replication is every 5 minutes.  
2. Inserting many 100's docs per minute
3. Index is: 23 million documents
4. commits are every 30 seconds

Slave:

1. Pre-warmed after rsync snapshot takes ~50 seconds
2. many queries per second


So given that, how should I set up the following searcher configs?

<useColdSearcher>false</useColdSearcher>

<maxWarmingSearchers>5</maxWarmingSearchers>

Here's what I'm thinking:

For the Master:

I don't care about searchers - we do no autowarming and never query the
master - so should this be 1, or 5, or does it even matter here?

Current settings:

useColdSearcher: true - why do I care?
maxWarmingSearchers: 1 - because I can't see why I would ever have more than
one, but I'm concerned since the docs advise otherwise for high-throughput
masters, which applies in my case.

For the Slave:
Again, it seems we should get 1 new searcher every 5 minutes, but there would
be 2 existing for ~50 seconds as the second one autowarms.

useColdSearcher: false - wait for the warming to do the heavy lifting
maxWarmingSearchers: 1 - again, always use just one?


Thanks for any insights into these


Jim
-- 
View this message in context: 
http://www.nabble.com/Solrconfig-clarification%3A-ColdSearcher-MaxWarmingSearcher-tp20904462p20904462.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Index updates blocking readers: To Multicore or not?

2008-10-22 Thread Jim Murphy

We shred the RSS into individual items, then create Solr XML documents to
insert.  Solr is an easy choice for us over straight Lucene since it adds
the server infrastructure that we would mostly be writing ourselves - caching,
data types, master/slave replication.

We looked at nutch too - but that was before my time.

Jim



John Martyniak-3 wrote:
> 
> Thank you that is good information, as that is kind of way that I am  
> leaning.
> 
> So when you fetch the content from RSS, does that get rendered to an  
> XML document that Solr indexes?
> 
> Also what where a couple of decision points for using Solr as opposed  
> to using Nutch, or even straight Lucene?
> 
> -John
> 
> 
> 
> On Oct 22, 2008, at 11:22 AM, Jim Murphy wrote:
> 
>>
>> We index RSS content using our own home-grown distributed spiders - not
>> using Nutch.  We use ruby processes to do the feed fetching and XML
>> shredding, and Amazon SQS to queue up work packets to insert into our Solr
>> cluster.
>>
>> Sorry can't be of more help.
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/Index-updates-blocking-readers%3A-To-Multicore-or-not--tp19843098p20113143.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Index-updates-blocking-readers%3A-To-Multicore-or-not--tp19843098p20114697.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Index updates blocking readers: To Multicore or not?

2008-10-22 Thread Jim Murphy

We index RSS content using our own home-grown distributed spiders - not using
Nutch.  We use ruby processes to do the feed fetching and XML shredding, and
Amazon SQS to queue up work packets to insert into our Solr cluster.

Sorry can't be of more help.

-- 
View this message in context: 
http://www.nabble.com/Index-updates-blocking-readers%3A-To-Multicore-or-not--tp19843098p20113143.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Index updates blocking readers: To Multicore or not?

2008-10-22 Thread Jim Murphy

Thanks Yonik, 

I have more information...

1. We do indeed have large indexes: 40GB on disk, 30M documents - and that's
just one test server; we have 8 of these in parallel.

2. The performance problem I was seeing followed replication and the first
query on a new searcher.  It turns out we didn't configure index warming
queries very well, so we replaced the various "solr rocks" type queries with
one that was better suited to our data - and saw no improvement.  The problem
was that replication completed and a new searcher was created and registered,
but the first query would take 10-20 seconds to complete.  Thereafter it took
<200 milliseconds for similar non-cached queries.

The profiler pointed us to building the FieldSortedHitQueue as taking all the
time.  Our warming query did not include a sort, but our queries commonly do.
Once we added the sort parameter, our warming query started taking the 10-20
seconds prior to registering the searcher.  After that, the first query on
the new searcher took the expected 200ms.

LESSON LEARNED: warm your caches!  And if a sort is involved in your queries,
incorporate that sort in your warming query!  Add a warming query for each
kind of sort that you expect to do.
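
In solrconfig.xml this lives as a newSearcher/firstSearcher listener query;
just to illustrate the shape of it, here's the equivalent sorted query issued
through SolrJ (the field name and URL are placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class WarmSorted {
    public static void main(String[] args) throws Exception {
      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery q = new SolrQuery("*:*");  // same basic shape as our runtime query
      // the sort is what forces the FieldSortedHitQueue / field cache to be built
      q.addSortField("published_date", SolrQuery.ORDER.desc);  // placeholder field
      solr.query(q);  // the first sorted query pays the 10-20 second cost up front
    }
  }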

 







Yonik Seeley wrote:
> 
> On Mon, Oct 6, 2008 at 2:10 PM, Jim Murphy <[EMAIL PROTECTED]> wrote:
>> We have a farm of several Master-Slave pairs all managing a single very
>> large
>> "logical" index sharded across the master-slaves.  We notice on the
>> slaves,
>> after an rsync update, as the index is being committed that all queries
>> are
>> blocked sometimes resulting in unacceptable service times.  I'm looking
>> at
>> ways we can manage these "update burps".
> 
> Updates should never block queries.
> What version of Solr are you using?
> Is it possible that your indexes are so big, opening a new index in
> the background causes enough of the old index to be flushed from OS
> cache, causing big slowdowns?
> 
> -Yonik
> 
> 
>> Question #1: Anything obvious I can tweak in the configuration to
>> mitigate
>> these multi-second blocking updates?  Our Indexes are 40GB, 20M documents
>> each.  RSync updates are every 5 minutes several hundred KB per update.
>>
>> Question #2: I'm considering setting up each slave with multiple Solr
>> cores.
>> The 2 indexes per instance would be nearly identical copies but "A" would
>> be
>> read from while "B" is being updated, then they would swap.  I'll have to
>> figure out how to rsync these 2 indexes properly but if I can get the
>> commits to happen to the offline index then I suspect my queries could
>> proceed unblocked.
>>
>> Is this the wrong tree to be barking up?  Any other thoughts?
>>
>> Thanks in advance,
>>
>> Jim
>>
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Index-updates-blocking-readers%3A-To-Multicore-or-not--tp19843098p19843098.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Index-updates-blocking-readers%3A-To-Multicore-or-not--tp19843098p20112546.html
Sent from the Solr - User mailing list archive at Nabble.com.



FileNotFoundException on slave after replication - script bug?

2008-10-22 Thread Jim Murphy

We're seeing strange behavior on one of our slave nodes after replication. 
When the new searcher is created we see FileNotFoundExceptions in the log
and the index is strangely invalid/corrupted.

We may have identified the root cause but wanted to run it by the community. 
We figure there is a bug in the snappuller shell script, line 181:

snap_name=`ssh -o StrictHostKeyChecking=no ${master_host} "ls
${master_data_dir}|grep 'snapshot\.'|grep -v wip|sort -r|head -1"` 

This line determines the directory name of the latest snapshot to download
to the slave from the master.  The problem with this line is that it can grab
the temporary work directory of a snapshot in progress.  Those temporary
directories are prefixed with "temp" and, as far as I can tell, should never
get pulled from the master, so it's easy to disambiguate them.  It seems that
this temp directory, if it exists, will be the newest one, so if present it
will be the one replicated: FAIL.

We've tweaked the line to exclude any directories starting with "temp":
snap_name=`ssh -o StrictHostKeyChecking=no ${master_host} "ls
${master_data_dir}|grep 'snapshot\.'|grep -v wip|grep -v temp|sort -r|head
-1"` 

This has fixed our local issue.  We can submit a patch, but wanted a quick
sanity check because I'm surprised it's not much more commonly seen.

Jim

-- 
View this message in context: 
http://www.nabble.com/FileNotFoundException-on-slave-after-replication---script-bug--tp20111313p20111313.html
Sent from the Solr - User mailing list archive at Nabble.com.



Index updates blocking readers: To Multicore or not?

2008-10-06 Thread Jim Murphy

We have a farm of several Master-Slave pairs all managing a single very large
"logical" index sharded across the master-slaves.  We notice on the slaves,
after an rsync update, as the index is being committed that all queries are
blocked sometimes resulting in unacceptable service times.  I'm looking at
ways we can manage these "update burps".

Question #1: Anything obvious I can tweak in the configuration to mitigate
these multi-second blocking updates?  Our Indexes are 40GB, 20M documents
each.  RSync updates are every 5 minutes several hundred KB per update. 

Question #2: I'm considering setting up each slave with multiple Solr cores.
The 2 indexes per instance would be nearly identical copies but "A" would be
read from while "B" is being updated, then they would swap.  I'll have to
figure out how to rsync these 2 indexes properly but if I can get the
commits to happen to the offline index then I suspect my queries could
proceed unblocked.  

Is this the wrong tree to be barking up?  Any other thoughts? 

Thanks in advance,

Jim



-- 
View this message in context: 
http://www.nabble.com/Index-updates-blocking-readers%3A-To-Multicore-or-not--tp19843098p19843098.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Calculated Unique Key Field

2008-10-06 Thread Jim Murphy

Thanks, Shalin. 



Shalin Shekhar Mangar wrote:
> 
> On Wed, Oct 1, 2008 at 12:08 AM, Jim Murphy <[EMAIL PROTECTED]> wrote:
> 
>>
>> Question1: Is this the best place to do this?
> 
> 
> This sounds like a job for
> http://wiki.apache.org/solr/UpdateRequestProcessor
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
> 
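
For completeness, here's roughly what that could look like - an untested
sketch against my reading of the Solr 1.3 UpdateRequestProcessor API, with
placeholder class and field names:

  import java.io.IOException;

  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.common.SolrInputField;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.request.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  public class PostGuidProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
        SolrQueryResponse rsp, UpdateRequestProcessor next) {
      return new PostGuidProcessor(next);
    }

    static class PostGuidProcessor extends UpdateRequestProcessor {
      PostGuidProcessor(UpdateRequestProcessor next) { super(next); }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.solrDoc;  // the incoming document (public field in 1.3)
        if (doc.getField("post_guid") == null) {  // "post_guid" is a placeholder name
          doc.addField("post_guid", computeGuid(doc));
        }
        super.processAdd(cmd);  // hand off to the rest of the chain
      }

      private String computeGuid(SolrInputDocument doc) {
        // placeholder: the real version would hash several identity fields
        SolrInputField url = doc.getField("url");  // hypothetical field
        return url == null ? "" : String.valueOf(url.getValue());
      }
    }
  }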

-- 
View this message in context: 
http://www.nabble.com/Calculated-Unique-Key-Field-tp19747955p19842973.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Calculated Unique Key Field

2008-09-30 Thread Jim Murphy

It may not be all that relevant but our Update handler extends from
DirectUpdateHandler2.
-- 
View this message in context: 
http://www.nabble.com/Calculated-Unique-Key-Field-tp19747955p19748032.html
Sent from the Solr - User mailing list archive at Nabble.com.



Calculated Unique Key Field

2008-09-30 Thread Jim Murphy

My unique key field is an MD5 hash of several other fields that represent
the identity of documents in my index.  We've been calculating this externally
and setting the key value in documents but have found recurring bugs as the
number and variety of inserting consumers has grown...

So I wanted to move to calculating these at "add" time.  We already have our
own UpdateHandler, extending from DirectUpdateHandler, so I extended its
addDoc method to do the hashing and field setting.  

Here's the implementation highlights:

String postGuid = ...;  // MD5 of the identity fields - computation elided here

// set the value - overwrite if already present
{
  SolrInputField postGuidField = doc.getField(POST_GUID_NAME);
  if (postGuidField != null)
  {
postGuidField.setValue(postGuid, DEFAULT_BOOST);
  }
  else
  {
doc.addField(POST_GUID_NAME, postGuid);
  }
}

{

  // add guid field to the lucene doc too - huh. 
  Document lucDoc = cmd.getLuceneDocument(schema);

  Field aiPostGuidField = lucDoc.getField(POST_GUID_NAME);
  if (aiPostGuidField != null)
  {
aiPostGuidField.setValue(postGuid);
  }
  else
  {
SchemaField aiPostGuidSchemaField = schema.getField(POST_GUID_NAME);
Field postGuidField = aiPostGuidSchemaField.createField(postGuid,
DEFAULT_BOOST);
lucDoc.add(postGuidField);
  }
}
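
The actual hash computation is elided above; as a rough illustration of what
it does, here's a plain MessageDigest version (the inputs are hypothetical -
in reality it's several identity field values concatenated):

  import java.security.MessageDigest;

  public class GuidHash {
    // hex-encoded MD5 over a few identity field values
    static String md5Guid(String... identityFields) throws Exception {
      MessageDigest md = MessageDigest.getInstance("MD5");
      for (String f : identityFields) {
        md.update(f.getBytes("UTF-8"));  // pin the charset
        md.update((byte) 0);             // separator so "ab"+"c" != "a"+"bc"
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest()) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    }

    public static void main(String[] args) throws Exception {
      System.out.println(md5Guid("http://example.com/feed", "some-item-id"));
    }
  }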


Question 1: Is this the best place to do this?
Question 2: Is there a way around adding it to both the SolrDocument and the
Lucene Document?

Thoughts?

Best regards,

Jim
-- 
View this message in context: 
http://www.nabble.com/Calculated-Unique-Key-Field-tp19747955p19747955.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Using Solr for Info Retreval not so much Search...

2008-07-30 Thread Jim Murphy

*Excellent* so a custom QueryComponent it is.

The Solr score doesn't factor in too much - our search needs are modest -
just does it contain the keyword (or variants, stems, etc.) or not.  So the
query trims down from ~100M to 10-100.  That way the more expensive
filtering operates on the smaller set, as you suggest.

I need to sort by one of my date fields or by the external rank.  The first is
easy.  The second is difficult, so I will have to query the external system
for all matching docs - but if it's on the reduced set it's manageable.

One remaining question: I'd like to include my external threshold value in
the document.  Any ideas?  Can I stuff a float field somewhere on the docs?

Thanks!

Jim

hossman wrote:
> 
> 
> : 1. Query the index for entries matching keyword.
> : 2. remove any entries that are below a threshold score from the external
> : system
> 
> what do you need to sort by? .. if it's the threshold score from your 
> external system, you have no way of avoiding a call out to your external 
> system for every matching doc ... if you want to sort by the "Solr Score" 
> then it should be fairly easy to write a SearchComponent that gets a 
> DocList and walks them in order removing anything that doesn't meet the 
> threshold (re-executing the query with a higher number of rows if it 
> exhausts the current DocList) until you've got enough to return to your 
> client.
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-Solr-for-Info-Retreval-not-so-much-Search...-tp18723102p18744997.html
Sent from the Solr - User mailing list archive at Nabble.com.



Understanding Filters

2008-07-30 Thread Jim Murphy

I'm still trying to filter my search results with external data.  I have ~100
million documents in the index.  I want to use the power of Lucene to knock
that index down to 10-100 with keyword searching and a few other regular
query terms.  With that smaller subset I'd like to apply a filter based on
making calls to an external system to further reduce that set to 5-20.

Looking at filter queries and Lucene search filters, it seems that they
iterate over the entire index to create a bitset of documents to be included
in the query.  This seems the inverse of my needs.  I can't make ~100 million
external calls to filter - I want Lucene to handle that heavy lifting.

I'm trying to figure out the right place to hook to let paging and caching
in Solr work as normal but drop out result documents based on that expensive
external call.

Thanks, and sorry for the repeat requests. 

Jim
-- 
View this message in context: 
http://www.nabble.com/Understanding-Filters-tp18742220p18742220.html
Sent from the Solr - User mailing list archive at Nabble.com.



Question about ValueSource and large datasets

2008-07-30 Thread Jim Murphy

I'm looking to incorporate an external calculation in Solr/Lucene search
results.  I'd like to write queries that filter and sort on the value of
this "virtual field".  The value of the field is actually calculated at
runtime based on a remote call to an external system.  My Solr queries will
include term queries to match keywords - nothing special, but I'd like to
filter and order results based on the virtual field as well.  

I started looking at a custom Field Type + ValueSource.  I add a field of
this "virtual field type" to the schema, and have the custom ValueSource
wired in to the field type.  I used the FileFloatSource example as
inspiration - seems ok - but 2 questions:

1. How do I query for my virtual field?  My ValueSource never seems to be
activated no matter what I query for.  Here are the relevant parts of my
schema - see any issues?  Any hints on what the query string should be?


...



2. How can I limit the number of external calls I need to make?  If I use
FunctionQuery syntax then my ValueSource is used.  But, a BIG but, I notice
that it is queried for field values for every document in the index.  My
index is 100 million documents but the typical result size is on the order of
tens.  I'd like to perform the external call on those tens, not on the entire
index every time.

// in my custom ValueSource:
public DocValues getValues(IndexReader reader) throws IOException
{
    final float[] arr = getCachedFloats(reader);
    return new DocValues()
    {
        public float floatVal(int doc) { ...called 100 million times... }
        ...

I like this approach a lot, but I'm getting the feeling that I want to hook
later in the query process - after the initial query (matching keywords) is
done and the document set is reduced from 100 million to tens.

Do I really want a filter query of some kind?  Or some other layer of
filtering? 

Thanks in advance,

Jim

-- 
View this message in context: 
http://www.nabble.com/Question-about-ValueSource-and-large-datasets-tp18733993p18733993.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Using Solr for Info Retreval not so much Search...

2008-07-29 Thread Jim Murphy

Thanks Walter,

My requirements are this:

1. Query the index for entries matching keyword.
2. remove any entries that are below a threshold score from the external
system


I'm looking at building a custom field type similar to ExternalFileField
that can dole out a ValueSource that calls my external system.

Jim


Walter Underwood wrote:
> 
> You might be able to split the ranking into a common score and
> a dynamic score. Return the results nearly the right order, then
> do a minimal reordering after. If you plan to move a result by
> a maximum of five positions, then you could fetch 15 results to
> show 10 results. That is far, far cheaper than fetching all
> results and ranking them all.
> 
> I wrote a description of the client-side part of this last year:
> 
> http://wunderwood.org/most_casual_observer/2007/04/progressive_reranking.htm
> l
> 
> wunder
> 
> On 7/29/08 5:59 PM, "Jim Murphy" <[EMAIL PROTECTED]> wrote:
> 
>> 
>> I figured that it would be - but the rankings are dynamically calculated.
>> I'd like to limit the number of calculations performed for this very
>> reason.
>> Still not sure if this approach will be better than naively filtering docs
>> after the query has happened.
>> 
>> Reading about ValueSource thanks...
>> 
>> Jim
>> 
>>  
>> 
>> Yonik Seeley wrote:
>>> 
>>> Calling out will be an order of magnitude (or two) slower compared to
>>> moving the rankings into Solr, but it is doable.  See ValueSource
>>> (it's used by FunctionQuery).
>>> 
>>> -Yonik
>>> 
>>> On Tue, Jul 29, 2008 at 8:23 PM, Jim Murphy <[EMAIL PROTECTED]>
>>> wrote:
>>>> 
>>>> I take it I can add my own functions that would take care of calling
>>>> out
>>>> to
>>>> my external ranking system?
>>>> 
>>>> Looking for docs on that...
>>>> 
>>>> Jim
>>>> 
>>>> 
>>>> Yonik Seeley wrote:
>>>>> 
>>>>> A function query might fit your needs... you could move some or all of
>>>>> your external ranking system into Solr.
>>>>> 
>>>>> -Yonik
>>>>> 
>>>>> On Tue, Jul 29, 2008 at 7:08 PM, Jim Murphy <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>> 
>>>>>> I need to store 100 million documents in our Solr instance and be
>>>>>> able
>>>>>> to
>>>>>> retrieve them with simple term queries - keyword matches.  I'm NOT
>>>>>> implementing a search application where documents are scored and
>>>>>> ranked...they either match the keywords or not.  Also, I have an
>>>>>> external
>>>>>> ranking system that I need to use to filter and order the search
>>>>>> results.
>>>>>> 
>>>>>> My requirements are for the very fast and reliable retrieval so I'm
>>>>>> trying
>>>>>> to figure a place to hook in or customize Solr/Lucene to just do the
>>>>>> simplest thing, reliably and fast.
>>>>>> 
>>>>>> 1. A naive approach would be to implement a handler, let the query
>>>>>> happen
>>>>>> normally then perform N lookups to my external scoring system then
>>>>>> filter
>>>>>> and sort the documents.  It seems I may be doing a lot of extra work
>>>>>> that
>>>>>> way, especially with paging results and who knows what I'd doing to
>>>>>> the
>>>>>> cache.
>>>>>> 
>>>>>> 2. Create a custom FieldType that is virtual and calls out to my
>>>>>> external
>>>>>> system? Then queries could be written to return all docs > my rank.
>>>>>> 
>>>>>> 3. Implement custom Query, Weight, Scorer (et al) implementations to
>>>>>> minimize the "Search Stuff" and just delegate calls to my external
>>>>>> ranking
>>>>>> system.
>>>>>> 
>>>>>> 4.  A filter of some kind?
>>>>>> 
>>>>>> 
>>>>>> I'd love to get a sanity check on any of these approaches or some
>>>>>> recommendations.
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> Jim
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Using-Solr-for-Info-Retreval-not-so-much-Search...-tp1
>>>> 8723102p18723877.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>> 
>>>> 
>>> 
>>> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-Solr-for-Info-Retreval-not-so-much-Search...-tp18723102p18724853.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Using Solr for Info Retreval not so much Search...

2008-07-29 Thread Jim Murphy

I figured that it would be - but the rankings are dynamically calculated.
I'd like to limit the number of calculations performed for this very reason.
Still not sure if this approach will be better than naively filtering docs
after the query has happened.

Reading about ValueSource thanks...  

Jim

 

Yonik Seeley wrote:
> 
> Calling out will be an order of magnitude (or two) slower compared to
> moving the rankings into Solr, but it is doable.  See ValueSource
> (it's used by FunctionQuery).
> 
> -Yonik
> 
> On Tue, Jul 29, 2008 at 8:23 PM, Jim Murphy <[EMAIL PROTECTED]> wrote:
>>
>> I take it I can add my own functions that would take care of calling out
>> to
>> my external ranking system?
>>
>> Looking for docs on that...
>>
>> Jim
>>
>>
>> Yonik Seeley wrote:
>>>
>>> A function query might fit your needs... you could move some or all of
>>> your external ranking system into Solr.
>>>
>>> -Yonik
>>>
>>> On Tue, Jul 29, 2008 at 7:08 PM, Jim Murphy <[EMAIL PROTECTED]>
>>> wrote:
>>>>
>>>> I need to store 100 million documents in our Solr instance and be able
>>>> to
>>>> retrieve them with simple term queries - keyword matches.  I'm NOT
>>>> implementing a search application where documents are scored and
>>>> ranked...they either match the keywords or not.  Also, I have an
>>>> external
>>>> ranking system that I need to use to filter and order the search
>>>> results.
>>>>
>>>> My requirements are for the very fast and reliable retrieval so I'm
>>>> trying
>>>> to figure a place to hook in or customize Solr/Lucene to just do the
>>>> simplest thing, reliably and fast.
>>>>
>>>> 1. A naive approach would be to implement a handler, let the query
>>>> happen
>>>> normally then perform N lookups to my external scoring system then
>>>> filter
>>>> and sort the documents.  It seems I may be doing a lot of extra work
>>>> that
>>>> way, especially with paging results and who knows what I'd doing to the
>>>> cache.
>>>>
>>>> 2. Create a custom FieldType that is virtual and calls out to my
>>>> external
>>>> system? Then queries could be written to return all docs > my rank.
>>>>
>>>> 3. Implement custom Query, Weight, Scorer (et al) implementations to
>>>> minimize the "Search Stuff" and just delegate calls to my external
>>>> ranking
>>>> system.
>>>>
>>>> 4.  A filter of some kind?
>>>>
>>>>
>>>> I'd love to get a sanity check on any of these approaches or some
>>>> recommendations.
>>>>
>>>> Thanks
>>>>
>>>> Jim
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Using-Solr-for-Info-Retreval-not-so-much-Search...-tp18723102p18723877.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-Solr-for-Info-Retreval-not-so-much-Search...-tp18723102p18724269.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Using Solr for Info Retreval not so much Search...

2008-07-29 Thread Jim Murphy

I take it I can add my own functions that would take care of calling out to
my external ranking system?

Looking for docs on that...

Jim


Yonik Seeley wrote:
> 
> A function query might fit your needs... you could move some or all of
> your external ranking system into Solr.
> 
> -Yonik
> 
> On Tue, Jul 29, 2008 at 7:08 PM, Jim Murphy <[EMAIL PROTECTED]> wrote:
>>
>> I need to store 100 million documents in our Solr instance and be able to
>> retrieve them with simple term queries - keyword matches.  I'm NOT
>> implementing a search application where documents are scored and
>> ranked...they either match the keywords or not.  Also, I have an external
>> ranking system that I need to use to filter and order the search results.
>>
>> My requirements are for the very fast and reliable retrieval so I'm
>> trying
>> to figure a place to hook in or customize Solr/Lucene to just do the
>> simplest thing, reliably and fast.
>>
>> 1. A naive approach would be to implement a handler, let the query happen
>> normally then perform N lookups to my external scoring system then filter
>> and sort the documents.  It seems I may be doing a lot of extra work that
>> way, especially with paging results and who knows what I'd be doing to the
>> cache.
>>
>> 2. Create a custom FieldType that is virtual and calls out to my external
>> system? Then queries could be written to return all docs > my rank.
>>
>> 3. Implement custom Query, Weight, Scorer (et al) implementations to
>> minimize the "Search Stuff" and just delegate calls to my external
>> ranking
>> system.
>>
>> 4.  A filter of some kind?
>>
>>
>> I'd love to get a sanity check on any of these approaches or some
>> recommendations.
>>
>> Thanks
>>
>> Jim
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-Solr-for-Info-Retreval-not-so-much-Search...-tp18723102p18723877.html
Sent from the Solr - User mailing list archive at Nabble.com.



Using Solr for Info Retreval not so much Search...

2008-07-29 Thread Jim Murphy

I need to store 100 million documents in our Solr instance and be able to
retrieve them with simple term queries - keyword matches.  I'm NOT
implementing a search application where documents are scored and
ranked...they either match the keywords or not.  Also, I have an external
ranking system that I need to use to filter and order the search results.

My requirements are for very fast and reliable retrieval, so I'm trying
to figure out a place to hook in or customize Solr/Lucene to just do the
simplest thing, reliably and fast.

1. A naive approach would be to implement a handler, let the query happen
normally, then perform N lookups to my external scoring system, then filter
and sort the documents.  It seems I may be doing a lot of extra work that
way, especially with paging results, and who knows what I'd be doing to the
cache.

2. Create a custom FieldType that is virtual and calls out to my external
system? Then queries could be written to return all docs > my rank.

3. Implement custom Query, Weight, Scorer (et al) implementations to
minimize the "Search Stuff" and just delegate calls to my external ranking
system.

4.  A filter of some kind?


I'd love to get a sanity check on any of these approaches or some
recommendations.

Thanks

Jim

-- 
View this message in context: 
http://www.nabble.com/Using-Solr-for-Info-Retreval-not-so-much-Search...-tp18723102p18723102.html
Sent from the Solr - User mailing list archive at Nabble.com.