Re: Problem adding new requesthandler to solr branch_3x

2011-03-09 Thread Paul Rogers
Hoss

many thanks for the reply

Paul

On 8 March 2011 19:45, Chris Hostetter hossman_luc...@fucit.org wrote:

 : 1.  Why the problem occurs (has something changed between 1.4.1 and 3x)?

 Various pieces of code dealing with config parsing have changed since
 1.4.1 to be better about verifying that configs are meaningful, and
 reporting errors when unexpected things are encountered.  i'm not sure of
 the specific change, but the underlying point is: if 1.4.1 wasn't giving
 you an error for that syntax, it's because it was completely ignoring it.


 -Hoss


LucidGaze Monitoring tool

2011-03-09 Thread Isan Fulia
Hi all,
Does anyone know what the m on the y-axis stands for in the req/sec graph for
the update handler?

-- 
Thanks & Regards,
Isan Fulia.


Re: NRT in Solr

2011-03-09 Thread stockii
i am using solr for NRT with this version of solr ...

Solr Specification Version: 4.0.0.2010.10.26.08.43.14
Solr Implementation Version: 4.0-2010-10-26_08-05-39 1027394 - hudson -
2010-10-26 08:43:14
Lucene Specification Version: 4.0-2010-10-26_08-05-39
Lucene Implementation Version: 4.0-2010-10-26_08-05-39 1027394 - 2010-10-26
08:43:44

is this version ready for NRT or not ? it works, but if it can work better i
am going to update solr ... 

thx 

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 4GB Xmx
- Solr2 for Update-Request  - delta every 2 Minutes - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/NRT-in-Solr-tp2652689p2654472.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr UIMA Wiki page

2011-03-09 Thread Tommaso Teofili
Hi all,
I just improved the Solr UIMA integration wiki page [1] so if anyone is
using it and/or has any feedback it'd be more than welcome.
Regards,
Tommaso

[1] : http://wiki.apache.org/solr/SolrUIMA


Re: NRT in Solr

2011-03-09 Thread stockii
question: http://wiki.apache.org/solr/NearRealtimeSearchTuning


'PERFORMANCE WARNING: Overlapping onDeckSearchers=x' 

i got this message. 
in my solrconfig.xml maxWarmingSearchers=4; if i set this to 1 or 2 i get an
exception. with 4 i get nothing but the Performance Warning. the
wiki article says that the best solution is to set maxWarmingSearchers to
1!!! how can this work ?

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/NRT-in-Solr-tp2652689p2654696.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: getting much double-Values from solr -- timeout

2011-03-09 Thread stockii
Are you using shards or have everything in same index? 
- shards == distributed search over several cores ? = yes, but not always;
in general not.

What problem did you experience with the StatsComponent?
- if i use stats on my 34 million doc index, no matter how many docs are
found, the sum takes a VERY long time.

How did you use it? 
- like in the wiki; i think the StatsComponent is not very flexible to use !? 


I think the right approach will be to optimize StatsComponent to do quick
sum() 
- how can i optimize this ? change the code of StatsComponent and create a
new solr build ? 

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/getting-much-double-Values-from-solr-timeout-tp2650981p2654721.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: getting much double-Values from solr -- timeout

2011-03-09 Thread stockii
i am using NRT, and the caches are not always warmed; i think this is probably
part of the problem !?

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/getting-much-double-Values-from-solr-timeout-tp2650981p2654725.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr UIMA Wiki page

2011-03-09 Thread Markus Jelsma
Great work!

On Wednesday 09 March 2011 11:20:41 Tommaso Teofili wrote:
 Hi all,
 I just improved the Solr UIMA integration wiki page [1] so if anyone is
 using it and/or has any feedback it'd be more than welcome.
 Regards,
 Tommaso
 
 [1] : http://wiki.apache.org/solr/SolrUIMA

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


NRT and warmupTime of filterCache

2011-03-09 Thread stockii
I tried to create an NRT setup like in the wiki but i got some problems with
autowarming and onDeckSearchers.

every minute i start a delta on one core, and the other core commits every
minute so the index can be searched.


the wiki says ... = 1 searcher and filterCache warmupCount=3600. with this
config i got an exception that no searcher is available ... so i cannot use
this config ...
my config is 4 searchers and warmupCount=3000 ... with these settings i get
the Performance Warning, but it works. BUT when the full 30 seconds (or
more) are needed to warm the searcher, i cannot ping my server during this
time and i get errors ...
does it make sense to decrease my warmupCount to 0 ??? 

how many searchers do i need for 7 cores ? 

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/NRT-and-warmupTime-of-filterCache-tp2654886p2654886.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: NRT in Solr

2011-03-09 Thread Markus Jelsma
maxWarmingSearchers=1 is good for current stable Solr versions where memory is 
important. Overlapping warming searchers can be extremely memory consuming. I 
don't know how cache warming behaves with NRT.
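
Why the overlap happens can be sketched with a toy model (plain Python, not
Solr code; the numbers are illustrative): if every commit opens a new searcher
and warming takes longer than the commit interval, several searchers end up
warming at once, each holding its own caches in memory.

```python
import math

def warming_searchers(commit_interval_s, warmup_time_s):
    """Roughly how many searchers are warming at any moment, if a
    commit opens a new searcher every commit_interval_s seconds and
    each searcher needs warmup_time_s seconds before it can serve."""
    return math.ceil(warmup_time_s / commit_interval_s)

# commit every 60s, warmup done in 30s  -> 1 searcher warming, no overlap
# commit every 60s, warmup takes 150s   -> 3 warming at once, which is
# what the 'Overlapping onDeckSearchers' warning flags
```

With maxWarmingSearchers=1 any overlap is rejected with an exception rather
than just a warning, so the options are to commit less often or to make
warming cheaper.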

On Wednesday 09 March 2011 11:27:39 stockii wrote:
 question: http://wiki.apache.org/solr/NearRealtimeSearchTuning
 
 
 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x'
 
 i got this message.
 in my solrconfig.xml maxWarmingSearchers=4; if i set this to 1 or 2 i get an
 exception. with 4 i get nothing but the Performance Warning. the
 wiki article says that the best solution is to set maxWarmingSearchers to
 1!!! how can this work ?
 
 -
 --- System
 
 
 One Server, 12 GB RAM, 2 Solr Instances, 7 Cores,
 1 Core with 31 Million Documents, other Cores < 100.000
 
 - Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
 - Solr2 for Update-Request  - delta every Minute - 4GB Xmx
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/NRT-in-Solr-tp2652689p2654696.html Sent
 from the Solr - User mailing list archive at Nabble.com.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: True master-master fail-over without data gaps

2011-03-09 Thread Michael Sokolov
Yes, I think this should be pushed upstream - insert a tee in the 
document stream so that all documents go to both masters.

Then use a load balancer to make requests of the masters.

The tee itself then becomes a possible single point of failure, but 
you didn't say anything about the architecture of the document feed.  Is 
that also fault-tolerant?


-Mike
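
A minimal sketch of that tee (hypothetical send callables stand in for the two
masters' update endpoints; this is not SolrJ):

```python
def tee_index(doc, masters, failed):
    """Send doc to every master; record (master_name, doc) in `failed`
    for any master whose send raises, so the document can be replayed
    once that master is healthy again instead of being lost."""
    for name, send in masters.items():
        try:
            send(doc)
        except Exception:
            failed.append((name, doc))
```

Recording failures instead of dropping them is what keeps the two masters in
step; the tee process itself still needs its own fail-over story, as noted
above.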

On 3/9/2011 1:06 AM, Jonathan Rochkind wrote:

I'd honestly think about buffering the incoming documents in some store that's 
actually made for fail-over persistence reliability, maybe CouchDB or 
something. And then that takes care of not losing anything, and the problem 
becomes how we make sure that our solr master indexes are kept in sync with the 
actual persistent store; which I'm still not sure about, but I'm thinking it's 
a simpler problem. The right tool for the right job, that kind of failover 
persistence is not solr's specialty.

From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
Sent: Tuesday, March 08, 2011 11:45 PM
To: solr-user@lucene.apache.org
Subject: True master-master fail-over without data gaps

Hello,

What are some common or good ways to handle indexing (master) fail-over?
Imagine you have a continuous stream of incoming documents that you have to
index without losing any of them (or with losing as few of them as possible).
How do you set up you masters?
In other words, you can't just have 2 masters where the secondary is the
Repeater (or Slave) of the primary master and replicates the index periodically:
you need to have 2 masters that are in sync at all times!
How do you achieve that?

* Do you just put N masters behind a LB VIP, configure them both to point to the
index on some shared storage (e.g. SAN), and count on the LB to fail-over to the
secondary master when the primary becomes unreachable?
If so, how do you deal with index locks?  You use the Native lock and count on
it disappearing when the primary master goes down?  That means you count on the
whole JVM process dying, which may not be the case...

* Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters
with 2 separate indices in sync, while making sure you write to only 1 of them
via LB VIP or otherwise?

* Or ...


This thread is on a similar topic, but is inconclusive:
   http://search-lucene.com/m/aOsyN15f1qd1

Here is another similar thread, but this one doesn't cover how 2 masters are
kept in sync at all times:
   http://search-lucene.com/m/aOsyN15f1qd1

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/





Re: Help -DIH (mail)

2011-03-09 Thread Matias Alonso
Hi Peter,

When I execute the commands you mentioned, nothing happened.
Below I show you the commands executed and their responses.
Sorry, but I don´t know how to enable the log; my jre is the default one.
Remember I´m running the example-DIH (trunk\solr\example\example-DIH\solr);
java -Dsolr.solr.home=./example-DIH/solr/ -jar start.jar.



Import:
http://localhost:8983/solr/mail/dataimport?command=status
http://localhost:8983/solr/mail/dataimport?command=full-import

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">15</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages"/>
  <str name="WARNING">
    This response format is experimental.  It is likely to change in the future.
  </str>
</response>



Status:
http://localhost:8983/solr/mail/dataimport?command=status
http://localhost:8983/solr/mail/dataimport?command=full-import


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">status</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages"/>
  <str name="WARNING">
    This response format is experimental.  It is likely to change in the future.
  </str>
</response>




Thank you for your help.

Matias.






2011/3/4 Peter Sturge peter.stu...@gmail.com

 Can you try this:

 Issue a full import command like this:

 http://localhost:8983/solr/dataimport?command=full-import
 http://localhost:8983/solr/db/dataimport?command=full-import

 (There is no core name here - if you're using a core name (db?), then add
 that in between solr/ and /dataimport)

 then, run:
 http://localhost:8983/solr/dataimport?command=status
 http://localhost:8983/solr/db/dataimport?command=full-import

 This will show the results of the previous import. Has it been rolled-back?
 If so, there might be something in the log if it's enabled (see your jre's
 lib/logging.properties file).
 (you won't see any errors unless you run the status command - that's where
 they're stored)

 HTH
 Peter




 On Sat, Mar 5, 2011 at 12:46 AM, Matias Alonso matiasgalo...@gmail.com
 wrote:

  I´m using the trunk.
 
  Thanks Peter for your concern!
 
  Matias.
 
 
 
  2011/3/4 Peter Sturge peter.stu...@gmail.com
 
   Hi Matias,
  
   What version of Solr are you using? Are you running any patches (maybe
   SOLR-2245)?
  
   Thanks,
   Peter
  
  
  
   On Fri, Mar 4, 2011 at 8:25 PM, Matias Alonso matiasgalo...@gmail.com
   wrote:
  
Hi Peter,
   
From DataImportHandler Development Console I made a full-import,
 but
didn´t work.
   
Now, I execute 
http://localhost:8983/solr/mail/dataimport?command=full-import; but
nothing
happened; no index; no errors.
   
thks...
   
Matias.
   
   
   
2011/3/4 Peter Sturge peter.stu...@gmail.com
   
 Hi Matias,



   
  
 
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport accesses
 the dataimport handler, but you need to tell it to do something by
 sending a command:

  
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport&command=full-import
 http://localhost:8983/solr/db/dataimport?command=full-import

 If you haven't already, have a look at:



   
  
 
 http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FDataImportHandler

 It gives very thorough and useful advice on getting the DIH
 working.

 Peter



 On Fri, Mar 4, 2011 at 6:59 PM, Matias Alonso 
  matiasgalo...@gmail.com
 wrote:

  Hi Peter,
 
  I tested with deltaFetch=false, but it doesn´t work :(
  I'm using DataImportHandler Development Console to index (
 
   
  http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport
 );
  I'm working with example-DIH.
 
  thks...
 
 
 
  2011/3/4 Peter Sturge peter.stu...@gmail.com
 
   Hi Matias,
  
   I haven't seen it in the posts, but I may have missed it --
 what
  is
the
   import command you're sending?
   Something like:
   http://localhost:8983/solr/db/dataimport?command=full-import
  
   Can you also test it with deltaFetch=false. I seem to
 remember
having
   some
   problems with delta in the MailEntityProcessor.
  
  
  
   On Fri, Mar 4, 2011 at 6:29 PM, Matias Alonso 
matiasgalo...@gmail.com
   wrote:
  
<dataConfig>
  <document>
    <entity name="email"
            user="myem...@gmail.com"
            password="mypassword"
            host="imap.gmail.com"
            fetchMailsSince="2011-01-01 00:00:00"
 

Re: NRT and warmupTime of filterCache

2011-03-09 Thread stockii
does it make sense to update solr to get SOLR-571 ???

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/NRT-and-warmupTime-of-filterCache-tp2654886p2655073.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: getting much double-Values from solr -- timeout

2011-03-09 Thread Jan Høydahl
You have a large index with tough performance requirements on one server.
I would analyze your system to see if it's got any bottlenecks.
Watch out for auto-warming taking so long that it does not finish before the next 
commit().
Watch out for too frequent commits.
Monitor memory usage (JConsole or similar) to find out if the right amount of RAM 
is allocated to each JVM.
How large is your index in terms of GB? It may very well be that you need even 
more RAM in the server to cache more of the index files in OS memory.

Try to stop the Update JVM and let only Search-JVM be active. This will free 
RAM for OS. Then see if performance increases.
Next, try an optimize() and then see if that makes a difference.

I'm not familiar with the implementation details of StatsComponent. But if your 
stats query is still slow after freeing RAM and optimize(), I would file a JIRA 
issue, and attach to that issue some detailed response XMLs with 
debugQuery=true&echoParams=all, to document exactly how you use it and how it 
performs. It may be possible to optimize the code.
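
A sketch of building such a request (the core URL and field name are made-up
placeholders; stats=true and stats.field enable the StatsComponent):

```python
from urllib.parse import urlencode

def stats_debug_url(base, field, q="*:*"):
    """Build a StatsComponent query with full debug output, the kind
    of response worth attaching to a JIRA issue."""
    params = {
        "q": q,
        "stats": "true",        # enable the StatsComponent
        "stats.field": field,   # field to compute sum/min/max over
        "debugQuery": "true",   # include timing/debug sections
        "echoParams": "all",    # echo every effective parameter back
    }
    return base + "?" + urlencode(params)
```

For example, stats_debug_url("http://localhost:8983/solr/select", "price")
yields a URL whose response documents exactly how the query was executed.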

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 9. mars 2011, at 11.39, stockii wrote:

 i am using NRT, and the caches are not always warmed; i think this is probably
 part of the problem !?
 
 -
 --- System 
 
 
 One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
 1 Core with 31 Million Documents, other Cores < 100.000
 
 - Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
 - Solr2 for Update-Request  - delta every Minute - 4GB Xmx
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/getting-much-double-Values-from-solr-timeout-tp2650981p2654725.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Help -DIH (mail)

2011-03-09 Thread Peter Sturge
Hi,

You've included some output in your message, so I presume something
*did* happen when you ran the 'status' command (but it might not be
what you wanted to happen :-)

If you run:
http://localhost:8983/solr/mail/dataimport?command=status

and you get something like this back:
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages"/>

It means that no full-import or delta-import has been run during the
life of the JVM Solr session.

You should try running:
   http://localhost:8983/solr/mail/dataimport?command=full-import

Then run:
   http://localhost:8983/solr/mail/dataimport?command=status

to see the status of the full-import (busy, idle, error, rolled back etc.)
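
That check can be scripted as well; a sketch that pulls the status value out
of the DIH response XML (the sample shape follows the responses quoted in
this thread):

```python
import xml.etree.ElementTree as ET

def dih_status(response_xml):
    """Return the text of the <str name="status"> element from a
    DataImportHandler response, e.g. 'idle' or 'busy'."""
    root = ET.fromstring(response_xml)
    for node in root.iter("str"):
        if node.get("name") == "status":
            return node.text
    return None
```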

You can enable java logging by editing your JRE's lib/logging.properties file.

Something like this should give you some log files:
handlers= java.util.logging.FileHandler
.level= INFO
java.util.logging.FileHandler.pattern = ./logs/mylogs%d.log
java.util.logging.FileHandler.level = INFO
java.util.logging.FileHandler.limit = 50
java.util.logging.FileHandler.count = 1
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

NOTE: Make sure the 'logs' folder exists (in your $cwd) before you
start, or you'll get an error.

HTH
Peter




On Wed, Mar 9, 2011 at 12:47 PM, Matias Alonso matiasgalo...@gmail.com wrote:
 Hi Peter,

 When I execute the commands you mentioned, nothing happened.
 Below I show you the commands executed and their responses.
 Sorry, but I don´t know how to enable the log; my jre is the default one.
 Remember I´m running the example-DIH (trunk\solr\example\example-DIH\solr);
 java -Dsolr.solr.home=./example-DIH/solr/ -jar start.jar.



 Import:
 http://localhost:8983/solr/mail/dataimport?command=status
 http://localhost:8983/solr/mail/dataimport?command=full-import

 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">15</int>
   </lst>
   <lst name="initArgs">
     <lst name="defaults">
       <str name="config">data-config.xml</str>
     </lst>
   </lst>
   <str name="command">full-import</str>
   <str name="status">idle</str>
   <str name="importResponse"/>
   <lst name="statusMessages"/>
   <str name="WARNING">
     This response format is experimental.  It is likely to change in the future.
   </str>
 </response>



 Status:
 http://localhost:8983/solr/mail/dataimport?command=status
 http://localhost:8983/solr/mail/dataimport?command=full-import


 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">0</int>
   </lst>
   <lst name="initArgs">
     <lst name="defaults">
       <str name="config">data-config.xml</str>
     </lst>
   </lst>
   <str name="command">status</str>
   <str name="status">idle</str>
   <str name="importResponse"/>
   <lst name="statusMessages"/>
   <str name="WARNING">
     This response format is experimental.  It is likely to change in the future.
   </str>
 </response>




 Thank you for your help.

 Matias.






 2011/3/4 Peter Sturge peter.stu...@gmail.com

 Can you try this:

 Issue a full import command like this:

 http://localhost:8983/solr/dataimport?command=full-import
 http://localhost:8983/solr/db/dataimport?command=full-import

 (There is no core name here - if you're using a core name (db?), then add
 that in between solr/ and /dataimport)

 then, run:
 http://localhost:8983/solr/dataimport?command=status
 http://localhost:8983/solr/db/dataimport?command=full-import

 This will show the results of the previous import. Has it been rolled-back?
 If so, there might be something in the log if it's enabled (see your jre's
 lib/logging.properties file).
 (you won't see any errors unless you run the status command - that's where
 they're stored)

 HTH
 Peter




 On Sat, Mar 5, 2011 at 12:46 AM, Matias Alonso matiasgalo...@gmail.com
 wrote:

  I´m using the trunk.
 
  Thanks Peter for your concern!
 
  Matias.
 
 
 
  2011/3/4 Peter Sturge peter.stu...@gmail.com
 
   Hi Matias,
  
   What version of Solr are you using? Are you running any patches (maybe
   SOLR-2245)?
  
   Thanks,
   Peter
  
  
  
   On Fri, Mar 4, 2011 at 8:25 PM, Matias Alonso matiasgalo...@gmail.com
   wrote:
  
Hi Peter,
   
From DataImportHandler Development Console I made a full-import,
 but
didn´t work.
   
Now, I execute 
http://localhost:8983/solr/mail/dataimport?command=full-import; but
nothing
happened; no index; no errors.
   
thks...
   
Matias.
   
   
   
2011/3/4 Peter Sturge peter.stu...@gmail.com
   
 Hi Matias,



   
  
 
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport accesses
 the dataimport handler, but you need to tell it to do something by
 sending a command:

  
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport&command=full-import
 http://localhost:8983/solr/db/dataimport?command=full-import

 If you haven't already, have a look at:



   
  
 
 

Re: NRT in Solr

2011-03-09 Thread Jason Rutherglen
Jae,

NRT hasn't been implemented in Solr as of yet, I think partially
because major features such as replication, caching, and uninverted
faceting suddenly are no longer viable, eg, it's another round of
testing etc.  It's doable, however I think the best approach is a
separate request call path, to avoid altering the current [working]
API.

On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo jaejo...@gmail.com wrote:
 Hi,
 Is NRT in Solr 4.0 from trunk? I have checked it out from trunk, but could not
 find the configuration for NRT.

 Regards

 Jae



Re: NRT and warmupTime of filterCache

2011-03-09 Thread Jason Rutherglen
I think it's best to turn the warmupCount down to zero, because usually
there isn't enough time between the creation of new searchers to run the
warmup queries; e.g., warming would negatively impact the desired goal of
low latency for new index readers.

On Wed, Mar 9, 2011 at 3:41 AM, stockii stock.jo...@googlemail.com wrote:
 I tried to create an NRT setup like in the wiki but i got some problems with
 autowarming and onDeckSearchers.

 every minute i start a delta on one core, and the other core commits every
 minute so the index can be searched.


 the wiki says ... = 1 searcher and filterCache warmupCount=3600. with this
 config i got an exception that no searcher is available ... so i cannot use
 this config ...
 my config is 4 searchers and warmupCount=3000 ... with these settings i get
 the Performance Warning, but it works. BUT when the full 30 seconds (or
 more) are needed to warm the searcher, i cannot ping my server during this
 time and i get errors ...
 does it make sense to decrease my warmupCount to 0 ???

 how many searchers do i need for 7 cores ?

 -
 --- System 
 

 One Server, 12 GB RAM, 2 Solr Instances, 7 Cores,
 1 Core with 31 Million Documents, other Cores < 100.000

 - Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
 - Solr2 for Update-Request  - delta every Minute - 4GB Xmx
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/NRT-and-warmupTime-of-filterCache-tp2654886p2654886.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: True master-master fail-over without data gaps

2011-03-09 Thread Jason Rutherglen
If you're using the delta import handler the problem would seem to go
away because you can have two separate masters running at all times,
and if one fails, you can then point the slaves to the secondary
master, that is guaranteed to be in sync because it's been importing
from the same database?

On Tue, Mar 8, 2011 at 8:45 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hello,

 What are some common or good ways to handle indexing (master) fail-over?
 Imagine you have a continuous stream of incoming documents that you have to
 index without losing any of them (or with losing as few of them as possible).
 How do you set up you masters?
 In other words, you can't just have 2 masters where the secondary is the
 Repeater (or Slave) of the primary master and replicates the index 
 periodically:
 you need to have 2 masters that are in sync at all times!
 How do you achieve that?

 * Do you just put N masters behind a LB VIP, configure them both to point to 
 the
 index on some shared storage (e.g. SAN), and count on the LB to fail-over to 
 the
 secondary master when the primary becomes unreachable?
 If so, how do you deal with index locks?  You use the Native lock and count on
 it disappearing when the primary master goes down?  That means you count on 
 the
 whole JVM process dying, which may not be the case...

 * Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters
 with 2 separate indices in sync, while making sure you write to only 1 of them
 via LB VIP or otherwise?

 * Or ...


 This thread is on a similar topic, but is inconclusive:
  http://search-lucene.com/m/aOsyN15f1qd1

 Here is another similar thread, but this one doesn't cover how 2 masters are
 kept in sync at all times:
  http://search-lucene.com/m/aOsyN15f1qd1

 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/




Re: dataimport

2011-03-09 Thread Brian Lamb
This has since been fixed. The problem was that there was not enough memory
on the machine. It works just fine now.

On Tue, Mar 8, 2011 at 6:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : INFO: Creating a connection for entity id with URL:
 :
 jdbc:mysql://localhost/researchsquare_beta_library?characterEncoding=UTF8&zeroDateTimeBehavior=convertToNull
 : Feb 24, 2011 8:58:25 PM
 org.apache.solr.handler.dataimport.JdbcDataSource$1
 : call
 : INFO: Time taken for getConnection(): 137
 : Killed
 :
 : So it looks like for whatever reason, the server crashes trying to do a
 full
 : import. When I add a LIMIT clause on the query, it works fine when the
 LIMIT
 : is only 250 records but if I try to do 500 records, I get the same
 message.

 ...wow.  that's ... weird.

 I've never seen a java process just log Killed like that.

 The only time i've ever seen a process log Killed is if it was
 terminated by the os (ie: kill -9 pid)

 What OS are you using? how are you running solr? (ie: are you using the
 simple jetty example java -jar start.jar or are you using a different
 servlet container?) ... are you absolutely certain your machine doesn't
 have some sort of monitoring in place that kills jobs if they take too
 long, or use too much CPU?


 -Hoss



Re: Help -DIH (mail)

2011-03-09 Thread Matias Alonso
Peter,

You´re right; maybe I explained it wrong because of my English.
I did everything you told me. I think it does not find the folder when indexing.
What do you think?
Below I show you part of the log.



09/03/2011 11:52:01 org.apache.solr.core.SolrCore execute
INFO: [mail] webapp=/solr path=/dataimport params={command=full-import}
status=0 QTime=0
09/03/2011 11:52:01 org.apache.solr.handler.dataimport.DataImporter
doFullImport
INFO: Starting Full Import
09/03/2011 11:52:01 org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
INFO: Read dataimport.properties
09/03/2011 11:52:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [mail] REMOVING ALL DOCUMENTS FROM INDEX
09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
commit{dir=D:\Search
Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c]
09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1298912662799
09/03/2011 11:52:01 org.apache.solr.handler.dataimport.MailEntityProcessor
logConfig
INFO: user : myem...@gmail.com
pwd : mypass
protocol : imaps
host : imap.gmail.com
folders :
Recibidos,recibidos,RECIBIDOS,inbox.InBox,INBOX,Mail,MAIL,mail,CORREO,correo,Correo,Mail,mail,MAIL
recurse : false
exclude : []
include : []
batchSize : 100
fetchSize : 32768
read timeout : 6
conection timeout : 3
custom filter :
fetch mail since : Thu Mar 03 00:00:00 GFT 2011

09/03/2011 11:52:03 org.apache.solr.handler.dataimport.MailEntityProcessor
connectToMailBox
INFO: Connected to mailbox
09/03/2011 11:52:03 org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully
09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start
commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
commit{dir=D:\Search
Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c]
commit{dir=D:\Search
Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_d,version=1298912662800,generation=13,filenames=[segments_d]
09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1298912662800
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@1cee792 main
09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main

fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@1cee792 main

fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main

filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@1cee792 main

filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main

queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=5,evictions=0,size=5,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@1cee792 main

queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=5,evictions=0,size=5,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main

documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
09/03/2011 11:52:03 

Re: Help -DIH (mail)

2011-03-09 Thread Peter Sturge
Hi,

When you ran the status command, what was the output?


On Wed, Mar 9, 2011 at 2:55 PM, Matias Alonso matiasgalo...@gmail.com wrote:
 Peter,

 You're right; maybe I explained it badly because of my English.
 I did everything you told me. I think it doesn't find the folder when indexing.
 What do you think?
 Below I show to you part of the log.



 09/03/2011 11:52:01 org.apache.solr.core.SolrCore execute
 INFO: [mail] webapp=/solr path=/dataimport params={command=full-import}
 status=0 QTime=0
 09/03/2011 11:52:01 org.apache.solr.handler.dataimport.DataImporter
 doFullImport
 INFO: Starting Full Import
 09/03/2011 11:52:01 org.apache.solr.handler.dataimport.SolrWriter
 readIndexerProperties
 INFO: Read dataimport.properties
 09/03/2011 11:52:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
 INFO: [mail] REMOVING ALL DOCUMENTS FROM INDEX
 09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy onInit
 INFO: SolrDeletionPolicy.onInit: commits:num=1
    commit{dir=D:\Search
 Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c]
 09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy updateCommits
 INFO: newest commit = 1298912662799
 09/03/2011 11:52:01 org.apache.solr.handler.dataimport.MailEntityProcessor
 logConfig
 INFO: user : myem...@gmail.com
 pwd : mypass
 protocol : imaps
 host : imap.gmail.com
 folders :
 Recibidos,recibidos,RECIBIDOS,inbox.InBox,INBOX,Mail,MAIL,mail,CORREO,correo,Correo,Mail,mail,MAIL
 recurse : false
 exclude : []
 include : []
 batchSize : 100
 fetchSize : 32768
 read timeout : 6
 conection timeout : 3
 custom filter :
 fetch mail since : Thu Mar 03 00:00:00 GFT 2011

 09/03/2011 11:52:03 org.apache.solr.handler.dataimport.MailEntityProcessor
 connectToMailBox
 INFO: Connected to mailbox
 09/03/2011 11:52:03 org.apache.solr.handler.dataimport.DocBuilder finish
 INFO: Import completed successfully
 09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start
 commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
 09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy onCommit
 INFO: SolrDeletionPolicy.onCommit: commits:num=2
    commit{dir=D:\Search
 Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c]
    commit{dir=D:\Search
 Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_d,version=1298912662800,generation=13,filenames=[segments_d]
 09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy updateCommits
 INFO: newest commit = 1298912662800
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher init
 INFO: Opening Searcher@1cee792 main
 09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: end_commit_flush
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
 INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main

 fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
 INFO: autowarming result for Searcher@1cee792 main

 fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
 INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main

 filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
 INFO: autowarming result for Searcher@1cee792 main

 filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
 INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main

 queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=5,evictions=0,size=5,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
 INFO: autowarming result for Searcher@1cee792 main

 queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=5,evictions=0,size=5,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
 INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main

 

Re: Help -DIH (mail)

2011-03-09 Thread Matias Alonso
Log:
09/03/2011 11:54:58 org.apache.solr.core.SolrCore execute
INFO: [mail] webapp=/solr path=/dataimport params={command=status} status=0
QTime=0


XML
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">status</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-03-09 11:52:01</str>
    <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
    <str name="Committed">2011-03-09 11:52:03</str>
    <str name="Optimized">2011-03-09 11:52:03</str>
    <str name="Total Documents Processed">0</str>
    <str name="Time taken ">0:0:2.359</str>
  </lst>
  <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>


Thanks,

Matias.

2011/3/9 Peter Sturge peter.stu...@gmail.com

 Hi,

 When you ran the status command, what was the output?


 On Wed, Mar 9, 2011 at 2:55 PM, Matias Alonso matiasgalo...@gmail.com
 wrote:
  Peter,
 
  You're right; maybe I explained it badly because of my English.
  I did everything you told me. I think it doesn't find the folder when indexing.
  What do you think?
  Below I show to you part of the log.
 
 
 
  09/03/2011 11:52:01 org.apache.solr.core.SolrCore execute
  INFO: [mail] webapp=/solr path=/dataimport params={command=full-import}
  status=0 QTime=0
  09/03/2011 11:52:01 org.apache.solr.handler.dataimport.DataImporter
  doFullImport
  INFO: Starting Full Import
  09/03/2011 11:52:01 org.apache.solr.handler.dataimport.SolrWriter
  readIndexerProperties
  INFO: Read dataimport.properties
  09/03/2011 11:52:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
  INFO: [mail] REMOVING ALL DOCUMENTS FROM INDEX
  09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy onInit
  INFO: SolrDeletionPolicy.onInit: commits:num=1
 commit{dir=D:\Search
 
 Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c]
  09/03/2011 11:52:01 org.apache.solr.core.SolrDeletionPolicy updateCommits
  INFO: newest commit = 1298912662799
  09/03/2011 11:52:01
 org.apache.solr.handler.dataimport.MailEntityProcessor
  logConfig
  INFO: user : myem...@gmail.com
  pwd : mypass
  protocol : imaps
  host : imap.gmail.com
  folders :
 
 Recibidos,recibidos,RECIBIDOS,inbox.InBox,INBOX,Mail,MAIL,mail,CORREO,correo,Correo,Mail,mail,MAIL
  recurse : false
  exclude : []
  include : []
  batchSize : 100
  fetchSize : 32768
  read timeout : 6
  conection timeout : 3
  custom filter :
  fetch mail since : Thu Mar 03 00:00:00 GFT 2011
 
  09/03/2011 11:52:03
 org.apache.solr.handler.dataimport.MailEntityProcessor
  connectToMailBox
  INFO: Connected to mailbox
  09/03/2011 11:52:03 org.apache.solr.handler.dataimport.DocBuilder finish
  INFO: Import completed successfully
  09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit
  INFO: start
 
 commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
  09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy onCommit
  INFO: SolrDeletionPolicy.onCommit: commits:num=2
 commit{dir=D:\Search
 
 Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_c,version=1298912662799,generation=12,filenames=[segments_c]
 commit{dir=D:\Search
 
 Plugtree\trunk\solr\example\example-DIH\solr\mail\data\index,segFN=segments_d,version=1298912662800,generation=13,filenames=[segments_d]
  09/03/2011 11:52:03 org.apache.solr.core.SolrDeletionPolicy updateCommits
  INFO: newest commit = 1298912662800
  09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher init
  INFO: Opening Searcher@1cee792 main
  09/03/2011 11:52:03 org.apache.solr.update.DirectUpdateHandler2 commit
  INFO: end_commit_flush
  09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
  INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main
 
 
 fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
  09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
  INFO: autowarming result for Searcher@1cee792 main
 
 
 fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
  09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
  INFO: autowarming Searcher@1cee792 main from Searcher@9a18a0 main
 
 
 filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
  09/03/2011 11:52:03 

SolrJ and digest authentication

2011-03-09 Thread Erlend Garåsen


I'm trying to do a search with SolrJ using digest authentication, but 
I'm getting the following error:

org.apache.solr.common.SolrException: Unauthorized

I'm setting up SolrJ this way:

HttpClient client = new HttpClient();
List<String> authPrefs = new ArrayList<String>();
authPrefs.add(AuthPolicy.DIGEST);
client.getParams().setParameter(AuthPolicy.AUTH_SCHEME_PRIORITY, authPrefs);
AuthScope scope = new AuthScope(host, 443, resin);
client.getState().setCredentials(scope, new
UsernamePasswordCredentials(username, password));

client.getParams().setAuthenticationPreemptive(true);
SolrServer server = new CommonsHttpSolrServer(server, client);

Is this something which is not supported by SolrJ or have I written 
something wrong in the code above?


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


RE: True master-master fail-over without data gaps

2011-03-09 Thread Robert Petersen
If you have a wrapper, like an indexer app which prepares solr docs and
sends them into solr, then it is simple.  The wrapper is your 'tee' and
it can send docs to both (or N) masters.
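The fan-out logic of such a tee can be sketched in a few lines. This is only an illustrative skeleton, not Robert's actual indexer: the MasterClient interface and the in-memory "masters" below are hypothetical stand-ins for SolrJ clients (in real use send() would wrap SolrServer.add(doc) plus retry/queue handling).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TeeIndexer {
    /** Hypothetical stand-in for a SolrJ client; real code would call SolrServer.add(doc). */
    interface MasterClient {
        void send(String doc) throws Exception;
    }

    private final List<MasterClient> masters;

    TeeIndexer(List<MasterClient> masters) {
        this.masters = masters;
    }

    /** Fans one document out to every master; returns how many masters accepted it. */
    int send(String doc) {
        int accepted = 0;
        for (MasterClient m : masters) {
            try {
                m.send(doc);
                accepted++;
            } catch (Exception e) {
                // A real tee must buffer and retry failed sends here,
                // otherwise the masters silently drift out of sync
                // (the concern Otis raises later in this thread).
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        final List<String> masterA = new ArrayList<String>();
        final List<String> masterB = new ArrayList<String>();
        TeeIndexer tee = new TeeIndexer(Arrays.<MasterClient>asList(
                new MasterClient() { public void send(String d) { masterA.add(d); } },
                new MasterClient() { public void send(String d) { masterB.add(d); } }));
        // Each document goes to both masters.
        System.out.println(tee.send("doc1") + " " + masterA.size() + " " + masterB.size());
    }
}
```

The catch block is where the hard part lives: keeping N masters in sync is only as reliable as the retry/buffering strategy behind each failed send.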

-Original Message-
From: Michael Sokolov [mailto:soko...@ifactory.com] 
Sent: Wednesday, March 09, 2011 4:14 AM
To: solr-user@lucene.apache.org
Cc: Jonathan Rochkind
Subject: Re: True master-master fail-over without data gaps

Yes, I think this should be pushed upstream - insert a tee in the 
document stream so that all documents go to both masters.
Then use a load balancer to make requests of the masters.

The tee itself then becomes a possible single point of failure, but 
you didn't say anything about the architecture of the document feed.  Is

that also fault-tolerant?

-Mike

On 3/9/2011 1:06 AM, Jonathan Rochkind wrote:
 I'd honestly think about buffer the incoming documents in some store
that's actually made for fail-over persistence reliability, maybe
CouchDB or something. And then that's taking care of not losing
anything, and the problem becomes how we make sure that our solr master
indexes are kept in sync with the actual persistent store; which I'm
still not sure about, but I'm thinking it's a simpler problem. The right
tool for the right job, that kind of failover persistence is not solr's
specialty.
 
 From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
 Sent: Tuesday, March 08, 2011 11:45 PM
 To: solr-user@lucene.apache.org
 Subject: True master-master fail-over without data gaps

 Hello,

 What are some common or good ways to handle indexing (master)
fail-over?
 Imagine you have a continuous stream of incoming documents that you
have to
 index without losing any of them (or with losing as few of them as
possible).
 How do you set up you masters?
 In other words, you can't just have 2 masters where the secondary is
the
 Repeater (or Slave) of the primary master and replicates the index
periodically:
 you need to have 2 masters that are in sync at all times!
 How do you achieve that?

 * Do you just put N masters behind a LB VIP, configure them both to
point to the
 index on some shared storage (e.g. SAN), and count on the LB to
fail-over to the
 secondary master when the primary becomes unreachable?
 If so, how do you deal with index locks?  You use the Native lock and
count on
 it disappearing when the primary master goes down?  That means you
count on the
 whole JVM process dying, which may not be the case...

 * Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2
masters
 with 2 separate indices in sync, while making sure you write to only 1
of them
 via LB VIP or otherwise?

 * Or ...


 This thread is on a similar topic, but is inconclusive:
http://search-lucene.com/m/aOsyN15f1qd1

 Here is another similar thread, but this one doesn't cover how 2
masters are
 kept in sync at all times:
http://search-lucene.com/m/aOsyN15f1qd1

 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/




Re: True master-master fail-over without data gaps

2011-03-09 Thread Otis Gospodnetic
Hi,

- Original Message 

 If you're using the delta import handler the problem would seem to go
 away  because you can have two separate masters running at all times,
 and if one  fails, you can then point the slaves to the secondary
 master, that is  guaranteed to be in sync because it's been importing
 from the same  database?

Oh, there is no DB involved.  Think of a document stream continuously coming 
in, 
a component listening to that stream, grabbing docs, and pushing it to 
master(s).

Otis



 On Tue, Mar 8, 2011 at 8:45 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com  wrote:
  Hello,
 
  What are some common or good ways to  handle indexing (master) fail-over?
  Imagine you have a continuous stream  of incoming documents that you have to
  index without losing any of them  (or with losing as few of them as 
possible).
  How do you set up you  masters?
  In other words, you can't just have 2 masters where the  secondary is the
  Repeater (or Slave) of the primary master and  replicates the index 
periodically:
  you need to have 2 masters that are  in sync at all times!
  How do you achieve that?
 
  * Do you  just put N masters behind a LB VIP, configure them both to point 
  to 
the
   index on some shared storage (e.g. SAN), and count on the LB to fail-over 
  to  
the
  secondary master when the primary becomes unreachable?
  If  so, how do you deal with index locks?  You use the Native lock and 
  count  
on
  it disappearing when the primary master goes down?  That means you  count 
  on 
the
  whole JVM process dying, which may not be the  case...
 
  * Or do you use tools like DRBD, Corosync, Pacemaker,  etc. to keep 2 
masters
  with 2 separate indices in sync, while making  sure you write to only 1 of 
them
  via LB VIP or  otherwise?
 
  * Or ...
 
 
  This thread is on a  similar topic, but is inconclusive:
   http://search-lucene.com/m/aOsyN15f1qd1
 
  Here is another  similar thread, but this one doesn't cover how 2 masters 
are
  kept in  sync at all times:
   http://search-lucene.com/m/aOsyN15f1qd1
 
   Thanks,
  Otis
  
  Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/
 
 
 


Re: True master-master fail-over without data gaps

2011-03-09 Thread Otis Gospodnetic
Hi,


- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 11:40:56 AM
 Subject: RE: True master-master fail-over without data gaps
 
 If you have a wrapper, like an indexer app which prepares solr docs and
 sends  them into solr, then it is simple.  The wrapper is your 'tee' and
 it can  send docs to both (or N) masters.

Doesn't this make it too easy for 2 masters to get out of sync even if the 
problem is not with them?
e.g. something happens in this tee component and it indexes a doc to master 
A, 
but not master B.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



 -Original Message-
 From:  Michael Sokolov [mailto:soko...@ifactory.com] 
 Sent:  Wednesday, March 09, 2011 4:14 AM
 To: solr-user@lucene.apache.org
 Cc:  Jonathan Rochkind
 Subject: Re: True master-master fail-over without data  gaps
 
 Yes, I think this should be pushed upstream - insert a tee in the 
 document stream so that all documents go to both masters.
 Then use a load  balancer to make requests of the masters.
 
 The tee itself then becomes a  possible single point of failure, but 
 you didn't say anything about the  architecture of the document feed.  Is
 
 that also  fault-tolerant?
 
 -Mike
 
 On 3/9/2011 1:06 AM, Jonathan Rochkind  wrote:
  I'd honestly think about buffer the incoming documents in some  store
 that's actually made for fail-over persistence reliability,  maybe
 CouchDB or something. And then that's taking care of not  losing
 anything, and the problem becomes how we make sure that our solr  master
 indexes are kept in sync with the actual persistent store; which  I'm
 still not sure about, but I'm thinking it's a simpler problem. The  right
 tool for the right job, that kind of failover persistence is not  solr's
 specialty.
  
   From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
   Sent: Tuesday, March 08, 2011 11:45 PM
  To: solr-user@lucene.apache.org
   Subject: True master-master fail-over without data gaps
 
   Hello,
 
  What are some common or good ways to handle indexing  (master)
 fail-over?
  Imagine you have a continuous stream of incoming  documents that you
 have to
  index without losing any of them (or with  losing as few of them as
 possible).
  How do you set up you  masters?
  In other words, you can't just have 2 masters where the  secondary is
 the
  Repeater (or Slave) of the primary master and  replicates the index
 periodically:
  you need to have 2 masters that  are in sync at all times!
  How do you achieve that?
 
  * Do  you just put N masters behind a LB VIP, configure them both to
 point to  the
  index on some shared storage (e.g. SAN), and count on the LB  to
 fail-over to the
  secondary master when the primary becomes  unreachable?
  If so, how do you deal with index locks?  You use the  Native lock and
 count on
  it disappearing when the primary master goes  down?  That means you
 count on the
  whole JVM process dying,  which may not be the case...
 
  * Or do you use tools like DRBD,  Corosync, Pacemaker, etc. to keep 2
 masters
  with 2 separate indices  in sync, while making sure you write to only 1
 of them
  via LB VIP or  otherwise?
 
  * Or ...
 
 
  This thread is on a  similar topic, but is inconclusive:
 http://search-lucene.com/m/aOsyN15f1qd1
 
  Here is another  similar thread, but this one doesn't cover how 2
 masters are
  kept in  sync at all times:
 http://search-lucene.com/m/aOsyN15f1qd1
 
  Thanks,
   Otis
  
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene  ecosystem search :: http://search-lucene.com/
 
 
 


Re: True master-master fail-over without data gaps

2011-03-09 Thread Jason Rutherglen
 Oh, there is no DB involved.  Think of a document stream continuously coming 
 in,
 a component listening to that stream, grabbing docs, and pushing it to
 master(s).

I don't think Solr is designed for this use case; e.g., I wouldn't
expect deterministic results with the current architecture, as it's
something that's inherently a key component of [No]SQL databases.

On Wed, Mar 9, 2011 at 8:49 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hi,

 - Original Message 

 If you're using the delta import handler the problem would seem to go
 away  because you can have two separate masters running at all times,
 and if one  fails, you can then point the slaves to the secondary
 master, that is  guaranteed to be in sync because it's been importing
 from the same  database?

 Oh, there is no DB involved.  Think of a document stream continuously coming 
 in,
 a component listening to that stream, grabbing docs, and pushing it to
 master(s).

 Otis



 On Tue, Mar 8, 2011 at 8:45 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com  wrote:
  Hello,
 
  What are some common or good ways to  handle indexing (master) fail-over?
  Imagine you have a continuous stream  of incoming documents that you have 
  to
  index without losing any of them  (or with losing as few of them as
possible).
  How do you set up you  masters?
  In other words, you can't just have 2 masters where the  secondary is the
  Repeater (or Slave) of the primary master and  replicates the index
periodically:
  you need to have 2 masters that are  in sync at all times!
  How do you achieve that?
 
  * Do you  just put N masters behind a LB VIP, configure them both to point 
  to
the
   index on some shared storage (e.g. SAN), and count on the LB to fail-over 
  to
the
  secondary master when the primary becomes unreachable?
  If  so, how do you deal with index locks?  You use the Native lock and 
  count
on
  it disappearing when the primary master goes down?  That means you  count 
  on
the
  whole JVM process dying, which may not be the  case...
 
  * Or do you use tools like DRBD, Corosync, Pacemaker,  etc. to keep 2
 masters
  with 2 separate indices in sync, while making  sure you write to only 1 of
them
  via LB VIP or  otherwise?
 
  * Or ...
 
 
  This thread is on a  similar topic, but is inconclusive:
   http://search-lucene.com/m/aOsyN15f1qd1
 
  Here is another  similar thread, but this one doesn't cover how 2 masters
 are
  kept in  sync at all times:
   http://search-lucene.com/m/aOsyN15f1qd1
 
   Thanks,
  Otis
  
  Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/
 
 




Re: True master-master fail-over without data gaps

2011-03-09 Thread Otis Gospodnetic
Hi,



- Original Message 
 
 Yes, I think this should be pushed upstream - insert a tee in the 
 document  stream so that all documents go to both masters.
 Then use a load balancer to  make requests of the masters.

Hm, but this makes the tee app aware of this.  What if I want to hide that from 
any code of mine?

 The tee itself then becomes a possible  single point of failure, but 
 you didn't say anything about the architecture  of the document feed.  Is 
 that also  fault-tolerant?

Let's say it is! :)

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


 On 3/9/2011 1:06 AM, Jonathan Rochkind  wrote:
  I'd honestly think about buffer the incoming documents in some  store 
  that's 
actually made for fail-over persistence reliability, maybe CouchDB  or 
something. And then that's taking care of not losing anything, and the  
problem 
becomes how we make sure that our solr master indexes are kept in sync  with 
the 
actual persistent store; which I'm still not sure about, but I'm  thinking 
it's 
a simpler problem. The right tool for the right job, that kind of  failover 
persistence is not solr's specialty.
   
  From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
   Sent: Tuesday, March 08, 2011 11:45 PM
  To: solr-user@lucene.apache.org
   Subject: True master-master fail-over without data gaps
 
   Hello,
 
  What are some common or good ways to handle indexing  (master) fail-over?
  Imagine you have a continuous stream of incoming  documents that you have to
  index without losing any of them (or with  losing as few of them as 
possible).
  How do you set up you  masters?
  In other words, you can't just have 2 masters where the  secondary is the
  Repeater (or Slave) of the primary master and  replicates the index 
periodically:
  you need to have 2 masters that are  in sync at all times!
  How do you achieve that?
 
  * Do you  just put N masters behind a LB VIP, configure them both to point 
  to 
the
   index on some shared storage (e.g. SAN), and count on the LB to fail-over 
  to  
the
  secondary master when the primary becomes unreachable?
  If  so, how do you deal with index locks?  You use the Native lock and 
  count  
on
  it disappearing when the primary master goes down?  That means  you count 
  on 
the
  whole JVM process dying, which may not be the  case...
 
  * Or do you use tools like DRBD, Corosync, Pacemaker,  etc. to keep 2 
masters
  with 2 separate indices in sync, while making  sure you write to only 1 of 
them
  via LB VIP or  otherwise?
 
  * Or ...
 
 
  This thread is on a  similar topic, but is inconclusive:
 http://search-lucene.com/m/aOsyN15f1qd1
 
  Here is another  similar thread, but this one doesn't cover how 2 masters 
are
  kept in  sync at all times:
 http://search-lucene.com/m/aOsyN15f1qd1
 
  Thanks,
   Otis
  
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene  ecosystem search :: http://search-lucene.com/
 
 
 


Re: True master-master fail-over without data gaps

2011-03-09 Thread Otis Gospodnetic
Hi,


- Original Message 
 
  Oh, there is no DB involved.  Think of a document stream continuously  
  coming 
in,
  a component listening to that stream, grabbing docs, and  pushing it to
  master(s).
 
 I don't think Solr is designed for this  use case, eg, I wouldn't
 expect deterministic results with the current  architecture as it's
 something that's inherently a a key component of [No]SQL  databases.

You mean it's not possible to have 2 masters that are in nearly real-time sync?
How about with DRBD?  I know people use DRBD to keep 2 Hadoop NNs (their edit 
logs) in sync to avoid the current NN SPOF, for example, so I'm thinking this 
could be doable with Solr masters, too, no?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


 On Wed, Mar 9, 2011 at 8:49 AM, Otis Gospodnetic
 otis_gospodne...@yahoo.com  wrote:
  Hi,
 
  - Original Message  
 
  If you're using the delta import handler the problem  would seem to go
  away  because you can have two separate masters  running at all times,
  and if one  fails, you can then point the  slaves to the secondary
  master, that is  guaranteed to be in sync  because it's been importing
  from the same  database?
 
   Oh, there is no DB involved.  Think of a document stream continuously 
  coming  
in,
  a component listening to that stream, grabbing docs, and pushing it  to
  master(s).
 
  Otis
 
 
 
   On Tue, Mar 8, 2011 at 8:45 PM, Otis Gospodnetic
  otis_gospodne...@yahoo.com   wrote:
   Hello,
  
   What are some  common or good ways to  handle indexing (master) 
fail-over?
Imagine you have a continuous stream  of incoming documents that you 
   have  
to
   index without losing any of them  (or with losing as few of  them as
 possible).
   How do you set up you   masters?
   In other words, you can't just have 2 masters where  the  secondary is 
the
   Repeater (or Slave) of the primary master  and  replicates the index
 periodically:
   you need to  have 2 masters that are  in sync at all times!
   How do you  achieve that?
  
   * Do you  just put N masters  behind a LB VIP, configure them both to 
point to
 the
 index on some shared storage (e.g. SAN), and count on the LB to 
fail-over  to
 the
   secondary master when the primary becomes  unreachable?
   If  so, how do you deal with index locks?  You use  the Native lock and 
count
 on
   it disappearing when  the primary master goes down?  That means you  
   count  
on
 the
   whole JVM process dying, which may not be the   case...
  
   * Or do you use tools like DRBD,  Corosync, Pacemaker,  etc. to keep 2
  masters
   with 2  separate indices in sync, while making  sure you write to only 1 

of
 them
   via LB VIP or  otherwise?
   
   * Or ...
  
  
This thread is on a  similar topic, but is inconclusive:
 http://search-lucene.com/m/aOsyN15f1qd1
  
Here is another  similar thread, but this one doesn't cover how 2  
masters
  are
   kept in  sync at all times:
 http://search-lucene.com/m/aOsyN15f1qd1
  
 Thanks,
   Otis
   
   Sematext  :: http://sematext.com/ ::  Solr -  Lucene - Nutch
   Lucene ecosystem search :: http://search-lucene.com/
  
   
 
 
 


RE: True master-master fail-over without data gaps

2011-03-09 Thread Robert Petersen
Currently I use an application connected to a queue containing incoming
data which my indexer app turns into solr docs.  I log everything to a
log table and have never had an issue with losing anything.  I can trace
incoming docs exactly, and keep timing data in there also. If I added a
second solr url for a second master and resent the same doc to master02
that I sent to master01, I would expect near 100% synchronization.  The
problem here is how to get the slave farm to start replicating from the
second master if and when the first goes down.  I can only see that as
being a manual operation, repointing the slaves to master02 and
restarting or reloading them etc...
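The repointing step Robert describes doesn't necessarily require restarting the slaves: assuming the slaves run the Solr 1.4+ replication handler, its fetchindex command accepts an explicit masterUrl parameter, so a failover script can drive the switch over HTTP. A minimal sketch of building that request URL (host names and ports here are made up; in real use the masterUrl value should be URL-encoded):

```java
public class RepointSlave {
    /**
     * Builds the replication-handler request that tells a slave to pull
     * its next index from a different master (one-shot override).
     */
    static String fetchIndexUrl(String slaveBase, String newMasterBase) {
        return slaveBase + "/replication?command=fetchindex&masterUrl="
                + newMasterBase + "/replication";
    }

    public static void main(String[] args) {
        // A failover script would issue an HTTP GET to this URL for each
        // slave once master01 is declared dead.
        System.out.println(fetchIndexUrl("http://slave01:8983/solr",
                                         "http://master02:8983/solr"));
    }
}
```

To make the change stick across future polls, the script would still need to update each slave's configured masterUrl (e.g. via a reload with a different solrcore.properties), but the one-shot override avoids a window with no replication at all.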



-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Wednesday, March 09, 2011 8:52 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps

Hi,


- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 11:40:56 AM
 Subject: RE: True master-master fail-over without data gaps
 
 If you have a wrapper, like an indexer app which prepares solr docs
and
 sends  them into solr, then it is simple.  The wrapper is your 'tee'
and
 it can  send docs to both (or N) masters.

Doesn't this make it too easy for 2 masters to get out of sync even if
the 
problem is not with them?
e.g. something happens in this tee component and it indexes a doc to
master A, 
but not master B.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



 -Original Message-
 From:  Michael Sokolov [mailto:soko...@ifactory.com] 
 Sent:  Wednesday, March 09, 2011 4:14 AM
 To: solr-user@lucene.apache.org
 Cc:  Jonathan Rochkind
 Subject: Re: True master-master fail-over without data  gaps
 
 Yes, I think this should be pushed upstream - insert a tee in the 
 document stream so that all documents go to both masters.
 Then use a load  balancer to make requests of the masters.
 
 The tee itself then becomes a  possible single point of failure, but

 you didn't say anything about the  architecture of the document feed.
Is
 
 that also  fault-tolerant?
 
 -Mike
 
 On 3/9/2011 1:06 AM, Jonathan Rochkind  wrote:
 I'd honestly think about buffering the incoming documents in some
store
 that's actually made for fail-over persistence reliability,  maybe
 CouchDB or something. And then that's taking care of not  losing
 anything, and the problem becomes how we make sure that our solr
master
 indexes are kept in sync with the actual persistent store; which  I'm
 still not sure about, but I'm thinking it's a simpler problem. The
right
 tool for the right job, that kind of failover persistence is not
solr's
 specialty.
  
   From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
   Sent: Tuesday, March 08, 2011 11:45 PM
  To: solr-user@lucene.apache.org
   Subject: True master-master fail-over without data gaps
 
   Hello,
 
  What are some common or good ways to handle indexing  (master)
 fail-over?
  Imagine you have a continuous stream of incoming  documents that you
 have to
  index without losing any of them (or with  losing as few of them as
 possible).
  How do you set up you  masters?
  In other words, you can't just have 2 masters where the  secondary
is
 the
  Repeater (or Slave) of the primary master and  replicates the index
 periodically:
  you need to have 2 masters that  are in sync at all times!
  How do you achieve that?
 
  * Do  you just put N masters behind a LB VIP, configure them both to
 point to  the
  index on some shared storage (e.g. SAN), and count on the LB  to
 fail-over to the
  secondary master when the primary becomes  unreachable?
  If so, how do you deal with index locks?  You use the  Native lock
and
 count on
  it disappearing when the primary master goes  down?  That means you
 count on the
  whole JVM process dying,  which may not be the case...
 
  * Or do you use tools like DRBD,  Corosync, Pacemaker, etc. to keep
2
 masters
  with 2 separate indices  in sync, while making sure you write to
only 1
 of them
  via LB VIP or  otherwise?
 
  * Or ...
 
 
  This thread is on a  similar topic, but is inconclusive:
 http://search-lucene.com/m/aOsyN15f1qd1
 
  Here is another  similar thread, but this one doesn't cover how 2
 masters are
  kept in  sync at all times:
 http://search-lucene.com/m/aOsyN15f1qd1
 
  Thanks,
   Otis
  
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene  ecosystem search :: http://search-lucene.com/
 
 
 


Re: True master-master fail-over without data gaps

2011-03-09 Thread Otis Gospodnetic
Hi,



- Original Message 

 I'd honestly think about buffer the incoming documents in some store that's  
actually made for fail-over persistence reliability, maybe CouchDB or 
something.  
And then that's taking care of not losing anything, and the problem becomes 
how  
we make sure that our solr master indexes are kept in sync with the actual  
persistent store; which I'm still not sure about, but I'm thinking it's a  
simpler problem. The right tool for the right job, that kind of failover  
persistence is not solr's specialty. 


But check this!  In some cases one is not allowed to save content to disk 
(think 
copyrights).  I'm not making this up - we actually have a customer with this 
"cannot save to disk (but can index)" requirement.

So buffering to disk is not an option, and buffering in memory is not practical 
because of the input document rate and their size.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



 From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
 Sent:  Tuesday, March 08, 2011 11:45 PM
 To: solr-user@lucene.apache.org
 Subject:  True master-master fail-over without data gaps
 
 Hello,
 
 What are  some common or good ways to handle indexing (master) fail-over?
 Imagine you  have a continuous stream of incoming documents that you have to
 index without  losing any of them (or with losing as few of them as possible).
 How do you set up your masters?
 In other words, you can't just have 2 masters where the  secondary is the
 Repeater (or Slave) of the primary master and replicates the  index 
periodically:
 you need to have 2 masters that are in sync at all  times!
 How do you achieve that?
 
 * Do you just put N masters behind a  LB VIP, configure them both to point to 
the
 index on some shared storage  (e.g. SAN), and count on the LB to fail-over to 
the
 secondary master when the  primary becomes unreachable?
 If so, how do you deal with index locks?   You use the Native lock and count 
on
 it disappearing when the primary master  goes down?  That means you count on 
the
 whole JVM process dying, which  may not be the case...
 
 * Or do you use tools like DRBD, Corosync,  Pacemaker, etc. to keep 2 masters
 with 2 separate indices in sync, while  making sure you write to only 1 of 
them
 via LB VIP or otherwise?
 
 * Or  ...
 
 
 This thread is on a similar topic, but is inconclusive:
   http://search-lucene.com/m/aOsyN15f1qd1
 
 Here is another similar  thread, but this one doesn't cover how 2 masters are
 kept in sync at all  times:
   http://search-lucene.com/m/aOsyN15f1qd1
 
 Thanks,
 Otis
 
 Sematext  :: http://sematext.com/ ::  Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 


Re: True master-master fail-over without data gaps

2011-03-09 Thread Walter Underwood
On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:

 You mean it's not possible to have 2 masters that are in nearly real-time 
 sync?
 How about with DRBD?  I know people use DRBD to keep 2 Hadoop NNs (their edit 
 logs) in sync to avoid the current NN SPOF, for example, so I'm thinking this 
 could be doable with Solr masters, too, no?
 
 Otis


If you add fault tolerance, you run into the CAP theorem. Consistency, 
availability, partition tolerance: choose two. You cannot have it all.

wunder
--
Walter Underwood





RE: True master-master fail-over without data gaps

2011-03-09 Thread Robert Petersen
...but the index resides on disk doesn't it???  lol

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Wednesday, March 09, 2011 9:06 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps

Hi,



- Original Message 

 I'd honestly think about buffer the incoming documents in some store
that's  
actually made for fail-over persistence reliability, maybe CouchDB or
something.  
And then that's taking care of not losing anything, and the problem
becomes how  
we make sure that our solr master indexes are kept in sync with the
actual  
persistent store; which I'm still not sure about, but I'm thinking it's
a  
simpler problem. The right tool for the right job, that kind of
failover  
persistence is not solr's specialty. 


But check this!  In some cases one is not allowed to save content to
disk (think 
copyrights).  I'm not making this up - we actually have a customer with
this 
cannot save to disk (but can index) requirement.

So buffering to disk is not an option, and buffering in memory is not
practical 
because of the input document rate and their size.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



 From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
 Sent:  Tuesday, March 08, 2011 11:45 PM
 To: solr-user@lucene.apache.org
 Subject:  True master-master fail-over without data gaps
 
 Hello,
 
 What are  some common or good ways to handle indexing (master)
fail-over?
 Imagine you  have a continuous stream of incoming documents that you
have to
 index without  losing any of them (or with losing as few of them as
possible).
 How do you set up your masters?
 In other words, you can't just have 2 masters where the  secondary is
the
 Repeater (or Slave) of the primary master and replicates the  index 
periodically:
 you need to have 2 masters that are in sync at all  times!
 How do you achieve that?
 
 * Do you just put N masters behind a  LB VIP, configure them both to
point to 
the
 index on some shared storage  (e.g. SAN), and count on the LB to
fail-over to 
the
 secondary master when the  primary becomes unreachable?
 If so, how do you deal with index locks?   You use the Native lock and
count 
on
 it disappearing when the primary master  goes down?  That means you
count on 
the
 whole JVM process dying, which  may not be the case...
 
 * Or do you use tools like DRBD, Corosync,  Pacemaker, etc. to keep 2
masters
 with 2 separate indices in sync, while  making sure you write to only
1 of 
them
 via LB VIP or otherwise?
 
 * Or  ...
 
 
 This thread is on a similar topic, but is inconclusive:
   http://search-lucene.com/m/aOsyN15f1qd1
 
 Here is another similar  thread, but this one doesn't cover how 2
masters are
 kept in sync at all  times:
   http://search-lucene.com/m/aOsyN15f1qd1
 
 Thanks,
 Otis
 
 Sematext  :: http://sematext.com/ ::  Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 


Re: True master-master fail-over without data gaps

2011-03-09 Thread Otis Gospodnetic
Hi,



- Original Message 

 Currently I use an application connected to a queue containing incoming
 data  which my indexer app turns into solr docs.  I log everything to a
 log  table and have never had an issue with losing anything.  

Yeah, if everything goes through some storage that can be polled (either a DB 
or 
a durable JMS Topic or some such), then N masters could connect to it, not miss 
anything, and be more or less in near real-time sync.
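The pollable-storage idea can be sketched as an append-only log with per-consumer offsets. A hedged toy model (class and method names are made up; a real setup would be a DB table or a durable JMS topic):

```python
class DurableQueue:
    """Toy stand-in for a DB table / durable JMS topic: an append-only
    log that each consumer reads from its own offset, so N masters can
    poll independently and none misses a document."""
    def __init__(self):
        self.log = []
        self.offsets = {}          # consumer name -> next index to read

    def publish(self, doc):
        self.log.append(doc)

    def poll(self, consumer):
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return batch

q = DurableQueue()
q.publish({"id": "1"})
q.publish({"id": "2"})

master01 = q.poll("master01")   # both docs
q.publish({"id": "3"})
master02 = q.poll("master02")   # all three docs -- a late master catches up
print(len(master01), len(master02))  # prints: 2 3
```

Because each master tracks its own offset, a master that was down simply resumes where it left off, which is what keeps the two indexes in near real-time sync without a tee.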

 I can  trace
 incoming docs exactly, and keep timing data in there also. If I added  a
 second solr url for a second master and resent the same doc to  master02
 that I sent to master01, I would expect near 100%  synchronization.  The
 problem here is how to get the slave farm to start  replicating from the
 second master if and when the first goes down.  I  can only see that as
 being a manual operation, repointing the slaves to  master02 and
 restarting or reloading them etc...

Actually, you can configure a LB to handle that, so that's less of a problem, I 
think.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
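The LB behavior referred to above amounts to priority-ordered health checks. A minimal sketch, assuming made-up host names and an abstract health-check predicate:

```python
def pick_master(masters, is_healthy):
    """Return the first healthy master from a priority-ordered list,
    mimicking an active/passive LB VIP: slaves keep replicating from
    the VIP address and never need manual repointing."""
    for url in masters:
        if is_healthy(url):
            return url
    raise RuntimeError("no healthy master behind the VIP")

masters = ["http://master01:8983/solr", "http://master02:8983/solr"]
down = {"http://master01:8983/solr"}                   # simulate primary failure
print(pick_master(masters, lambda u: u not in down))   # prints: http://master02:8983/solr
```

In real deployments this logic lives in the load balancer itself; the point is only that failover of the slaves' masterUrl can be automatic rather than a manual repoint-and-reload.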


 -Original  Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
 Sent: Wednesday, March 09, 2011 8:52 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: True master-master fail-over without data gaps
 
 Hi,
 
 
 -  Original Message 
  From: Robert Petersen rober...@buy.com
  To: solr-user@lucene.apache.org
   Sent: Wed, March 9, 2011 11:40:56 AM
  Subject: RE: True master-master  fail-over without data gaps
  
  If you have a wrapper, like an  indexer app which prepares solr docs
 and
  sends  them into solr,  then it is simple.  The wrapper is your 'tee'
 and
  it can   send docs to both (or N) masters.
 
 Doesn't this make it too easy for 2  masters to get out of sync even if
 the 
 problem is not with them?
 e.g.  something happens in this tee component and it indexes a doc to
 master A, 
 but not master B.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
  -Original  Message-
  From:  Michael Sokolov [mailto:soko...@ifactory.com] 
   Sent:  Wednesday, March 09, 2011 4:14 AM
  To: solr-user@lucene.apache.org
   Cc:  Jonathan Rochkind
  Subject: Re: True master-master fail-over  without data  gaps
  
  Yes, I think this should be pushed  upstream - insert a tee in the 
  document stream so that all documents  go to both masters.
  Then use a load  balancer to make requests of  the masters.
  
  The tee itself then becomes a  possible  single point of failure, but
 
  you didn't say anything about the   architecture of the document feed.
 Is
  
  that also   fault-tolerant?
  
  -Mike
  
  On 3/9/2011 1:06 AM,  Jonathan Rochkind  wrote:
   I'd honestly think about buffer the  incoming documents in some
 store
  that's actually made for fail-over  persistence reliability,  maybe
  CouchDB or something. And then  that's taking care of not  losing
  anything, and the problem becomes  how we make sure that our solr
 master
  indexes are kept in sync with  the actual persistent store; which  I'm
  still not sure about, but  I'm thinking it's a simpler problem. The
 right
  tool for the right  job, that kind of failover persistence is not
 solr's
   specialty.
   
 From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
 Sent: Tuesday, March 08, 2011 11:45 PM
   To: solr-user@lucene.apache.org
 Subject: True master-master fail-over without data gaps
   
Hello,
  
   What are some common or  good ways to handle indexing  (master)
  fail-over?
Imagine you have a continuous stream of incoming  documents that  you
  have to
   index without losing any of them (or with   losing as few of them as
  possible).
   How do you set up your masters?
   In other words, you can't just have 2 masters  where the  secondary
 is
  the
   Repeater (or Slave) of  the primary master and  replicates the index
  periodically:
you need to have 2 masters that  are in sync at all times!
How do you achieve that?
  
   * Do  you just put  N masters behind a LB VIP, configure them both to
  point to   the
   index on some shared storage (e.g. SAN), and count on the  LB  to
  fail-over to the
   secondary master when the  primary becomes  unreachable?
   If so, how do you deal with  index locks?  You use the  Native lock
 and
  count on
it disappearing when the primary master goes  down?  That means  you
  count on the
   whole JVM process dying,  which may  not be the case...
  
   * Or do you use tools like  DRBD,  Corosync, Pacemaker, etc. to keep
 2
  masters
with 2 separate indices  in sync, while making sure you write to
 only  1
  of them
   via LB VIP or  otherwise?
   
   * Or ...
  
  
   This thread is  on a  similar topic, but is inconclusive:
  http://search-lucene.com/m/aOsyN15f1qd1
 

Re: True master-master fail-over without data gaps

2011-03-09 Thread Otis Gospodnetic
On disk, yes, but only indexed, and thus far enough from the original content 
that storing terms in Lucene's inverted index is acceptable.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 12:07:27 PM
 Subject: RE: True master-master fail-over without data gaps
 
 ...but the index resides on disk doesn't it???  lol
 
 -Original  Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
 Sent: Wednesday, March 09, 2011 9:06 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: True master-master fail-over without data  gaps
 
 Hi,
 
 
 
 - Original Message 
 
  I'd  honestly think about buffer the incoming documents in some store
 that's  
 actually made for fail-over persistence reliability, maybe CouchDB  or
 something.  
 And then that's taking care of not losing  anything, and the problem
 becomes how  
 we make sure that our  solr master indexes are kept in sync with the
 actual  
 persistent  store; which I'm still not sure about, but I'm thinking it's
 a  
 simpler problem. The right tool for the right job, that kind  of
 failover  
 persistence is not solr's specialty. 
 
 
 But check this!  In some cases one is not allowed to save  content to
 disk (think 
 copyrights).  I'm not making this up - we  actually have a customer with
 this 
 cannot save to disk (but can index)  requirement.
 
 So buffering to disk is not an option, and buffering in  memory is not
 practical 
 because of the input document rate and their  size.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene  ecosystem search :: http://search-lucene.com/
 
 
 
  From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
   Sent:  Tuesday, March 08, 2011 11:45 PM
  To: solr-user@lucene.apache.org
   Subject:  True master-master fail-over without data gaps
  
   Hello,
  
  What are  some common or good ways to handle  indexing (master)
 fail-over?
  Imagine you  have a continuous  stream of incoming documents that you
 have to
  index without   losing any of them (or with losing as few of them as
 possible).
  How do you set up your masters?
  In other words, you can't just have 2  masters where the  secondary is
 the
  Repeater (or Slave) of the  primary master and replicates the  index 
 periodically:
  you  need to have 2 masters that are in sync at all  times!
  How do you  achieve that?
  
  * Do you just put N masters behind a  LB  VIP, configure them both to
 point to 
 the
  index on some shared  storage  (e.g. SAN), and count on the LB to
 fail-over to 
 the
  secondary master when the  primary becomes  unreachable?
  If so, how do you deal with index locks?   You use the  Native lock and
 count 
 on
  it disappearing when the primary  master  goes down?  That means you
 count on 
 the
   whole JVM process dying, which  may not be the case...
  
  *  Or do you use tools like DRBD, Corosync,  Pacemaker, etc. to keep  2
 masters
  with 2 separate indices in sync, while  making sure  you write to only
 1 of 
 them
  via LB VIP or otherwise?
  
  * Or  ...
  
  
  This thread is on a similar  topic, but is inconclusive:
   http://search-lucene.com/m/aOsyN15f1qd1
  
  Here is another  similar  thread, but this one doesn't cover how 2
 masters are
   kept in sync at all  times:
   http://search-lucene.com/m/aOsyN15f1qd1
  
  Thanks,
   Otis
  
  Sematext  :: http://sematext.com/ ::  Solr - Lucene - Nutch
   Lucene ecosystem search :: http://search-lucene.com/
  
  
 


Re: True master-master fail-over without data gaps

2011-03-09 Thread Markus Jelsma
RAMdisk

 ...but the index resides on disk doesn't it???  lol
 
 -Original Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Sent: Wednesday, March 09, 2011 9:06 AM
 To: solr-user@lucene.apache.org
 Subject: Re: True master-master fail-over without data gaps
 
 Hi,
 
 
 
 - Original Message 
 
  I'd honestly think about buffer the incoming documents in some store
 
 that's
 
 actually made for fail-over persistence reliability, maybe CouchDB or
 
 something.
 
 And then that's taking care of not losing anything, and the problem
 
 becomes how
 
 we make sure that our solr master indexes are kept in sync with the
 
 actual
 
 persistent store; which I'm still not sure about, but I'm thinking it's
 
 a
 
 simpler problem. The right tool for the right job, that kind of
 
 failover
 
 persistence is not solr's specialty.
 
 But check this!  In some cases one is not allowed to save content to
 disk (think
 copyrights).  I'm not making this up - we actually have a customer with
 this
 cannot save to disk (but can index) requirement.
 
 So buffering to disk is not an option, and buffering in memory is not
 practical
 because of the input document rate and their size.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
  From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
  Sent:  Tuesday, March 08, 2011 11:45 PM
  To: solr-user@lucene.apache.org
  Subject:  True master-master fail-over without data gaps
  
  Hello,
  
  What are  some common or good ways to handle indexing (master)
 
 fail-over?
 
  Imagine you  have a continuous stream of incoming documents that you
 
 have to
 
  index without  losing any of them (or with losing as few of them as
 
 possible).
 
  How do you set up your masters?
  In other words, you can't just have 2 masters where the  secondary is
 
 the
 
  Repeater (or Slave) of the primary master and replicates the  index
 
 periodically:
  you need to have 2 masters that are in sync at all  times!
  How do you achieve that?
  
  * Do you just put N masters behind a  LB VIP, configure them both to
 
 point to
 
 the
 
  index on some shared storage  (e.g. SAN), and count on the LB to
 
 fail-over to
 
 the
 
  secondary master when the  primary becomes unreachable?
  If so, how do you deal with index locks?   You use the Native lock and
 
 count
 on
 
  it disappearing when the primary master  goes down?  That means you
 
 count on
 
 the
 
  whole JVM process dying, which  may not be the case...
  
  * Or do you use tools like DRBD, Corosync,  Pacemaker, etc. to keep 2
 
 masters
 
  with 2 separate indices in sync, while  making sure you write to only
 
 1 of
 them
 
  via LB VIP or otherwise?
  
  * Or  ...
  
  This thread is on a similar topic, but is inconclusive:
http://search-lucene.com/m/aOsyN15f1qd1
  
  Here is another similar  thread, but this one doesn't cover how 2
 
 masters are
 
  kept in sync at all  times:
http://search-lucene.com/m/aOsyN15f1qd1
  
  Thanks,
  Otis
  
  Sematext  :: http://sematext.com/ ::  Solr - Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/


Re: True master-master fail-over without data gaps

2011-03-09 Thread Jason Rutherglen
This is why there's block cipher cryptography.

On Wed, Mar 9, 2011 at 9:11 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 On disk, yes, but only indexed, and thus far enough from the original content 
 to
 make storing terms in Lucene's inverted index.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



Re: dataimport

2011-03-09 Thread Adam Estrada
Brian,

I had the same problem a while back and set the JAVA_OPTS env variable
to something my machine could handle. That may also be an option for
you going forward.

Adam

On Wed, Mar 9, 2011 at 9:33 AM, Brian Lamb
brian.l...@journalexperts.com wrote:
 This has since been fixed. The problem was that there was not enough memory
 on the machine. It works just fine now.

 On Tue, Mar 8, 2011 at 6:22 PM, Chris Hostetter 
 hossman_luc...@fucit.org wrote:


 : INFO: Creating a connection for entity id with URL:
 :
 jdbc:mysql://localhost/researchsquare_beta_library?characterEncoding=UTF8&zeroDateTimeBehavior=convertToNull
 : Feb 24, 2011 8:58:25 PM
 org.apache.solr.handler.dataimport.JdbcDataSource$1
 : call
 : INFO: Time taken for getConnection(): 137
 : Killed
 :
 : So it looks like for whatever reason, the server crashes trying to do a
 full
 : import. When I add a LIMIT clause on the query, it works fine when the
 LIMIT
 : is only 250 records but if I try to do 500 records, I get the same
 message.

 ...wow.  that's ... weird.

 I've never seen a java process just log Killed like that.

 The only time i've ever seen a process log Killed is if it was
 terminated by the os (ie: kill -9 pid)

 What OS are you using? how are you running solr? (ie: are you using the
 simple jetty example java -jar start.jar or are you using a different
 servlet container?) ... are you absolutely certain your machine doesn't
 have some sort of monitoring in place that kills jobs if they take too
 long, or use too much CPU?


 -Hoss




RE: True master-master fail-over without data gaps

2011-03-09 Thread Robert Petersen
I guess you could put a LB between slaves and masters, never thought of
that!  :)

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Wednesday, March 09, 2011 9:10 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps

Hi,



- Original Message 

 Currently I use an application connected to a queue containing
incoming
 data  which my indexer app turns into solr docs.  I log everything to
a
 log  table and have never had an issue with losing anything.  

Yeah, if everything goes through some storage that can be polled (either
a DB or 
a durable JMS Topic or some such), then N masters could connect to it,
not miss 
anything, and be more or less in near real-time sync.

 I can  trace
 incoming docs exactly, and keep timing data in there also. If I added
a
 second solr url for a second master and resent the same doc to
master02
 that I sent to master01, I would expect near 100%  synchronization.
The
 problem here is how to get the slave farm to start  replicating from
the
 second master if and when the first goes down.  I  can only see that
as
 being a manual operation, repointing the slaves to  master02 and
 restarting or reloading them etc...

Actually, you can configure a LB to handle that, so that's less of a
problem, I 
think.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


 -Original  Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
 Sent: Wednesday, March 09, 2011 8:52 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: True master-master fail-over without data gaps
 
 Hi,
 
 
 -  Original Message 
  From: Robert Petersen rober...@buy.com
  To: solr-user@lucene.apache.org
   Sent: Wed, March 9, 2011 11:40:56 AM
  Subject: RE: True master-master  fail-over without data gaps
  
  If you have a wrapper, like an  indexer app which prepares solr docs
 and
  sends  them into solr,  then it is simple.  The wrapper is your
'tee'
 and
  it can   send docs to both (or N) masters.
 
 Doesn't this make it too easy for 2  masters to get out of sync even
if
 the 
 problem is not with them?
 e.g.  something happens in this tee component and it indexes a doc
to
 master A, 
 but not master B.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
  -Original  Message-
  From:  Michael Sokolov [mailto:soko...@ifactory.com] 
   Sent:  Wednesday, March 09, 2011 4:14 AM
  To: solr-user@lucene.apache.org
   Cc:  Jonathan Rochkind
  Subject: Re: True master-master fail-over  without data  gaps
  
  Yes, I think this should be pushed  upstream - insert a tee in the

  document stream so that all documents  go to both masters.
  Then use a load  balancer to make requests of  the masters.
  
  The tee itself then becomes a  possible  single point of failure,
but
 
  you didn't say anything about the   architecture of the document
feed.
 Is
  
  that also   fault-tolerant?
  
  -Mike
  
  On 3/9/2011 1:06 AM,  Jonathan Rochkind  wrote:
   I'd honestly think about buffer the  incoming documents in some
 store
  that's actually made for fail-over  persistence reliability,  maybe
  CouchDB or something. And then  that's taking care of not  losing
  anything, and the problem becomes  how we make sure that our solr
 master
  indexes are kept in sync with  the actual persistent store; which
I'm
  still not sure about, but  I'm thinking it's a simpler problem. The
 right
  tool for the right  job, that kind of failover persistence is not
 solr's
   specialty.
   
 From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
 Sent: Tuesday, March 08, 2011 11:45 PM
   To: solr-user@lucene.apache.org
 Subject: True master-master fail-over without data gaps
   
Hello,
  
   What are some common or  good ways to handle indexing  (master)
  fail-over?
Imagine you have a continuous stream of incoming  documents that
you
  have to
   index without losing any of them (or with   losing as few of them
as
  possible).
   How do you set up your masters?
   In other words, you can't just have 2 masters  where the
secondary
 is
  the
   Repeater (or Slave) of  the primary master and  replicates the
index
  periodically:
you need to have 2 masters that  are in sync at all times!
How do you achieve that?
  
   * Do  you just put  N masters behind a LB VIP, configure them both
to
  point to   the
   index on some shared storage (e.g. SAN), and count on the  LB  to
  fail-over to the
   secondary master when the  primary becomes  unreachable?
   If so, how do you deal with  index locks?  You use the  Native
lock
 and
  count on
it disappearing when the primary master goes  down?  That means
you
  count on the
   whole JVM process dying,  which may  not be the case...
  
   * Or do you use tools like  DRBD, 

Re: True master-master fail-over without data gaps

2011-03-09 Thread Otis Gospodnetic
Right.  LB VIP on both sides of master(s).  Black box.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 12:16:31 PM
 Subject: RE: True master-master fail-over without data gaps
 
 I guess you could put a LB between slaves and masters, never thought  of
 that!  :)
 
 -Original Message-
 From: Otis Gospodnetic  [mailto:otis_gospodne...@yahoo.com] 
 Sent: Wednesday, March 09, 2011 9:10 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: True master-master fail-over without data  gaps
 
 Hi,
 
 
 
 - Original Message 
 
   Currently I use an application connected to a queue  containing
 incoming
  data  which my indexer app turns into solr  docs.  I log everything to
 a
  log  table and have never had  an issue with losing anything.  
 
 Yeah, if everything goes through  some storage that can be polled (either
 a DB or 
 a durable JMS Topic or  some such), then N masters could connect to it,
 not miss 
 anything, and be  more or less in near real-time sync.
 
  I can  trace
   incoming docs exactly, and keep timing data in there also. If I  added
 a
  second solr url for a second master and resent the same doc  to
 master02
  that I sent to master01, I would expect near 100%   synchronization.
 The
  problem here is how to get the slave farm to  start  replicating from
 the
  second master if and when the first  goes down.  I  can only see that
 as
  being a manual  operation, repointing the slaves to  master02 and
  restarting or  reloading them etc...
 
 Actually, you can configure a LB to handle that, so  that's less of a
 problem, I 
 think.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Wednesday, March 09, 2011 8:52 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps

Hi,


- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 11:40:56 AM
 Subject: RE: True master-master fail-over without data gaps

 If you have a wrapper, like an indexer app which prepares solr docs and
 sends them into solr, then it is simple.  The wrapper is your 'tee' and
 it can send docs to both (or N) masters.

Doesn't this make it too easy for 2 masters to get out of sync even if the
problem is not with them?
e.g. something happens in this tee component and it indexes a doc to
master A, but not master B.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


 -Original Message-
 From: Michael Sokolov [mailto:soko...@ifactory.com]
 Sent: Wednesday, March 09, 2011 4:14 AM
 To: solr-user@lucene.apache.org
 Cc: Jonathan Rochkind
 Subject: Re: True master-master fail-over without data gaps

 Yes, I think this should be pushed upstream - insert a tee in the
 document stream so that all documents go to both masters.
 Then use a load balancer to make requests of the masters.

 The tee itself then becomes a possible single point of failure, but
 you didn't say anything about the architecture of the document feed. Is
 that also fault-tolerant?

 -Mike

 On 3/9/2011 1:06 AM, Jonathan Rochkind wrote:
  I'd honestly think about buffering the incoming documents in some store
  that's actually made for fail-over persistence reliability, maybe
  CouchDB or something. And then that's taking care of not losing
  anything, and the problem becomes how we make sure that our solr master
  indexes are kept in sync with the actual persistent store; which I'm
  still not sure about, but I'm thinking it's a simpler problem. The right
  tool for the right job: that kind of failover persistence is not solr's
  specialty.

  From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
  Sent: Tuesday, March 08, 2011 11:45 PM
  To: solr-user@lucene.apache.org
  Subject: True master-master fail-over without data gaps

  Hello,

  What are some common or good ways to handle indexing (master) fail-over?
  Imagine you have a continuous stream of incoming documents that you have
  to index without losing any of them (or with losing as few of them as
  possible).
  How do you set up your masters?
  In other words, you can't just have 2 masters where the secondary is the
  Repeater (or Slave) of the primary master and replicates the index
  periodically:

Newb query question

2011-03-09 Thread Daniel Baughman
Is there a way to perform string logic on the key field using a subquery or
some other method.

 

IE. If the left 4 characters of the key are ABCD, then include or exclude
those from the search.

 

Here is the layman's pseudo code for what I'm wanting to do:



*:* AND LEFT(KEY, 4) <> 'abcd'

 

Anyone know that one?



Re: Newb query question

2011-03-09 Thread Erick Erickson
How about something like:

for exclusion
+*:* -KEY:abcd*

for inclusion
+*:* +KEY:abcd*

Best
Erick

On Wed, Mar 9, 2011 at 12:34 PM, Daniel Baughman da...@hostworks.com wrote:
 Is there a way to perform string logic on the key field using a subquery or
 some other method.



 IE. If the left 4 characters of the key are ABCD, then include or exclude
 those from the search.



 Here is the layman's pseudo code for what I'm wanting to do:



 *:* AND LEFT(KEY, 4) <> 'abcd'



 Anyone know that one?




Re: True master-master fail-over without data gaps

2011-03-09 Thread Jonathan Rochkind

On 3/9/2011 12:05 PM, Otis Gospodnetic wrote:
But check this! In some cases one is not allowed to save content to disk
(think copyrights).  I'm not making this up - we actually have a customer
with this "cannot save to disk (but can index)" requirement.


Do they realize that a Solr index is on disk, and if you save it to a 
Solr index it's being saved to disk?  If they prohibited you from 
putting the doc in a stored field in Solr, I guess that would at least 
be somewhat consistent, although annoying.


But I don't think it's our customers' job to tell us HOW to implement 
our software to get the results they want. They can certainly make you 
promise not to distribute or use copyrighted material, and they can even 
ask to see your security procedures to make sure it doesn't get out.  
But if you need to buffer documents to achieve the application they 
want, but they won't let you... Solr can't help you with that.


As I suggested before though, I might rather buffer to a NoSQL store 
like MongoDB or CouchDB instead of actually to disk. Perhaps your 
customer won't notice those stores keep data on disk just like they 
haven't noticed Solr does.  I am not an expert in various kinds of NoSQL 
stores, but I think some of them in fact specialize in the area of 
concern here: Absolute failover reliability through replication.


Solr is not a store.


So buffering to disk is not an option, and buffering in memory is not practical
because of the input document rate and their size.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
Sent:  Tuesday, March 08, 2011 11:45 PM
To: solr-user@lucene.apache.org
Subject:  True master-master fail-over without data gaps

Hello,

What are some common or good ways to handle indexing (master) fail-over?
Imagine you have a continuous stream of incoming documents that you have to
index without losing any of them (or with losing as few of them as possible).
How do you set up your masters?
In other words, you can't just have 2 masters where the secondary is the
Repeater (or Slave) of the primary master and replicates the index
periodically:
you need to have 2 masters that are in sync at all times!
How do you achieve that?

* Do you just put N masters behind a LB VIP, configure them both to point to
the index on some shared storage (e.g. SAN), and count on the LB to fail-over
to the secondary master when the primary becomes unreachable?
If so, how do you deal with index locks?  You use the Native lock and count on
it disappearing when the primary master goes down?  That means you count on
the whole JVM process dying, which may not be the case...

* Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters
with 2 separate indices in sync, while making sure you write to only 1 of them
via LB VIP or otherwise?

* Or ...


This thread is on a similar topic, but is inconclusive:
   http://search-lucene.com/m/aOsyN15f1qd1

Here is another similar thread, but this one doesn't cover how 2 masters are
kept in sync at all times:
   http://search-lucene.com/m/aOsyN15f1qd1

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
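The buffer-first idea Jonathan suggests can be sketched in code (illustrative names only; the dict stands in for a durable store such as CouchDB, and `send_to_solr` for a real /update call): write each doc to the store first, mark it indexed only after Solr acknowledges it, and replay unmarked docs after a failover.

```python
# Sketch of "persist first, index second": the durable store is the
# source of truth, and the index can always be caught up from it.

class BufferedIndexer:
    def __init__(self, send_to_solr):
        self.store = {}          # doc id -> (doc, indexed?)
        self.send = send_to_solr

    def submit(self, doc):
        self.store[doc["id"]] = (doc, False)   # durable write comes first
        try:
            self.send(doc)
            self.store[doc["id"]] = (doc, True)
        except Exception:
            pass                                # stays marked un-indexed

    def replay_unindexed(self):
        # After fail-over, push every doc Solr never acknowledged.
        for doc_id, (doc, done) in list(self.store.items()):
            if not done:
                self.send(doc)
                self.store[doc_id] = (doc, True)
```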




Re: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Otis Gospodnetic
Hi,

 
- Original Message 
 From: Walter Underwood wun...@wunderwood.org

 On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:

  You mean it's not possible to have 2 masters that are in nearly
  real-time sync?
  How about with DRBD?  I know people use DRBD to keep 2 Hadoop NNs
  (their edit logs) in sync to avoid the current NN SPOF, for example,
  so I'm thinking this could be doable with Solr masters, too, no?

 If you add fault-tolerant, you run into the CAP Theorem. Consistency,
 availability, partition: choose two. You cannot have it all.

Right, so I'll take Consistency and Availability, and I'll put my 2 masters in 
the same rack (which has redundant switches, power supply, etc.) and thus 
minimize/avoid partitioning.
Assuming the above actually works, I think my Q remains:

How do you set up 2 Solr masters so they are in near real-time sync?  DRBD?

But here is maybe a simpler scenario that more people may be considering:

Imagine 2 masters on 2 different servers in 1 rack, pointing to the same index 
on the shared storage (SAN) that also happens to live in the same rack.
2 Solr masters are behind 1 LB VIP that indexer talks to.
The VIP is configured so that all requests always get routed to the primary 
master (because only 1 master can be modifying an index at a time), except when 
this primary is down, in which case the requests are sent to the secondary 
master.

So in this case my Q is around automation of this, around Lucene index locks, 
around the need for manual intervention, and such.
Concretely, if you have these 2 master instances, the primary master has the 
Lucene index lock in the index dir.  When the secondary master needs to take 
over (i.e., when it starts receiving documents via LB), it needs to be able to 
write to that same index.  But what if that lock is still around?  One could 
use 
the Native lock to make the lock disappear if the primary master's JVM exited 
unexpectedly, and in that case everything *should* work and be completely 
transparent, right?  That is, the secondary will start getting new docs, it 
will 
use its IndexWriter to write to that same shared index, which won't be locked 
for writes because the lock is gone, and everyone will be happy.  Did I miss 
something important here?

Assuming the above is correct, what if the lock is *not* gone because the 
primary master's JVM is actually not dead, although maybe unresponsive, so LB 
thinks the primary master is dead.  Then the LB will route indexing requests to 
the secondary master, which will attempt to write to the index, but be denied 
because of the lock.  So a human needs to jump in, remove the lock, and 
manually 
reindex failed docs if the upstream component doesn't buffer docs that failed 
to 
get indexed and doesn't retry indexing them automatically.  Is this correct or 
is there a way to avoid humans here?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: Newb query question

2011-03-09 Thread Otis Gospodnetic
Hi,

It sounds like if you put those 4 chars in a separate field at index time you 
could apply your logic on that at search time.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Daniel Baughman da...@hostworks.com
 To: solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 12:34:54 PM
 Subject: Newb query question
 
 Is there a way to perform string logic on the key field using a subquery  or
 some other method.
 
 
 
 IE. If the left 4 characters of the key  are ABCD, then include or exclude
 those from the search.
 
 
 
 Here is the layman's pseudo code for what I'm wanting to do:



 *:* AND LEFT(KEY, 4) <> 'abcd'
 
 
 
 Anyone know that one?
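Index-time, Otis's suggestion could look like the following sketch (the field names `key` and `key_prefix` are made up for illustration); at query time you then filter with `fq=key_prefix:abcd` or `-key_prefix:abcd`.

```python
# Copy the first four characters of the key into its own field while
# building the document, so the prefix can be filtered on directly.
# Field names are illustrative, not from the poster's schema.

def add_key_prefix(doc, n=4):
    doc = dict(doc)                            # don't mutate the caller's doc
    doc["key_prefix"] = doc["key"][:n].lower()
    return doc

doc = add_key_prefix({"key": "ABCD-000123"})   # key_prefix == "abcd"
```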
 
 


RE: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Robert Petersen
Can't you skip the SAN and keep the indexes locally?  Then you would
have two redundant copies of the index and no lock issues.  

Also, can't master02 just be a slave to master01 (in the master farm and
separate from the slave farm) until such time as master01 fails?  Then
master02 would start receiving the new documents with an index
complete up to the last replication at least, and the other slaves would
be directed by LB to poll master02 also...

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Wednesday, March 09, 2011 9:47 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps (choosing CA
in CAP)

Hi,

 
- Original Message 
 From: Walter Underwood wun...@wunderwood.org

 On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:
 
  You mean it's  not possible to have 2 masters that are in nearly
real-time 
sync?
  How  about with DRBD?  I know people use DRBD to keep 2 Hadoop NNs
(their 
edit 

  logs) in sync to avoid the current NN SPOF, for example, so I'm
thinking 
this 

  could be doable with Solr masters, too, no?
 
 If you add fault-tolerant, you run into the CAP  Theorem. Consistency,

availability, partition: choose two. You cannot have it  all.

Right, so I'll take Consistency and Availability, and I'll put my 2
masters in 
the same rack (which has redundant switches, power supply, etc.) and
thus 
minimize/avoid partitioning.
Assuming the above actually works, I think my Q remains:

How do you set up 2 Solr masters so they are in near real-time sync?
DRBD?

But here is maybe a simpler scenario that more people may be
considering:

Imagine 2 masters on 2 different servers in 1 rack, pointing to the same
index 
on the shared storage (SAN) that also happens to live in the same rack.
2 Solr masters are behind 1 LB VIP that indexer talks to.
The VIP is configured so that all requests always get routed to the
primary 
master (because only 1 master can be modifying an index at a time),
except when 
this primary is down, in which case the requests are sent to the
secondary 
master.

So in this case my Q is around automation of this, around Lucene index
locks, 
around the need for manual intervention, and such.
Concretely, if you have these 2 master instances, the primary master has
the 
Lucene index lock in the index dir.  When the secondary master needs to
take 
over (i.e., when it starts receiving documents via LB), it needs to be
able to 
write to that same index.  But what if that lock is still around?  One
could use 
the Native lock to make the lock disappear if the primary master's JVM
exited 
unexpectedly, and in that case everything *should* work and be
completely 
transparent, right?  That is, the secondary will start getting new docs,
it will 
use its IndexWriter to write to that same shared index, which won't be
locked 
for writes because the lock is gone, and everyone will be happy.  Did I
miss 
something important here?

Assuming the above is correct, what if the lock is *not* gone because
the 
primary master's JVM is actually not dead, although maybe unresponsive,
so LB 
thinks the primary master is dead.  Then the LB will route indexing
requests to 
the secondary master, which will attempt to write to the index, but be
denied 
because of the lock.  So a human needs to jump in, remove the lock, and
manually 
reindex failed docs if the upstream component doesn't buffer docs that
failed to 
get indexed and doesn't retry indexing them automatically.  Is this
correct or 
is there a way to avoid humans here?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: Solr Hanging all of sudden with update/csv

2011-03-09 Thread danomano
After about 4-5 hours the merge completed (ran out of heap). As you
suggested, it was having memory issues.

Read queries during the merge were working just fine (they were taking
longer than normal, ~30-60 seconds).

I think I need to do more reading on understanding the merge/optimization
processes.

I am beginning to think what I need to do is have lots of segments (i.e.
frequent merges of smaller sized segments) - wouldn't that speed up the
merging process when it actually runs?

A couple things I'm trying to wrap my head around:

Increasing the segments will improve indexing speed on the whole.
The question I have is: when it needs to actually perform a merge, will
having more segments be better (i.e. make the merge process faster)? Or
longer? Having a 4-hour merge (aka indexing request) is not really
acceptable (unless I can control when that merge happens).

We are using our Solr server differently than most: frequent inserts (in
batches), with few reads.

I would say having a 'long' query time is acceptable (say ~60 seconds).





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Hanging-all-of-sudden-with-update-csv-tp2652903p2656457.html
Sent from the Solr - User mailing list archive at Nabble.com.


how would you design schema?

2011-03-09 Thread dan whelan

Hi,

I'm investigating how to set up a schema like this:

I want to index accounts and the products purchased (multiValued) by 
that account but I also need the ability to search by the date the 
product was purchased.


It would be easy if the purchase date wasn't part of the requirements.

How would the schema be designed? Is there a better approach?

Thanks,

Dan



Re: how would you design schema?

2011-03-09 Thread Geert-Jan Brits
Would having a solr-document represent a 'product purchase per account'
solve your problem?
You could then easily link the date of purchase to the document as well as
the account-number.

e.g:
fields: orderid (key), productid, product-characteristics,
order-characteristics (including date of purchase).

or, in case multiple products share a joined orderid:
fields: concat(orderid, productid) (key), orderid, productid,
product-characteristics, order-characteristics (including date of
purchase).

The difference from your setup (i.e. one document per account) is that the
suggested setup above may return multiple documents when you search by
account-nr, which may or may not be what you're after.

hth,
Geert-Jan

2011/3/9 dan whelan d...@adicio.com

 Hi,

 I'm investigating how to set up a schema like this:

 I want to index accounts and the products purchased (multiValued) by that
 account but I also need the ability to search by the date the product was
 purchased.

 It would be easy if the purchase date wasn't part of the requirements.

 How would the schema be designed? Is there a better approach?

 Thanks,

 Dan
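As a sketch of the one-document-per-purchase layout Geert-Jan describes (all field names are illustrative, not from Dan's schema), the indexed documents would look roughly like this; searching `account_id:42` returns every purchase for the account, and `purchase_date` can be filtered or sorted on directly:

```xml
<!-- One Solr document per purchase. The id joins account and order
     ids to stay unique; field names are examples only. -->
<add>
  <doc>
    <field name="id">42_9001</field>
    <field name="account_id">42</field>
    <field name="product_id">9001</field>
    <field name="purchase_date">2011-03-01T00:00:00Z</field>
  </doc>
  <doc>
    <field name="id">42_9002</field>
    <field name="account_id">42</field>
    <field name="product_id">9002</field>
    <field name="purchase_date">2011-03-08T00:00:00Z</field>
  </doc>
</add>
```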




Sorting

2011-03-09 Thread Brian Lamb
Hi all,

I know that I can add sort=score desc to the url to sort in descending
order. However, I would like to sort a MoreLikeThis response which returns
records like this:

<lst name="moreLikeThis">
  <result name="3" numFound="113611" start="0" maxScore="0.4392774"/>
  <result name="2" numFound="" start="0" maxScore="0.5392774"/>
</lst>

I don't want them grouped by result; I would just like have them all thrown
together and then sorted according to score. I have an XSLT which does put
them altogether and returns the following:

<moreLikeThis>
  <similar>
    <score>x.</score>
    <id>some_id</id>
  </similar>
</moreLikeThis>

However it appears that it basically applies the stylesheet to
<result name="3"> and then to <result name="2">.

How can I make it so that with my XSLT, the results appear sorted by
score?
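One generic XSLT pattern for this is to select the doc elements from all the result lists in a single for-each and make xsl:sort the first thing inside it. A sketch, assuming the stock Solr XML response shape (`result/doc/float[@name='score']`); adapt the element names to your actual stylesheet:

```xml
<!-- Gather the docs from every moreLikeThis result into one pool and
     emit them ordered by score, highest first. Element names assume the
     standard Solr XML response writer with score in the field list. -->
<xsl:template match="lst[@name='moreLikeThis']">
  <moreLikeThis>
    <xsl:for-each select="result/doc">
      <xsl:sort select="float[@name='score']"
                data-type="number" order="descending"/>
      <similar>
        <score><xsl:value-of select="float[@name='score']"/></score>
        <id><xsl:value-of select="str[@name='id']"/></id>
      </similar>
    </xsl:for-each>
  </moreLikeThis>
</xsl:template>
```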


Re: docBoost

2011-03-09 Thread Brian Lamb
Anyone have any clue on this on?

On Tue, Mar 8, 2011 at 2:11 PM, Brian Lamb brian.l...@journalexperts.comwrote:

 Hi all,

 I am using dataimport to create my index and I want to use docBoost to
 assign some higher weights to certain docs. I understand the concept behind
 docBoost but I haven't been able to find an example anywhere that shows how
 to implement it. Assuming the following config file:

<document>
  <entity name="animal"
          dataSource="animals"
          pk="id"
          query="SELECT * FROM animals">
    <field column="id" name="id" />
    <field column="genus" name="genus" />
    <field column="species" name="species" />
    <entity name="boosters"
            dataSource="boosts"
            query="SELECT boost_score FROM boosts WHERE animal_id = ${animal.id}">
      <field column="boost_score" name="boost_score" />
    </entity>
  </entity>
</document>

 How do I add in a docBoost score? The boost score is currently in a
 separate table as shown above.



Re: docBoost

2011-03-09 Thread Jayendra Patil
you can use the ScriptTransformer to perform the boost calcualtion and addition.
http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

<dataConfig>
  <script><![CDATA[
    function f1(row) {
      // Add boost
      row.put('$docBoost', 1.5);
      return row;
    }
  ]]></script>
  <document>
    <entity name="e" pk="id" transformer="script:f1"
            query="select * from X">
    </entity>
  </document>
</dataConfig>

Regards,
Jayendra


On Wed, Mar 9, 2011 at 2:01 PM, Brian Lamb
brian.l...@journalexperts.com wrote:
 Anyone have any clue on this on?

 On Tue, Mar 8, 2011 at 2:11 PM, Brian Lamb 
 brian.l...@journalexperts.comwrote:

 Hi all,

 I am using dataimport to create my index and I want to use docBoost to
 assign some higher weights to certain docs. I understand the concept behind
 docBoost but I haven't been able to find an example anywhere that shows how
 to implement it. Assuming the following config file:

  <document>
    <entity name="animal"
            dataSource="animals"
            pk="id"
            query="SELECT * FROM animals">
      <field column="id" name="id" />
      <field column="genus" name="genus" />
      <field column="species" name="species" />
      <entity name="boosters"
              dataSource="boosts"
              query="SELECT boost_score FROM boosts WHERE animal_id = ${animal.id}">
        <field column="boost_score" name="boost_score" />
      </entity>
    </entity>
  </document>

 How do I add in a docBoost score? The boost score is currently in a
 separate table as shown above.




Re: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Otis Gospodnetic
Hi,

 Original Message 

 From: Robert Petersen rober...@buy.com

 Can't you skip the SAN and keep the indexes locally?  Then you  would
 have two redundant copies of the index and no lock issues.  

I could, but then I'd have the issue of keeping them in sync, which seems more 
fragile.  I think SAN makes things simpler overall.
 
 Also, Can't master02 just be a slave to master01 (in the master farm  and
 separate from the slave farm) until such time as master01 fails?   Then

No, because it wouldn't be in sync.  It would always be N minutes behind, and 
when the primary master fails, the secondary would not have all the docs - data 
loss.

 master02 would start receiving the new documents with an  indexes
 complete up to the last replication at least and the other slaves  would
 be directed by LB to poll master02 also...

Yeah, "complete up to the last replication" is the problem.  It's a data gap 
that now needs to be filled somehow.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


 -Original  Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
 Sent: Wednesday, March 09, 2011 9:47 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: True master-master fail-over without data gaps (choosing CA
 in  CAP)
 
 Hi,
 
 
 - Original Message 
  From: Walter  Underwood wun...@wunderwood.org
 
  On  Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:
  
   You mean  it's  not possible to have 2 masters that are in nearly
 real-time 
 sync?
   How  about with DRBD?  I know people use  DRBD to keep 2 Hadoop NNs
 (their 
 edit 
 
   logs) in  sync to avoid the current NN SPOF, for example, so I'm
 thinking 
 this 
 
   could be doable with Solr masters, too, no?
  
  If you add fault-tolerant, you run into the CAP  Theorem.  Consistency,
 
 availability, partition: choose two. You cannot have  it  all.
 
 Right, so I'll take Consistency and Availability, and I'll  put my 2
 masters in 
 the same rack (which has redundant switches, power  supply, etc.) and
 thus 
 minimize/avoid partitioning.
 Assuming the above  actually works, I think my Q remains:
 
 How do you set up 2 Solr masters so  they are in near real-time sync?
 DRBD?
 
 But here is maybe a simpler  scenario that more people may be
 considering:
 
 Imagine 2 masters on 2  different servers in 1 rack, pointing to the same
 index 
 on the shared  storage (SAN) that also happens to live in the same rack.
 2 Solr masters are  behind 1 LB VIP that indexer talks to.
 The VIP is configured so that all  requests always get routed to the
 primary 
 master (because only 1 master  can be modifying an index at a time),
 except when 
 this primary is down,  in which case the requests are sent to the
 secondary 
 master.
 
 So in  this case my Q is around automation of this, around Lucene index
 locks, 
 around the need for manual intervention, and such.
 Concretely, if you  have these 2 master instances, the primary master has
 the 
 Lucene index  lock in the index dir.  When the secondary master needs to
 take 
 over  (i.e., when it starts receiving documents via LB), it needs to be
 able to 
 write to that same index.  But what if that lock is still around?   One
 could use 
 the Native lock to make the lock disappear if the primary  master's JVM
 exited 
 unexpectedly, and in that case everything *should*  work and be
 completely 
 transparent, right?  That is, the secondary  will start getting new docs,
 it will 
 use its IndexWriter to write to that  same shared index, which won't be
 locked 
 for writes because the lock is  gone, and everyone will be happy.  Did I
 miss 
 something important  here?
 
 Assuming the above is correct, what if the lock is *not* gone  because
 the 
 primary master's JVM is actually not dead, although maybe  unresponsive,
 so LB 
 thinks the primary master is dead.  Then the LB  will route indexing
 requests to 
 the secondary master, which will attempt  to write to the index, but be
 denied 
 because of the lock.  So a  human needs to jump in, remove the lock, and
 manually 
 reindex failed docs  if the upstream component doesn't buffer docs that
 failed to 
 get indexed  and doesn't retry indexing them automatically.  Is this
 correct or 
 is there a way to avoid humans  here?
 
 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 


Re: Solr Hanging all of sudden with update/csv

2011-03-09 Thread Otis Gospodnetic
Hi,

You'll benefit from watching this segment merging video:
  http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

And you'll appreciate the graph at the bottom:
  http://code.google.com/p/zoie/wiki/ZoieMergePolicy

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: danomano dshopk...@earthlink.net
 To: solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 1:17:08 PM
 Subject: Re: Solr Hanging all of sudden with update/csv
 
 After about 4-5 hours the merge completed (ran out of heap). As you
 suggested, it was having memory issues.

 Read queries during the merge were working just fine (they were taking
 longer than normal, ~30-60 seconds).

 I think I need to do more reading on understanding the merge/optimization
 processes.

 I am beginning to think what I need to do is have lots of segments (i.e.
 frequent merges of smaller sized segments) - wouldn't that speed up the
 merging process when it actually runs?

 A couple things I'm trying to wrap my head around:

 Increasing the segments will improve indexing speed on the whole.
 The question I have is: when it needs to actually perform a merge, will
 having more segments be better (i.e. make the merge process faster)? Or
 longer? Having a 4-hour merge (aka indexing request) is not really
 acceptable (unless I can control when that merge happens).

 We are using our Solr server differently than most: frequent inserts (in
 batches), with few reads.

 I would say having a 'long' query time is acceptable (say ~60 seconds).
 
 
 
 
 
 --
 View this message  in context: 
http://lucene.472066.n3.nabble.com/Solr-Hanging-all-of-sudden-with-update-csv-tp2652903p2656457.html

 Sent  from the Solr - User mailing list archive at Nabble.com.
 


Re: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Jake Luciani
Hi Otis,

Have you considered using Solandra with Quorum writes
to achieve master/master with CA semantics?

-Jake


On Wed, Mar 9, 2011 at 2:48 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Hi,

  Original Message 

  From: Robert Petersen rober...@buy.com
 
  Can't you skip the SAN and keep the indexes locally?  Then you  would
  have two redundant copies of the index and no lock issues.

 I could, but then I'd have the issue of keeping them in sync, which seems
 more
 fragile.  I think SAN makes things simpler overall.

  Also, Can't master02 just be a slave to master01 (in the master farm  and
  separate from the slave farm) until such time as master01 fails?   Then

 No, because it wouldn't be in sync.  It would always be N minutes behind,
 and
 when the primary master fails, the secondary would not have all the docs -
 data
 loss.

  master02 would start receiving the new documents with an  indexes
  complete up to the last replication at least and the other slaves  would
  be directed by LB to poll master02 also...

 Yeah, complete up to the last replication is the problem.  It's a data
 gap
 that now needs to be filled somehow.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/


  -Original  Message-
  From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
  Sent: Wednesday, March 09, 2011 9:47 AM
  To: solr-user@lucene.apache.org
  Subject:  Re: True master-master fail-over without data gaps (choosing CA
  in  CAP)
 
  Hi,
 
 
  - Original Message 
   From: Walter  Underwood wun...@wunderwood.org
 
   On  Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:
  
You mean  it's  not possible to have 2 masters that are in nearly
  real-time
  sync?
How  about with DRBD?  I know people use  DRBD to keep 2 Hadoop NNs
  (their
  edit
  
logs) in  sync to avoid the current NN SPOF, for example, so I'm
  thinking
  this
  
could be doable with Solr masters, too, no?
  
   If you add fault-tolerant, you run into the CAP  Theorem.  Consistency,
 
  availability, partition: choose two. You cannot have  it  all.
 
  Right, so I'll take Consistency and Availability, and I'll  put my 2
  masters in
  the same rack (which has redundant switches, power  supply, etc.) and
  thus
  minimize/avoid partitioning.
  Assuming the above  actually works, I think my Q remains:
 
  How do you set up 2 Solr masters so  they are in near real-time sync?
  DRBD?
 
  But here is maybe a simpler  scenario that more people may be
  considering:
 
  Imagine 2 masters on 2  different servers in 1 rack, pointing to the same
  index
  on the shared  storage (SAN) that also happens to live in the same rack.
  2 Solr masters are  behind 1 LB VIP that indexer talks to.
  The VIP is configured so that all  requests always get routed to the
  primary
  master (because only 1 master  can be modifying an index at a time),
  except when
  this primary is down,  in which case the requests are sent to the
  secondary
  master.
 
  So in  this case my Q is around automation of this, around Lucene index
  locks,
  around the need for manual intervention, and such.
  Concretely, if you  have these 2 master instances, the primary master has
  the
  Lucene index  lock in the index dir.  When the secondary master needs to
  take
  over  (i.e., when it starts receiving documents via LB), it needs to be
  able to
  write to that same index.  But what if that lock is still around?   One
  could use
  the Native lock to make the lock disappear if the primary  master's JVM
  exited
  unexpectedly, and in that case everything *should*  work and be
  completely
  transparent, right?  That is, the secondary  will start getting new docs,
  it will
  use its IndexWriter to write to that  same shared index, which won't be
  locked
  for writes because the lock is  gone, and everyone will be happy.  Did I
  miss
  something important  here?
 
  Assuming the above is correct, what if the lock is *not* gone  because
  the
  primary master's JVM is actually not dead, although maybe  unresponsive,
  so LB
  thinks the primary master is dead.  Then the LB  will route indexing
  requests to
  the secondary master, which will attempt  to write to the index, but be
  denied
  because of the lock.  So a  human needs to jump in, remove the lock, and
  manually
  reindex failed docs  if the upstream component doesn't buffer docs that
  failed to
  get indexed  and doesn't retry indexing them automatically.  Is this
  correct or
  is there a way to avoid humans  here?
 
  Thanks,
  Otis
  
  Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/
 




-- 
http://twitter.com/tjake


Re: Solr Hanging all of sudden with update/csv

2011-03-09 Thread Jason Rutherglen
You will need to cap the maximum segment size using
LogByteSizeMergePolicy.setMaxMergeMB.  Then you will only have
segments that are of an optimal size, and Lucene will not try to
create gigantic segments.  I think though on the query side you will
run out of heap space due to the terms index size.  What version are
you using?

On Wed, Mar 9, 2011 at 10:17 AM, danomano dshopk...@earthlink.net wrote:
 After about 4-5 hours the merge completed (ran out of heap). As you
 suggested, it was having memory issues.

 Read queries during the merge were working just fine (they were taking
 longer than normal, ~30-60 seconds).

 I think I need to do more reading on understanding the merge/optimization
 processes.

 I am beginning to think what I need to do is have lots of segments (i.e.
 frequent merges of smaller sized segments) - wouldn't that speed up the
 merging process when it actually runs?

 A couple things I'm trying to wrap my head around:

 Increasing the segments will improve indexing speed on the whole.
 The question I have is: when it needs to actually perform a merge, will
 having more segments be better (i.e. make the merge process faster)? Or
 longer? Having a 4-hour merge (aka indexing request) is not really
 acceptable (unless I can control when that merge happens).

 We are using our Solr server differently than most: frequent inserts (in
 batches), with few reads.

 I would say having a 'long' query time is acceptable (say ~60 seconds).





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Hanging-all-of-sudden-with-update-csv-tp2652903p2656457.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: docBoost

2011-03-09 Thread Brian Lamb
That makes sense. As a follow up, is there a way to only conditionally use
the boost score? For example, in some cases I want to use the boost score
and in other cases I want all documents to be treated equally.

On Wed, Mar 9, 2011 at 2:42 PM, Jayendra Patil jayendra.patil@gmail.com
 wrote:

 you can use the ScriptTransformer to perform the boost calculation and
 add it to the row.
 http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

 <dataConfig>
   <script><![CDATA[
     function f1(row) {
       // Add boost
       row.put('$docBoost', 1.5);
       return row;
     }
   ]]></script>
   <document>
     <entity name="e" pk="id" transformer="script:f1"
             query="select * from X">
     </entity>
   </document>
 </dataConfig>

 Regards,
 Jayendra


 On Wed, Mar 9, 2011 at 2:01 PM, Brian Lamb
 brian.l...@journalexperts.com wrote:
  Anyone have any clue on this on?
 
  On Tue, Mar 8, 2011 at 2:11 PM, Brian Lamb 
 brian.l...@journalexperts.comwrote:
 
  Hi all,
 
  I am using dataimport to create my index and I want to use docBoost to
  assign some higher weights to certain docs. I understand the concept
 behind
  docBoost but I haven't been able to find an example anywhere that shows
 how
  to implement it. Assuming the following config file:
 
  <document>
    <entity name="animal"
            dataSource="animals"
            pk="id"
            query="SELECT * FROM animals">
      <field column="id" name="id" />
      <field column="genus" name="genus" />
      <field column="species" name="species" />
      <entity name="boosters"
              dataSource="boosts"
              query="SELECT boost_score FROM boosts WHERE animal_id = ${animal.id}">
        <field column="boost_score" name="boost_score" />
      </entity>
    </entity>
  </document>
 
  How do I add in a docBoost score? The boost score is currently in a
  separate table as shown above.
 
 



Excluding results from more like this

2011-03-09 Thread Brian Lamb
Hi all,

I'm using MoreLikeThis to find similar results but I'd like to exclude
records by the id number. For example, I use the following URL:

http://localhost:8983/solr/search/?q=id:(2 3
5)&mlt=true&mlt.fl=description,id&fl=*,score

How would I exclude record 4 from the MoreLikeThis results?

I tried,

http://localhost:8983/solr/search/?q=id:(2 3
5)&mlt=true&mlt.fl=description,id&fl=*,score&mlt.q=!4

But that still returned record 4 in the MoreLikeThis results.


Fwd: some relational-type grouping with search

2011-03-09 Thread l . blevins

- Forwarded Message - 
From: l blevins l.blev...@comcast.net 
To: solr user mail solr-user-h...@lucene.apache.org 
Sent: Wednesday, March 9, 2011 4:03:06 PM 
Subject: some relational-type grouping with search 




I have a large database for which we have some good search capabilities now, but 
am interested to see if SOLR might be usable instead.  That would gain us the 
additional text-search features and eliminate the high fees for some of the 
database features. 



If I have fields such as person_id, document_date, and measurement_value. 
 I need to be able to fulfill the following types of searches that I cannot 
figure out how to do now: 



   * limit search to only the most recent (or earliest) document per person 
along with whatever other criteria is present (each person's LAST or FIRST 
document), 

   * search and only return the most recent document per person (LAST or FIRST 
meeting the other criteria), 

   * limit search to only the documents with the max or min measurement_value 
per person, 
   * search and return only the max or min measurement_value per person 



All of these boil down to limiting by the max or min of either a date or 
numeric field within a group (by person in this case).  I know these features 
are considered relational and that SOLR has declared that it is not really a 
relational search engine, but a number of highly placed persons that I work for 
are very interested in using SOLR.  If we could satisfy this type of query, 
SOLR could fit our needs so I feel compelled to ask this group if these 
searches are possible.
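Server-side, this kind of per-group max/min is what the field collapsing work (the SOLR-236 patch line) aims at; it was not in a release at the time. A client-side fallback is to over-fetch and reduce per person after the search. A minimal Python sketch, where the field names come from the question and the docs list stands in for a parsed Solr response:

```python
# Keep one document per person, chosen by max (or min) of a field.
# Note: ISO-8601 date strings compare correctly as plain strings.

def one_per_person(docs, field, keep="max"):
    pick = max if keep == "max" else min
    best = {}
    for doc in docs:
        pid = doc["person_id"]
        best[pid] = doc if pid not in best else pick(
            best[pid], doc, key=lambda d: d[field])
    return list(best.values())

docs = [
    {"person_id": 1, "document_date": "2011-01-05", "measurement_value": 7},
    {"person_id": 1, "document_date": "2011-03-01", "measurement_value": 3},
    {"person_id": 2, "document_date": "2011-02-10", "measurement_value": 9},
]
latest = one_per_person(docs, "document_date", keep="max")
print(sorted(d["document_date"] for d in latest))  # ['2011-02-10', '2011-03-01']
```

The same helper with keep="min" on measurement_value covers the min-per-person cases; the obvious caveat is that the client must fetch every candidate document per person, which may not scale to very large result sets.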

Re: Excluding results from more like this

2011-03-09 Thread Otis Gospodnetic
Brian,

...?q=id:(2  3 5) -4


Otis
---
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Brian Lamb brian.l...@journalexperts.com
 To: solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 4:05:10 PM
 Subject: Excluding results from more like this
 
 Hi all,
 
 I'm using MoreLikeThis to find similar results but I'd like to  exclude
 records by the id number. For example, I use the following  URL:
 
 http://localhost:8983/solr/search/?q=id:(2  3
 5)mlt=truemlt.fl=description,idfl=*,score
 
 How would I  exclude record 4 form the MoreLikeThis results?
 
 I tried,
 
 http://localhost:8983/solr/search/?q=id:(2  3
 5)mlt=truemlt.fl=description,idfl=*,scoremlt.q=!4
 
 But  that still returned record 4 in the MoreLikeThisResults.
 


Same index is ranking differently on 2 machines

2011-03-09 Thread Allistair Crossley
Hi,

I am seeing an issue I do not understand and hope that someone can shed some 
light on this. The issue is that for a particular search we are seeing a 
particular result rank in position 3 on one machine and position 8 on the 
production machine. The position 3 is our desired and roughly expected ranking.

I have a local machine with solr and a version deployed on a production server. 
My local machine's solr and the production version are both checked out from 
our project's SVN trunk. They are identical files except for the data files 
(not in SVN) and database connection settings.

The index is populated exclusively via data import handler queries to a 
database. 

I have exported the production database as-is to my local development machine 
so that my local machine and production have access to the self same data.

I execute a total full-import on both.

Still, I see a different position for this document that should surely rank in 
the same location, all else being equal.

I ran debugQuery diff to see how the scores were being computed. See appendix 
at foot of this email.

As far as I can tell every single query normalisation block of the debug is 
marginally different, e.g.

-0.021368012 = queryNorm (local)
+0.009944122 = queryNorm (production)

Which leads to a final score of -2 versus +1 which is enough to skew the 
results from correct to incorrect (in terms of what we expect to see).

- -2.286596 (local)
+1.0651637 = (production)

I cannot explain this difference. The database is the same. The configuration 
is the same. I have fully imported from scratch on both servers. What am I 
missing?

Thank you for your time

Allistair

- snip

APPENDIX - debugQuery=on DIFF 

--- untitled
+++ (clipboard)
@@ -1,51 +1,49 @@
-str name=L12411p
+str name=L12411
 
-2.286596 = (MATCH) sum of:
-  1.6891675 = (MATCH) sum of:
-1.3198489 = (MATCH) max plus 0.01 times others of:
-  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-0.011795795 = queryWeight(text:dubai^0.1), product of:
-  0.1 = boost
+1.0651637 = (MATCH) sum of:
+  0.7871359 = (MATCH) sum of:
+0.6151879 = (MATCH) max plus 0.01 times others of:
+  0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
+0.05489459 = queryWeight(text:dubai), product of:
   5.520305 = idf(docFreq=65, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
   1.4142135 = tf(termFreq(text:dubai)=2)
   5.520305 = idf(docFreq=65, maxDocs=6063)
   0.25 = fieldNorm(field=text, doc=1551)
-  1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
-0.32609802 = queryWeight(profile:dubai^2.0), product of:
+  0.6141165 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
+0.15175761 = queryWeight(profile:dubai^2.0), product of:
   2.0 = boost
   7.6305184 = idf(docFreq=7, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 4.0466933 = (MATCH) fieldWeight(profile:dubai in 1551), product of:
   1.4142135 = tf(termFreq(profile:dubai)=2)
   7.6305184 = idf(docFreq=7, maxDocs=6063)
   0.375 = fieldNorm(field=profile, doc=1551)
-0.36931866 = (MATCH) max plus 0.01 times others of:
-  0.0018293816 = (MATCH) weight(text:product^0.1 in 1551), product of:
-0.003954251 = queryWeight(text:product^0.1), product of:
-  0.1 = boost
+0.17194802 = (MATCH) max plus 0.01 times others of:
+  0.00851347 = (MATCH) weight(text:product in 1551), product of:
+0.018402064 = queryWeight(text:product), product of:
   1.8505468 = idf(docFreq=2589, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 0.4626367 = (MATCH) fieldWeight(text:product in 1551), product of:
   1.0 = tf(termFreq(text:product)=1)
   1.8505468 = idf(docFreq=2589, maxDocs=6063)
   0.25 = fieldNorm(field=text, doc=1551)
-  0.36930037 = (MATCH) weight(profile:product^2.0 in 1551), product of:
-0.1725098 = queryWeight(profile:product^2.0), product of:
+  0.17186289 = (MATCH) weight(profile:product^2.0 in 1551), product of:
+0.08028162 = queryWeight(profile:product^2.0), product of:
   2.0 = boost
   4.036637 = idf(docFreq=290, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 2.14075 = (MATCH) fieldWeight(profile:product in 1551), product of:
   1.4142135 = tf(termFreq(profile:product)=2)
   4.036637 = idf(docFreq=290, maxDocs=6063)
   0.375 = fieldNorm(field=profile, doc=1551)
-  0.59742856 = (MATCH) max plus 0.01 times others of:
-0.59742856 = weight(profile:dubai product~10^0.5 in 1551), product of:
-  0.12465195 = queryWeight(profile:dubai product~10^0.5), product of:
+  

FunctionQueries and FieldCache and OOM

2011-03-09 Thread Markus Jelsma
Hi,

In one of the environments i'm working on (4 Solr 1.4.1. nodes with 
replication, 3+ million docs, ~5.5GB index size, high commit rate (~1-2min), 
high query rate (~50q/s), high number of updates (~1000docs/commit)) the nodes 
continuously run out of memory.

During development we frequently ran excessive stress tests and after tuning 
JVM and Solr settings all ran fine. A while ago i added the DisMax bq parameter 
for boosting recent documents, documents older than a day receive 50% less 
boost, similar to the example but with a much steeper slope. For clarity, i'm 
not using the ordinal function but the reciprocal version in the bq parameter 
which is warned against when using Solr 1.4.1 according to the wiki.

This week we started the stress tests and nodes are going down again. I've 
reconfigured the nodes to have different settings for the bq parameter (or no 
bq 
parameter).

It seems the bq the cause of the misery.

Issue SOLR- keeps popping up but it has not been resolved. Is there anyone 
who can confirm one of those patches fixes this issue before i waste hours of 
work finding out it doesn't? ;)

Am i correct when i assume that Lucene FieldCache entries are added for each 
unique function query?  In that case, every query is a unique cache entry 
because it operates on milliseconds. If that doesn't work i might be able to 
reduce precision by operating on minutes or even more instead of 
milliseconds. I, however, cannot use other nice math functions in the ms() 
parameter so that might make things difficult.

However, date math seems available (NOW/HOUR) so i assume it would also work 
for SOME_DATE_FIELD/HOUR as well. This way i just might prevent useless 
entries.
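The precision-reduction idea can be sketched outside Solr: if the timestamp fed into the recency boost is truncated to the hour, every request within the same hour generates an identical function-query string, so its cache entry is reused instead of a new one being created per request. A hypothetical Python sketch (the recip(ms(...)) template follows the wiki's recency-boost example, but the exact constants and the `tstamp` field name here are illustrative):

```python
# Truncate the epoch-millisecond timestamp to the hour, so the
# generated boost function-query string repeats across requests.

MS_PER_HOUR = 3600 * 1000

def round_to_hour(epoch_ms):
    return (epoch_ms // MS_PER_HOUR) * MS_PER_HOUR

def recency_bq(epoch_ms):
    # Template follows the wiki's recency boost; constants illustrative.
    return "recip(ms(%d,tstamp),3.16e-11,1,1)" % round_to_hour(epoch_ms)

# Two requests 90 seconds apart yield the identical query string:
print(recency_bq(1299700000000) == recency_bq(1299700090000))  # True
```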

My apologies for this long mail but it may prove useful for other users and 
hopefully we find the solution and can update the wiki to add this warning.

Cheers,


Re: Excluding results from more like this

2011-03-09 Thread Brian Lamb
That doesn't seem to do it. Record 4 is still showing up in the MoreLikeThis
results.

On Wed, Mar 9, 2011 at 4:12 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Brian,

 ...?q=id:(2  3 5) -4


 Otis
 ---
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
  From: Brian Lamb brian.l...@journalexperts.com
  To: solr-user@lucene.apache.org
  Sent: Wed, March 9, 2011 4:05:10 PM
  Subject: Excluding results from more like this
 
  Hi all,
 
  I'm using MoreLikeThis to find similar results but I'd like to  exclude
  records by the id number. For example, I use the following  URL:
 
  http://localhost:8983/solr/search/?q=id:(2  3
  5)mlt=truemlt.fl=description,idfl=*,score
 
  How would I  exclude record 4 form the MoreLikeThis results?
 
  I tried,
 
  http://localhost:8983/solr/search/?q=id:(2  3
  5)mlt=truemlt.fl=description,idfl=*,scoremlt.q=!4
 
  But  that still returned record 4 in the MoreLikeThisResults.
 



Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Jayendra Patil
queryNorm is just a normalizing factor and is the same value across
all the results for a query, to make the scores comparable.
So even if it varies between environments, you should not worry about it.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
-
Definition - queryNorm(q) is just a normalizing factor used to make
scores between queries comparable. This factor does not affect
document ranking (since all ranked documents are multiplied by the
same factor), but rather just attempts to make scores from different
queries (or even different indexes) comparable
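A toy numeric check of that definition (a sketch; the weights below are made up, not taken from the debug output): queryNorm is 1/sqrt(sumOfSquaredWeights), one constant per query, so multiplying every document's score by it can never change their order.

```python
import math

# queryNorm = 1/sqrt(sum of squared query weights): one constant per
# query, multiplied into every document's score, so it rescales but
# never reorders. The weights below are made up for illustration.

def query_norm(weights):
    return 1.0 / math.sqrt(sum(w * w for w in weights))

raw_scores = [4.2, 1.9, 3.1]      # per-document raw scores
norm_a = query_norm([0.5, 2.0])   # e.g. one environment
norm_b = query_norm([0.15, 0.9])  # e.g. the other environment

def rank(norm):
    return sorted(range(len(raw_scores)), key=lambda i: -raw_scores[i] * norm)

print(rank(norm_a) == rank(norm_b))  # True: same order, different scores
```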

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:
 Hi,

 I am seeing an issue I do not understand and hope that someone can shed some 
 light on this. The issue is that for a particular search we are seeing a 
 particular result rank in position 3 on one machine and position 8 on the 
 production machine. The position 3 is our desired and roughly expected 
 ranking.

 I have a local machine with solr and a version deployed on a production 
 server. My local machine's solr and the production version are both checked 
 out from our project's SVN trunk. They are identical files except for the 
 data files (not in SVN) and database connection settings.

 The index is populated exclusively via data import handler queries to a 
 database.

 I have exported the production database as-is to my local development machine 
 so that my local machine and production have access to the self same data.

 I execute a total full-import on both.

 Still, I see a different position for this document that should surely rank 
 in the same location, all else being equal.

 I ran debugQuery diff to see how the scores were being computed. See appendix 
 at foot of this email.

 As far as I can tell every single query normalisation block of the debug is 
 marginally different, e.g.

 -        0.021368012 = queryNorm (local)
 +        0.009944122 = queryNorm (production)

 Which leads to a final score of -2 versus +1 which is enough to skew the 
 results from correct to incorrect (in terms of what we expect to see).

 - -2.286596 (local)
 +1.0651637 = (production)

 I cannot explain this difference. The database is the same. The configuration 
 is the same. I have fully imported from scratch on both servers. What am I 
 missing?

 Thank you for your time

 Allistair

  [debugQuery diff snipped; quoted in full in the original message above]

Re: Excluding results from more like this

2011-03-09 Thread Jonathan Rochkind
Yeah, that just restricts what items are in your main result set (and 
adding -4 has no real effect).


The more like this set is constructed based on your main result set, for 
each document in it.


As far as I can see from here: http://wiki.apache.org/solr/MoreLikeThis

..there seems to be no built-in way to customize the 'more like this' 
results in the way you want, excluding certain document id's.  I don't 
entirely understand what mlt.boost  does, but I don't think it does 
anything useful for this case.


So, if that's so,  you are out of luck, unless you want to write Java 
code. In which case you could try customizing or adding that feature to 
the MoreLikeThis search component, and either suggest your new code back 
as a patch, or just use your own customized version of MoreLikeThis.
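Until such an option exists (any mlt.exclude-style parameter would be hypothetical), a client-side workaround is to filter the moreLikeThis section of the JSON response before display. A sketch, with a hand-built response dict standing in for parsed wt=json output:

```python
# Drop excluded ids from the moreLikeThis section of a parsed Solr
# JSON response. The dict below is hand-built to mirror the shape of
# wt=json output; in practice it would come from parsing Solr's reply.

def filter_mlt(response, excluded_ids):
    excluded = set(excluded_ids)
    for mlt in response.get("moreLikeThis", {}).values():
        mlt["docs"] = [d for d in mlt["docs"] if d["id"] not in excluded]
    return response

response = {
    "moreLikeThis": {
        "2": {"docs": [{"id": "4"}, {"id": "9"}]},
        "3": {"docs": [{"id": "4"}, {"id": "7"}]},
    }
}
filtered = filter_mlt(response, excluded_ids={"4"})
print([d["id"] for d in filtered["moreLikeThis"]["2"]["docs"]])  # ['9']
```

The downside is that each MLT list shrinks after filtering; requesting a slightly larger mlt.count compensates for the dropped entries.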


On 3/9/2011 4:29 PM, Brian Lamb wrote:

That doesn't seem to do it. Record 4 is still showing up in the MoreLikeThis
results.

On Wed, Mar 9, 2011 at 4:12 PM, Otis Gospodneticotis_gospodne...@yahoo.com

wrote:
Brian,

...?q=id:(2  3 5) -4


Otis
---
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 

From: Brian Lambbrian.l...@journalexperts.com
To: solr-user@lucene.apache.org
Sent: Wed, March 9, 2011 4:05:10 PM
Subject: Excluding results from more like this

Hi all,

I'm using MoreLikeThis to find similar results but I'd like to  exclude
records by the id number. For example, I use the following  URL:

http://localhost:8983/solr/search/?q=id:(2  3
5)mlt=truemlt.fl=description,idfl=*,score

How would I  exclude record 4 form the MoreLikeThis results?

I tried,

http://localhost:8983/solr/search/?q=id:(2  3
5)mlt=truemlt.fl=description,idfl=*,scoremlt.q=!4

But  that still returned record 4 in the MoreLikeThisResults.



Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Jonathan Rochkind
Yes, but the identical index with the identical solrconfig.xml and the 
identical query and the identical version of Solr on two different 
machines should produce identical results.


So it's a legitimate question why it's not.  But perhaps queryNorm isn't 
enough to answer that. Sorry, it's out of my league to try and figure 
out it out.


But are you absolutely sure you have identical indexes, identical 
solrconfig.xml, identical queries, and identical versions of Solr and 
any other installed Java libraries... on both machines?  One of these 
being different seems more likely than a bug in Solr, although that's 
possible.


On 3/9/2011 4:34 PM, Jayendra Patil wrote:

queryNorm is just a normalizing factor and is the same value across
all the results for a query, to just make the scores comparable.
So even if it varies between environments, you should not worry about it.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
-
Definition - queryNorm(q) is just a normalizing factor used to make
scores between queries comparable. This factor does not affect
document ranking (since all ranked documents are multiplied by the
same factor), but rather just attempts to make scores from different
queries (or even different indexes) comparable

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossleya...@roxxor.co.uk  wrote:

Hi,

I am seeing an issue I do not understand and hope that someone can shed some 
light on this. The issue is that for a particular search we are seeing a 
particular result rank in position 3 on one machine and position 8 on the 
production machine. The position 3 is our desired and roughly expected ranking.

I have a local machine with solr and a version deployed on a production server. 
My local machine's solr and the production version are both checked out from 
our project's SVN trunk. They are identical files except for the data files 
(not in SVN) and database connection settings.

The index is populated exclusively via data import handler queries to a 
database.

I have exported the production database as-is to my local development machine 
so that my local machine and production have access to the self same data.

I execute a total full-import on both.

Still, I see a different position for this document that should surely rank in 
the same location, all else being equal.

I ran debugQuery diff to see how the scores were being computed. See appendix 
at foot of this email.

As far as I can tell every single query normalisation block of the debug is 
marginally different, e.g.

-0.021368012 = queryNorm (local)
+0.009944122 = queryNorm (production)

Which leads to a final score of -2 versus +1 which is enough to skew the 
results from correct to incorrect (in terms of what we expect to see).

- -2.286596 (local)
+1.0651637 = (production)

I cannot explain this difference. The database is the same. The configuration 
is the same. I have fully imported from scratch on both servers. What am I 
missing?

Thank you for your time

Allistair

 [debugQuery diff snipped; quoted in full in the original message above]

Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Allistair Crossley
Thanks. Good to know, but even so my problem remains - the end score should not 
be different and is causing a dramatically different ranking of a document (3 
versus 7 is dramatic for my client). This must be down to the scoring debug 
differences - it's the only difference I can find :(

On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote:

 queryNorm is just a normalizing factor and is the same value across
 all the results for a query, to just make the scores comparable.
 So even if it varies between environments, you should not worry about it.
 
 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
 -
 Definition - queryNorm(q) is just a normalizing factor used to make
 scores between queries comparable. This factor does not affect
 document ranking (since all ranked documents are multiplied by the
 same factor), but rather just attempts to make scores from different
 queries (or even different indexes) comparable
 
 Regards,
 Jayendra
 
 On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:
 Hi,
 
 I am seeing an issue I do not understand and hope that someone can shed some 
 light on this. The issue is that for a particular search we are seeing a 
 particular result rank in position 3 on one machine and position 8 on the 
 production machine. The position 3 is our desired and roughly expected 
 ranking.
 
 I have a local machine with solr and a version deployed on a production 
 server. My local machine's solr and the production version are both checked 
 out from our project's SVN trunk. They are identical files except for the 
 data files (not in SVN) and database connection settings.
 
 The index is populated exclusively via data import handler queries to a 
 database.
 
 I have exported the production database as-is to my local development 
 machine so that my local machine and production have access to the self same 
 data.
 
 I execute a total full-import on both.
 
 Still, I see a different position for this document that should surely rank 
 in the same location, all else being equal.
 
 I ran debugQuery diff to see how the scores were being computed. See 
 appendix at foot of this email.
 
 As far as I can tell every single query normalisation block of the debug is 
 marginally different, e.g.
 
 -0.021368012 = queryNorm (local)
 +0.009944122 = queryNorm (production)
 
 Which leads to a final score of -2 versus +1 which is enough to skew the 
 results from correct to incorrect (in terms of what we expect to see).
 
 - -2.286596 (local)
 +1.0651637 = (production)
 
 I cannot explain this difference. The database is the same. The 
 configuration is the same. I have fully imported from scratch on both 
 servers. What am I missing?
 
 Thank you for your time
 
 Allistair
 
  [debugQuery diff snipped; quoted in full in the original message above]

Indexing a text string for faceting

2011-03-09 Thread Greg Georges
Hello all,

I have a small problem with my faceting fields. In all I create a new faceting 
field which is indexed and not stored, and use copyField. The problem is I 
facet on category names which have examples like this

Policies & Documentation 
(37)http://localhost:8080/apache-solr-1.4.1/select?q=Checklist%20Employee%20Hiring&facet=on&facet.field=fcategoryName&fq=fcategoryName:Policies%20%26%20Documentation
Forms & Checklists 
(22)http://localhost:8080/apache-solr-1.4.1/select?q=Checklist%20Employee%20Hiring&facet=on&facet.field=fcategoryName&fq=fcategoryName:Forms%20%26%20Checklists

Right now my fields were using the string type, which is not good because I 
think by default it is using a tokenizer etc.. I think I must define a new type 
field so that my category names will be properly indexed as a facet field. Here 
is what I have now

<field name="categoryName" type="text" indexed="true" stored="true" />
<field name="typeName" type="text" indexed="true" stored="true" />
<field name="ftypeName" type="string" indexed="true" stored="false" 
multiValued="true"/>
<field name="fcategoryName" type="string" indexed="true" stored="false" 
multiValued="true"/>

<copyField source="typeName" dest="ftypeName"/>
<copyField source="categoryName" dest="fcategoryName"/>

Can someone give me a type configuration which will support my category names 
which have whitespaces and ampersands?
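Whatever the field type, the ampersand in the facet value must also be URL-encoded, or it is read as a parameter separator. A sketch in modern Python (the parameter names mirror the URLs above; quoting the fq value is standard Solr query syntax for keeping a multi-word term intact against an untokenized string field):

```python
from urllib.parse import urlencode

# Let the client library escape facet values: the literal ampersand in
# the category name becomes %26 instead of being read as a parameter
# separator, and quoting the fq value keeps the multi-word string term
# intact against the untokenized "string" field.

params = {
    "q": "Checklist Employee Hiring",
    "facet": "on",
    "facet.field": "fcategoryName",
    "fq": 'fcategoryName:"Policies & Documentation"',
}
qs = urlencode(params)
print("%26" in qs)  # True: the ampersand is escaped
```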

Thanks in advance

Greg


Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Allistair Crossley
That's what I think, glad I am not going mad.

I've spent 1/2 a day comparing the config files, checking out from SVN again 
and ensuring the databases are identical. I cannot see what else I can do to 
make them equivalent. Both servers checkout directly from SVN, I am convinced 
the files are the same. The database is definitely the same. 

Not sure what you mean about having identical indices - that's my problem - I 
don't - or do you mean something else I've missed? But yes everything else you 
mention is identical, I am as certain as I can be. 

I too think there must be a difference I have missed but I have run out of 
ideas for what to check!

Frustrating :)

On Mar 9, 2011, at 4:38 PM, Jonathan Rochkind wrote:

 Yes, but the identical index with the identical solrconfig.xml and the 
 identical query and the identical version of Solr on two different machines 
 should produce identical results.
 
 So it's a legitimate question why it's not.  But perhaps queryNorm isn't 
 enough to answer that. Sorry, it's out of my league to try and figure it out.
 
 But are you absolutely sure you have identical indexes, identical 
 solrconfig.xml, identical queries, and identical versions of Solr and any 
 other installed Java libraries... on both machines?  One of these being 
 different seems more likely than a bug in Solr, although that's possible.
 
 On 3/9/2011 4:34 PM, Jayendra Patil wrote:
 queryNorm is just a normalizing factor and is the same value across
 all the results for a query, to just make the scores comparable.
 So even if it varies in different environments, you should not be worried about it.
 
 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
 -
 Definition - queryNorm(q) is just a normalizing factor used to make
 scores between queries comparable. This factor does not affect
 document ranking (since all ranked documents are multiplied by the
 same factor), but rather just attempts to make scores from different
 queries (or even different indexes) comparable
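The "same factor for every document" point is easy to sanity-check with a toy sketch (the formula is Lucene DefaultSimilarity's queryNorm; all weights and scores below are made up):

```python
import math

def query_norm(term_weights):
    # Lucene DefaultSimilarity: queryNorm = 1 / sqrt(sumOfSquaredWeights)
    return 1.0 / math.sqrt(sum(w * w for w in term_weights))

raw_scores = [4.2, 0.7, 2.1]            # hypothetical un-normalized per-doc scores
norm = query_norm([0.5, 1.5])           # hypothetical per-term query weights

normalized = [s * norm for s in raw_scores]

# Every document gets multiplied by the same positive constant,
# so the ranking cannot change.
ranking_before = sorted(range(len(raw_scores)), key=lambda i: -raw_scores[i])
ranking_after = sorted(range(len(normalized)), key=lambda i: -normalized[i])
assert ranking_before == ranking_after
```

A different queryNorm between two machines therefore cannot, by itself, explain a document moving from position 3 to position 8.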
 
 Regards,
 Jayendra
 
 On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossleya...@roxxor.co.uk  wrote:
 Hi,
 
 I am seeing an issue I do not understand and hope that someone can shed 
 some light on this. The issue is that for a particular search we are seeing 
 a particular result rank in position 3 on one machine and position 8 on the 
 production machine. The position 3 is our desired and roughly expected 
 ranking.
 
 I have a local machine with solr and a version deployed on a production 
 server. My local machine's solr and the production version are both checked 
 out from our project's SVN trunk. They are identical files except for the 
 data files (not in SVN) and database connection settings.
 
 The index is populated exclusively via data import handler queries to a 
 database.
 
 I have exported the production database as-is to my local development 
 machine so that my local machine and production have access to the self 
 same data.
 
 I execute a total full-import on both.
 
 Still, I see a different position for this document that should surely rank 
 in the same location, all else being equal.
 
 I ran debugQuery diff to see how the scores were being computed. See 
 appendix at foot of this email.
 
 As far as I can tell every single query normalisation block of the debug is 
 marginally different, e.g.
 
 -0.021368012 = queryNorm (local)
 +0.009944122 = queryNorm (production)
 
 Which leads to a final score of -2 versus +1 which is enough to skew the 
 results from correct to incorrect (in terms of what we expect to see).
 
 - -2.286596 (local)
 +1.0651637 = (production)
 
 I cannot explain this difference. The database is the same. The 
 configuration is the same. I have fully imported from scratch on both 
 servers. What am I missing?
 
 Thank you for your time
 
 Allistair
 
 - snip
 
 APPENDIX - debugQuery=on DIFF
 
 --- untitled
 +++ (clipboard)
 @@ -1,51 +1,49 @@
 -str name=L12411p
 +str name=L12411
 
 -2.286596 = (MATCH) sum of:
 -  1.6891675 = (MATCH) sum of:
 -1.3198489 = (MATCH) max plus 0.01 times others of:
 -  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
 -0.011795795 = queryWeight(text:dubai^0.1), product of:
 -  0.1 = boost
 +1.0651637 = (MATCH) sum of:
 +  0.7871359 = (MATCH) sum of:
 +0.6151879 = (MATCH) max plus 0.01 times others of:
 +  0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
 +0.05489459 = queryWeight(text:dubai), product of:
   5.520305 = idf(docFreq=65, maxDocs=6063)
 -  0.021368012 = queryNorm
 +  0.009944122 = queryNorm
 1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
   1.4142135 = tf(termFreq(text:dubai)=2)
   5.520305 = idf(docFreq=65, maxDocs=6063)
   0.25 = fieldNorm(field=text, doc=1551)
 -  1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
 -

Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Jayendra Patil
Are you sure you have the same config ...
The boost seems different for the field text - text:dubai^0.1 vs text:dubai

-2.286596 = (MATCH) sum of:
-  1.6891675 = (MATCH) sum of:
-1.3198489 = (MATCH) max plus 0.01 times others of:
-  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-0.011795795 = queryWeight(text:dubai^0.1), product of:
-  0.1 = boost
+1.0651637 = (MATCH) sum of:
+  0.7871359 = (MATCH) sum of:
+0.6151879 = (MATCH) max plus 0.01 times others of:
+  0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
+0.05489459 = queryWeight(text:dubai), product of:

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:38 PM, Allistair Crossley a...@roxxor.co.uk wrote:
 Thanks. Good to know, but even so my problem remains - the end score should 
 not be different and is causing a dramatically different ranking of a 
 document (3 versus 7 is dramatic for my client). This must be down to the 
 scoring debug differences - it's the only difference I can find :(

 On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote:

 queryNorm is just a normalizing factor and is the same value across
 all the results for a query, to just make the scores comparable.
 So even if it varies in different environments, you should not be worried about it.

 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
 -
 Definition - queryNorm(q) is just a normalizing factor used to make
 scores between queries comparable. This factor does not affect
 document ranking (since all ranked documents are multiplied by the
 same factor), but rather just attempts to make scores from different
 queries (or even different indexes) comparable

 Regards,
 Jayendra

 On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:
 Hi,

 I am seeing an issue I do not understand and hope that someone can shed 
 some light on this. The issue is that for a particular search we are seeing 
 a particular result rank in position 3 on one machine and position 8 on the 
 production machine. The position 3 is our desired and roughly expected 
 ranking.

 I have a local machine with solr and a version deployed on a production 
 server. My local machine's solr and the production version are both checked 
 out from our project's SVN trunk. They are identical files except for the 
 data files (not in SVN) and database connection settings.

 The index is populated exclusively via data import handler queries to a 
 database.

 I have exported the production database as-is to my local development 
 machine so that my local machine and production have access to the self 
 same data.

 I execute a total full-import on both.

 Still, I see a different position for this document that should surely rank 
 in the same location, all else being equal.

 I ran debugQuery diff to see how the scores were being computed. See 
 appendix at foot of this email.

 As far as I can tell every single query normalisation block of the debug is 
 marginally different, e.g.

 -        0.021368012 = queryNorm (local)
 +        0.009944122 = queryNorm (production)

 Which leads to a final score of -2 versus +1 which is enough to skew the 
 results from correct to incorrect (in terms of what we expect to see).

 - -2.286596 (local)
 +1.0651637 = (production)

 I cannot explain this difference. The database is the same. The 
 configuration is the same. I have fully imported from scratch on both 
 servers. What am I missing?

 Thank you for your time

 Allistair

 - snip

 APPENDIX - debugQuery=on DIFF

 --- untitled
 +++ (clipboard)
 @@ -1,51 +1,49 @@
 -str name=L12411p
 +str name=L12411

 -2.286596 = (MATCH) sum of:
 -  1.6891675 = (MATCH) sum of:
 -    1.3198489 = (MATCH) max plus 0.01 times others of:
 -      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
 -        0.011795795 = queryWeight(text:dubai^0.1), product of:
 -          0.1 = boost
 +1.0651637 = (MATCH) sum of:
 +  0.7871359 = (MATCH) sum of:
 +    0.6151879 = (MATCH) max plus 0.01 times others of:
 +      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
 +        0.05489459 = queryWeight(text:dubai), product of:
           5.520305 = idf(docFreq=65, maxDocs=6063)
 -          0.021368012 = queryNorm
 +          0.009944122 = queryNorm
         1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
           1.4142135 = tf(termFreq(text:dubai)=2)
           5.520305 = idf(docFreq=65, maxDocs=6063)
           0.25 = fieldNorm(field=text, doc=1551)
 -      1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
 -        0.32609802 = queryWeight(profile:dubai^2.0), product of:
 +      0.6141165 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
 +        0.15175761 = queryWeight(profile:dubai^2.0), product of:
           2.0 = boost
           7.6305184 = idf(docFreq=7, maxDocs=6063)
 -          0.021368012 = queryNorm
 +          0.009944122 = queryNorm
         4.0466933 

Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Yonik Seeley
On Wed, Mar 9, 2011 at 4:49 PM, Jayendra Patil
jayendra.patil@gmail.com wrote:
 Are you sure you have the same config ...
 The boost seems different for the field text - text:dubai^0.1 vs text:dubai

Yep...
Try adding echoParams=all and see all the parameters solr is acting on.
http://wiki.apache.org/solr/CoreQueryParameters#echoParams

-Yonik
http://lucidimagination.com


 -2.286596 = (MATCH) sum of:
 -  1.6891675 = (MATCH) sum of:
 -    1.3198489 = (MATCH) max plus 0.01 times others of:
 -      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
 -        0.011795795 = queryWeight(text:dubai^0.1), product of:
 -          0.1 = boost
 +1.0651637 = (MATCH) sum of:
 +  0.7871359 = (MATCH) sum of:
 +    0.6151879 = (MATCH) max plus 0.01 times others of:
 +      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
 +        0.05489459 = queryWeight(text:dubai), product of:

 Regards,
 Jayendra

 On Wed, Mar 9, 2011 at 4:38 PM, Allistair Crossley a...@roxxor.co.uk wrote:
 Thanks. Good to know, but even so my problem remains - the end score should 
 not be different and is causing a dramatically different ranking of a 
 document (3 versus 7 is dramatic for my client). This must be down to the 
 scoring debug differences - it's the only difference I can find :(

 On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote:

 queryNorm is just a normalizing factor and is the same value across
 all the results for a query, to just make the scores comparable.
 So even if it varies in different environments, you should not be worried about it.

 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
 -
 Definition - queryNorm(q) is just a normalizing factor used to make
 scores between queries comparable. This factor does not affect
 document ranking (since all ranked documents are multiplied by the
 same factor), but rather just attempts to make scores from different
 queries (or even different indexes) comparable

 Regards,
 Jayendra

 On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk 
 wrote:
 Hi,

 I am seeing an issue I do not understand and hope that someone can shed 
 some light on this. The issue is that for a particular search we are 
 seeing a particular result rank in position 3 on one machine and position 
 8 on the production machine. The position 3 is our desired and roughly 
 expected ranking.

 I have a local machine with solr and a version deployed on a production 
 server. My local machine's solr and the production version are both 
 checked out from our project's SVN trunk. They are identical files except 
 for the data files (not in SVN) and database connection settings.

 The index is populated exclusively via data import handler queries to a 
 database.

 I have exported the production database as-is to my local development 
 machine so that my local machine and production have access to the self 
 same data.

 I execute a total full-import on both.

 Still, I see a different position for this document that should surely 
 rank in the same location, all else being equal.

 I ran debugQuery diff to see how the scores were being computed. See 
 appendix at foot of this email.

 As far as I can tell every single query normalisation block of the debug 
 is marginally different, e.g.

 -        0.021368012 = queryNorm (local)
 +        0.009944122 = queryNorm (production)

 Which leads to a final score of -2 versus +1 which is enough to skew the 
 results from correct to incorrect (in terms of what we expect to see).

 - -2.286596 (local)
 +1.0651637 = (production)

 I cannot explain this difference. The database is the same. The 
 configuration is the same. I have fully imported from scratch on both 
 servers. What am I missing?

 Thank you for your time

 Allistair

 - snip

 APPENDIX - debugQuery=on DIFF

 --- untitled
 +++ (clipboard)
 @@ -1,51 +1,49 @@
 -str name=L12411p
 +str name=L12411

 -2.286596 = (MATCH) sum of:
 -  1.6891675 = (MATCH) sum of:
 -    1.3198489 = (MATCH) max plus 0.01 times others of:
 -      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
 -        0.011795795 = queryWeight(text:dubai^0.1), product of:
 -          0.1 = boost
 +1.0651637 = (MATCH) sum of:
 +  0.7871359 = (MATCH) sum of:
 +    0.6151879 = (MATCH) max plus 0.01 times others of:
 +      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
 +        0.05489459 = queryWeight(text:dubai), product of:
           5.520305 = idf(docFreq=65, maxDocs=6063)
 -          0.021368012 = queryNorm
 +          0.009944122 = queryNorm
         1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
           1.4142135 = tf(termFreq(text:dubai)=2)
           5.520305 = idf(docFreq=65, maxDocs=6063)
           0.25 = fieldNorm(field=text, doc=1551)
 -      1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
 -        0.32609802 = queryWeight(profile:dubai^2.0), product of:
 +      0.6141165 = (MATCH) 

Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Allistair Crossley
Oh wow, how did I miss that?

My apologies to anyone who read this post. I should have diffed my custom 
dismax handler. Looks like my SVN merge didn't work properly.

Embarrassing.

Thanks everyone ;)

On Mar 9, 2011, at 4:51 PM, Yonik Seeley wrote:

 On Wed, Mar 9, 2011 at 4:49 PM, Jayendra Patil
 jayendra.patil@gmail.com wrote:
 Are you sure you have the same config ...
 The boost seems different for the field text - text:dubai^0.1 vs text:dubai
 
 Yep...
 Try adding echoParams=all and see all the parameters solr is acting on.
 http://wiki.apache.org/solr/CoreQueryParameters#echoParams
 
 -Yonik
 http://lucidimagination.com
 
 
 -2.286596 = (MATCH) sum of:
 -  1.6891675 = (MATCH) sum of:
 -1.3198489 = (MATCH) max plus 0.01 times others of:
 -  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
 -0.011795795 = queryWeight(text:dubai^0.1), product of:
 -  0.1 = boost
 +1.0651637 = (MATCH) sum of:
 +  0.7871359 = (MATCH) sum of:
 +0.6151879 = (MATCH) max plus 0.01 times others of:
 +  0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
 +0.05489459 = queryWeight(text:dubai), product of:
 
 Regards,
 Jayendra
 
 On Wed, Mar 9, 2011 at 4:38 PM, Allistair Crossley a...@roxxor.co.uk wrote:
 Thanks. Good to know, but even so my problem remains - the end score should 
 not be different and is causing a dramatically different ranking of a 
 document (3 versus 7 is dramatic for my client). This must be down to the 
 scoring debug differences - it's the only difference I can find :(
 
 On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote:
 
 queryNorm is just a normalizing factor and is the same value across
 all the results for a query, to just make the scores comparable.
 So even if it varies in different environments, you should not be worried 
 about it.
 
 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
 -
 Definition - queryNorm(q) is just a normalizing factor used to make
 scores between queries comparable. This factor does not affect
 document ranking (since all ranked documents are multiplied by the
 same factor), but rather just attempts to make scores from different
 queries (or even different indexes) comparable
 
 Regards,
 Jayendra
 
 On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk 
 wrote:
 Hi,
 
 I am seeing an issue I do not understand and hope that someone can shed 
 some light on this. The issue is that for a particular search we are 
 seeing a particular result rank in position 3 on one machine and position 
 8 on the production machine. The position 3 is our desired and roughly 
 expected ranking.
 
 I have a local machine with solr and a version deployed on a production 
 server. My local machine's solr and the production version are both 
 checked out from our project's SVN trunk. They are identical files except 
 for the data files (not in SVN) and database connection settings.
 
 The index is populated exclusively via data import handler queries to a 
 database.
 
 I have exported the production database as-is to my local development 
 machine so that my local machine and production have access to the self 
 same data.
 
 I execute a total full-import on both.
 
 Still, I see a different position for this document that should surely 
 rank in the same location, all else being equal.
 
 I ran debugQuery diff to see how the scores were being computed. See 
 appendix at foot of this email.
 
 As far as I can tell every single query normalisation block of the debug 
 is marginally different, e.g.
 
 -0.021368012 = queryNorm (local)
 +0.009944122 = queryNorm (production)
 
 Which leads to a final score of -2 versus +1 which is enough to skew the 
 results from correct to incorrect (in terms of what we expect to see).
 
 - -2.286596 (local)
 +1.0651637 = (production)
 
 I cannot explain this difference. The database is the same. The 
 configuration is the same. I have fully imported from scratch on both 
 servers. What am I missing?
 
 Thank you for your time
 
 Allistair
 
 - snip
 
 APPENDIX - debugQuery=on DIFF
 
 --- untitled
 +++ (clipboard)
 @@ -1,51 +1,49 @@
 -str name=L12411p
 +str name=L12411
 
 -2.286596 = (MATCH) sum of:
 -  1.6891675 = (MATCH) sum of:
 -1.3198489 = (MATCH) max plus 0.01 times others of:
 -  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
 -0.011795795 = queryWeight(text:dubai^0.1), product of:
 -  0.1 = boost
 +1.0651637 = (MATCH) sum of:
 +  0.7871359 = (MATCH) sum of:
 +0.6151879 = (MATCH) max plus 0.01 times others of:
 +  0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
 +0.05489459 = queryWeight(text:dubai), product of:
   5.520305 = idf(docFreq=65, maxDocs=6063)
 -  0.021368012 = queryNorm
 +  0.009944122 = queryNorm
 1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
   1.4142135 = 

Math-generated fields during query

2011-03-09 Thread Peter Sturge
Hi,

I was wondering if it is possible during a query to create a returned
field 'on the fly' (like function query, but for concrete values, not
score).

For example, if I input this query:
   q=_val_:product(15,3)&fl=*,score

For every returned document, I get score = 45.

If I change it slightly to add *:* like this:
   q=*:* _val_:product(15,3)&fl=*,score

I get score = 32.526913.

If I try my use case of _val_:product(qty_ordered,unit_price), I get
varying scores depending on...well depending on something.

I understand this is doing relevance scoring, but it doesn't seem to
tally with the FunctionQuery Wiki
[example at the bottom of the page]:

   q=boxname:findbox+_val_:product(product(x,y),z)&fl=*,score
...where score will contain the resultant volume.

Is there a trick to getting not a score, but the actual value of
quantity*price (e.g. product(5,2.21) == 11.05)?
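For what it's worth, the score of a function query is itself run through query normalization, and adding another clause such as *:* also brings the coord factor into play, which would explain why 15*3 only comes back as exactly 45 when the function query is the sole clause. Two things worth trying (sketches only; the "vol" alias is arbitrary, and pseudo-fields need Solr 4.x or later, not 1.4/3.x):

```
# Keep the function query as the only clause if the score must equal the raw value:
q=_val_:"product(product(x,y),z)"&fl=*,score

# On Solr 4.x and later, return the value directly as a pseudo-field in fl
# and ignore the score entirely:
q=boxname:findbox&fl=*,vol:product(product(x,y),z)
```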

Many thanks


Re: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Smiley, David W.
I was just about to jump in this conversation to mention Solandra and go fig, 
Solandra's committer comes in. :-)   It was nice to meet you at Strata, Jake.

I haven't dug into the code yet but Solandra strikes me as a killer way to 
scale Solr. I'm looking forward to playing with it; particularly looking at 
disk requirements and performance measurements.

~ David Smiley

On Mar 9, 2011, at 3:14 PM, Jake Luciani wrote:

 Hi Otis,
 
 Have you considered using Solandra with Quorum writes
 to achieve master/master with CA semantics?
 
 -Jake
 
 
 On Wed, Mar 9, 2011 at 2:48 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:
 
 Hi,
 
  Original Message 
 
 From: Robert Petersen rober...@buy.com
 
 Can't you skip the SAN and keep the indexes locally?  Then you  would
 have two redundant copies of the index and no lock issues.
 
 I could, but then I'd have the issue of keeping them in sync, which seems
 more
 fragile.  I think SAN makes things simpler overall.
 
 Also, Can't master02 just be a slave to master01 (in the master farm  and
 separate from the slave farm) until such time as master01 fails?   Then
 
 No, because it wouldn't be in sync.  It would always be N minutes behind,
 and
 when the primary master fails, the secondary would not have all the docs -
 data
 loss.
 
 master02 would start receiving the new documents with an  indexes
 complete up to the last replication at least and the other slaves  would
 be directed by LB to poll master02 also...
 
 Yeah, complete up to the last replication is the problem.  It's a data
 gap
 that now needs to be filled somehow.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 -Original  Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Sent: Wednesday, March 09, 2011 9:47 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: True master-master fail-over without data gaps (choosing CA
 in  CAP)
 
 Hi,
 
 
 - Original Message 
 From: Walter  Underwood wun...@wunderwood.org
 
 On  Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:
 
 You mean  it's  not possible to have 2 masters that are in nearly
 real-time
 sync?
 How  about with DRBD?  I know people use  DRBD to keep 2 Hadoop NNs
 (their
 edit
 
 logs) in  sync to avoid the current NN SPOF, for example, so I'm
 thinking
 this
 
 could be doable with Solr masters, too, no?
 
 If you add fault-tolerant, you run into the CAP  Theorem.  Consistency,
 
 availability, partition: choose two. You cannot have  it  all.
 
 Right, so I'll take Consistency and Availability, and I'll  put my 2
 masters in
 the same rack (which has redundant switches, power  supply, etc.) and
 thus
 minimize/avoid partitioning.
 Assuming the above  actually works, I think my Q remains:
 
 How do you set up 2 Solr masters so  they are in near real-time sync?
 DRBD?
 
 But here is maybe a simpler  scenario that more people may be
 considering:
 
 Imagine 2 masters on 2  different servers in 1 rack, pointing to the same
 index
 on the shared  storage (SAN) that also happens to live in the same rack.
 2 Solr masters are  behind 1 LB VIP that indexer talks to.
 The VIP is configured so that all  requests always get routed to the
 primary
 master (because only 1 master  can be modifying an index at a time),
 except when
 this primary is down,  in which case the requests are sent to the
 secondary
 master.
 
 So in  this case my Q is around automation of this, around Lucene index
 locks,
 around the need for manual intervention, and such.
 Concretely, if you  have these 2 master instances, the primary master has
 the
 Lucene index  lock in the index dir.  When the secondary master needs to
 take
 over  (i.e., when it starts receiving documents via LB), it needs to be
 able to
 write to that same index.  But what if that lock is still around?   One
 could use
 the Native lock to make the lock disappear if the primary  master's JVM
 exited
 unexpectedly, and in that case everything *should*  work and be
 completely
 transparent, right?  That is, the secondary  will start getting new docs,
 it will
 use its IndexWriter to write to that  same shared index, which won't be
 locked
 for writes because the lock is  gone, and everyone will be happy.  Did I
 miss
 something important  here?
 
 Assuming the above is correct, what if the lock is *not* gone  because
 the
 primary master's JVM is actually not dead, although maybe  unresponsive,
 so LB
 thinks the primary master is dead.  Then the LB  will route indexing
 requests to
 the secondary master, which will attempt  to write to the index, but be
 denied
 because of the lock.  So a  human needs to jump in, remove the lock, and
 manually
 reindex failed docs  if the upstream component doesn't buffer docs that
 failed to
 get indexed  and doesn't retry indexing them automatically.  Is this
 correct or
 is there a way to avoid humans  here?
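One way to take the human out of the loop is to make the upstream indexer buffer and retry failed documents itself; a rough sketch of that pattern (plain Python, nothing Solr-specific; send_to_master stands in for whatever client call posts the docs through the LB VIP):

```python
import collections
import time

def index_with_retry(docs, send_to_master, max_attempts=5, backoff_s=2.0):
    """Buffer documents that fail to index and retry them, so a failover
    window does not turn into silent data loss."""
    pending = collections.deque(docs)
    attempts = 0
    while pending and attempts < max_attempts:
        failed = []
        while pending:
            doc = pending.popleft()
            try:
                send_to_master(doc)           # e.g. an HTTP POST via the LB VIP
            except Exception:
                failed.append(doc)            # keep it buffered for the next pass
        if not failed:
            break
        attempts += 1
        time.sleep(backoff_s * attempts)      # back off while the LB fails over
        pending.extend(failed)
    return list(pending)                      # anything left over needs escalation
```

Whether the secondary master can safely take the write lock is a separate problem, but at least the documents indexed during the failover window are never silently dropped.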
 
 Thanks,
 Otis
 
 Sematext :: 

Re: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Jason Rutherglen
Doesn't Solandra partition by term instead of document?

On Wed, Mar 9, 2011 at 2:13 PM, Smiley, David W. dsmi...@mitre.org wrote:
 I was just about to jump in this conversation to mention Solandra and go fig, 
 Solandra's committer comes in. :-)   It was nice to meet you at Strata, Jake.

 I haven't dug into the code yet but Solandra strikes me as a killer way to 
 scale Solr. I'm looking forward to playing with it; particularly looking at 
 disk requirements and performance measurements.

 ~ David Smiley

 On Mar 9, 2011, at 3:14 PM, Jake Luciani wrote:

 Hi Otis,

 Have you considered using Solandra with Quorum writes
 to achieve master/master with CA semantics?

 -Jake


 On Wed, Mar 9, 2011 at 2:48 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Hi,

  Original Message 

 From: Robert Petersen rober...@buy.com

 Can't you skip the SAN and keep the indexes locally?  Then you  would
 have two redundant copies of the index and no lock issues.

 I could, but then I'd have the issue of keeping them in sync, which seems
 more
 fragile.  I think SAN makes things simpler overall.

 Also, Can't master02 just be a slave to master01 (in the master farm  and
 separate from the slave farm) until such time as master01 fails?   Then

 No, because it wouldn't be in sync.  It would always be N minutes behind,
 and
 when the primary master fails, the secondary would not have all the docs -
 data
 loss.

 master02 would start receiving the new documents with an  indexes
 complete up to the last replication at least and the other slaves  would
 be directed by LB to poll master02 also...

 Yeah, complete up to the last replication is the problem.  It's a data
 gap
 that now needs to be filled somehow.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/


 -Original  Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Sent: Wednesday, March 09, 2011 9:47 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: True master-master fail-over without data gaps (choosing CA
 in  CAP)

 Hi,


 - Original Message 
 From: Walter  Underwood wun...@wunderwood.org

 On  Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:

 You mean  it's  not possible to have 2 masters that are in nearly
 real-time
 sync?
 How  about with DRBD?  I know people use  DRBD to keep 2 Hadoop NNs
 (their
 edit

 logs) in  sync to avoid the current NN SPOF, for example, so I'm
 thinking
 this

 could be doable with Solr masters, too, no?

 If you add fault-tolerant, you run into the CAP  Theorem.  Consistency,

 availability, partition: choose two. You cannot have  it  all.

 Right, so I'll take Consistency and Availability, and I'll  put my 2
 masters in
 the same rack (which has redundant switches, power  supply, etc.) and
 thus
 minimize/avoid partitioning.
 Assuming the above  actually works, I think my Q remains:

 How do you set up 2 Solr masters so  they are in near real-time sync?
 DRBD?

 But here is maybe a simpler  scenario that more people may be
 considering:

 Imagine 2 masters on 2  different servers in 1 rack, pointing to the same
 index
 on the shared  storage (SAN) that also happens to live in the same rack.
 2 Solr masters are  behind 1 LB VIP that indexer talks to.
 The VIP is configured so that all  requests always get routed to the
 primary
 master (because only 1 master  can be modifying an index at a time),
 except when
 this primary is down,  in which case the requests are sent to the
 secondary
 master.

 So in  this case my Q is around automation of this, around Lucene index
 locks,
 around the need for manual intervention, and such.
 Concretely, if you  have these 2 master instances, the primary master has
 the
 Lucene index  lock in the index dir.  When the secondary master needs to
 take
 over  (i.e., when it starts receiving documents via LB), it needs to be
 able to
 write to that same index.  But what if that lock is still around?   One
 could use
 the Native lock to make the lock disappear if the primary  master's JVM
 exited
 unexpectedly, and in that case everything *should*  work and be
 completely
 transparent, right?  That is, the secondary  will start getting new docs,
 it will
 use its IndexWriter to write to that  same shared index, which won't be
 locked
 for writes because the lock is  gone, and everyone will be happy.  Did I
 miss
 something important  here?

 Assuming the above is correct, what if the lock is *not* gone  because
 the
 primary master's JVM is actually not dead, although maybe  unresponsive,
 so LB
 thinks the primary master is dead.  Then the LB  will route indexing
 requests to
 the secondary master, which will attempt  to write to the index, but be
 denied
 because of the lock.  So a  human needs to jump in, remove the lock, and
 manually
 reindex failed docs  if the upstream component doesn't buffer docs that
 failed to
 get indexed  and doesn't retry indexing them 

Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Jonathan Rochkind
Wait, if you don't have identical indexes, then why would you expect 
identical results?


If your indexes are different, one would expect the results for the same 
query to be different -- there are different documents in the index!   
The iDF portion of the TF/iDF type algorithm at the base of Solr's 
relevancy will also be different in different indexes. 
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
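The idf half of that is easy to see numerically; a toy sketch using Lucene DefaultSimilarity's idf formula (the 65/6063 numbers come from the debug output quoted earlier in this thread; the second corpus size is invented):

```python
import math

def idf(doc_freq, max_docs):
    # Lucene DefaultSimilarity: idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# "dubai" in the debug output: idf(docFreq=65, maxDocs=6063) = 5.520305
local_idf = idf(65, 6063)
assert abs(local_idf - 5.520305) < 1e-4

# The same term in a differently sized index carries a different weight, so the
# same query can score (and rank) the same document differently:
other_idf = idf(65, 9000)
assert local_idf != other_idf
```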


Maybe I'm misunderstanding you.  But if you have different indexes -- 
not exactly the same collection of documents indexed using exactly the 
same field definitions and rules -- then one should expect different 
relevance results.


Jonathan

On 3/9/2011 4:48 PM, Allistair Crossley wrote:

That's what I think, glad I am not going mad.

I've spent 1/2 a day comparing the config files, checking out from SVN again 
and ensuring the databases are identical. I cannot see what else I can do to 
make them equivalent. Both servers checkout directly from SVN, I am convinced 
the files are the same. The database is definitely the same.

Not sure what you mean about having identical indices - that's my problem - I 
don't - or do you mean something else I've missed? But yes everything else you 
mention is identical, I am as certain as I can be.

I too think there must be a difference I have missed but I have run out of 
ideas for what to check!

Frustrating :)

On Mar 9, 2011, at 4:38 PM, Jonathan Rochkind wrote:


Yes, but the identical index with the identical solrconfig.xml and the 
identical query and the identical version of Solr on two different machines 
should produce identical results.

So it's a legitimate question why it's not.  But perhaps queryNorm isn't enough 
to answer that. Sorry, it's out of my league to try and figure it out.

But are you absolutely sure you have identical indexes, identical 
solrconfig.xml, identical queries, and identical versions of Solr and any other 
installed Java libraries... on both machines?  One of these being different 
seems more likely than a bug in Solr, although that's possible.

On 3/9/2011 4:34 PM, Jayendra Patil wrote:

queryNorm is just a normalizing factor and is the same value across
all the results for a query, to just make the scores comparable.
So even if it varies in different environments, you should not be worried about it.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
-
Definition - queryNorm(q) is just a normalizing factor used to make
scores between queries comparable. This factor does not affect
document ranking (since all ranked documents are multiplied by the
same factor), but rather just attempts to make scores from different
queries (or even different indexes) comparable
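To illustrate the "same factor for all documents" point, here is a toy sketch (assumed term weights, not Lucene internals) of how queryNorm is derived and why multiplying every score by it cannot change the ranking:

```java
public class QueryNormSketch {
    // queryNorm(q) = 1 / sqrt(sum of squared query term weights)
    static double queryNorm(double[] termWeights) {
        double sumSq = 0.0;
        for (double w : termWeights) {
            sumSq += w * w;
        }
        return 1.0 / Math.sqrt(sumSq);
    }

    public static void main(String[] args) {
        double norm = queryNorm(new double[]{3.0, 4.0}); // 1 / sqrt(25) = 0.2
        // Two raw document scores; both get multiplied by the same norm,
        // so their relative order within one query cannot change.
        double docA = 2.5 * norm;
        double docB = 1.5 * norm;
        System.out.println(docA > docB);
    }
}
```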

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossleya...@roxxor.co.uk   wrote:

Hi,

I am seeing an issue I do not understand and hope that someone can shed some 
light on this. The issue is that for a particular search we are seeing a 
particular result rank in position 3 on one machine and position 8 on the 
production machine. The position 3 is our desired and roughly expected ranking.

I have a local machine with solr and a version deployed on a production server. 
My local machine's solr and the production version are both checked out from 
our project's SVN trunk. They are identical files except for the data files 
(not in SVN) and database connection settings.

The index is populated exclusively via data import handler queries to a 
database.

I have exported the production database as-is to my local development machine 
so that my local machine and production have access to the self same data.

I execute a total full-import on both.

Still, I see a different position for this document that should surely rank in 
the same location, all else being equal.

I ran debugQuery diff to see how the scores were being computed. See appendix 
at foot of this email.

As far as I can tell every single query normalisation block of the debug is 
marginally different, e.g.

-0.021368012 = queryNorm (local)
+0.009944122 = queryNorm (production)

Which leads to a final score of -2 versus +1 which is enough to skew the 
results from correct to incorrect (in terms of what we expect to see).

- -2.286596 (local)
+1.0651637 = (production)

I cannot explain this difference. The database is the same. The configuration 
is the same. I have fully imported from scratch on both servers. What am I 
missing?

Thank you for your time

Allistair

- snip

APPENDIX - debugQuery=on DIFF

--- untitled
+++ (clipboard)
@@ -1,51 +1,49 @@
-str name=L12411p
+str name=L12411

-2.286596 = (MATCH) sum of:
-  1.6891675 = (MATCH) sum of:
-1.3198489 = (MATCH) max plus 0.01 times others of:
-  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-0.011795795 = queryWeight(text:dubai^0.1), product of:
-  0.1 = boost
+1.0651637 = (MATCH) sum of:
+  0.7871359 = (MATCH) sum of:
+0.6151879 = 

Re: NRT in Solr

2011-03-09 Thread Smiley, David W.
Zoie adds NRT to Solr:
http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin

I haven't tried it yet but looks cool.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Mar 9, 2011, at 9:01 AM, Jason Rutherglen wrote:

 Jae,
 
 NRT hasn't been implemented as of yet in Solr, I think partially
 because major features such as replication, caching, and uninverted
 faceting suddenly are no longer viable, eg, it's another round of
 testing etc.  It's doable, however I think the best approach is a
 separate request call path, to avoid altering the current [working]
 API.
 
 On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo jaejo...@gmail.com wrote:
 Hi,
 Is NRT in Solr 4.0 from trunk? I have checked out from trunk, but could not
 find the configuration for NRT.
 
 Regards
 
 Jae
 







Re: NRT in Solr

2011-03-09 Thread Jonathan Rochkind
Interesting, does anyone have a summary of what techniques zoie uses to 
do this?  I don't see any docs on the technical details.


On 3/9/2011 5:29 PM, Smiley, David W. wrote:

Zoie adds NRT to Solr:
http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin

I haven't tried it yet but looks cool.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Mar 9, 2011, at 9:01 AM, Jason Rutherglen wrote:


Jae,

NRT hasn't been implemented as of yet in Solr, I think partially
because major features such as replication, caching, and uninverted
faceting suddenly are no longer viable, eg, it's another round of
testing etc.  It's doable, however I think the best approach is a
separate request call path, to avoid altering the current [working]
API.

On Tue, Mar 8, 2011 at 1:27 PM, Jae Joojaejo...@gmail.com  wrote:

Hi,
Is NRT in Solr 4.0 from trunk? I have checked out from trunk, but could not
find the configuration for NRT.

Regards

Jae









Re: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Jake Luciani
Jason,

Its predecessor did, Lucandra. But Solandra is a new approach that manages 
shards of documents across the cluster for you and uses Solr's distributed 
search to query indexes. 

Jake

On Mar 9, 2011, at 5:15 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

 Doesn't Solandra partition by term instead of document?
 
 On Wed, Mar 9, 2011 at 2:13 PM, Smiley, David W. dsmi...@mitre.org wrote:
 I was just about to jump in this conversation to mention Solandra and, go 
 figure, Solandra's committer comes in. :-)   It was nice to meet you at Strata, 
 Jake.
 
 I haven't dug into the code yet but Solandra strikes me as a killer way to 
 scale Solr. I'm looking forward to playing with it; particularly looking at 
 disk requirements and performance measurements.
 
 ~ David Smiley
 
 On Mar 9, 2011, at 3:14 PM, Jake Luciani wrote:
 
 Hi Otis,
 
 Have you considered using Solandra with Quorum writes
 to achieve master/master with CA semantics?
 
 -Jake
 
 
 On Wed, Mar 9, 2011 at 2:48 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:
 
 Hi,
 
  Original Message 
 
 From: Robert Petersen rober...@buy.com
 
 Can't you skip the SAN and keep the indexes locally?  Then you  would
 have two redundant copies of the index and no lock issues.
 
 I could, but then I'd have the issue of keeping them in sync, which seems
 more
 fragile.  I think SAN makes things simpler overall.
 
 Also, Can't master02 just be a slave to master01 (in the master farm  and
 separate from the slave farm) until such time as master01 fails?   Then
 
 No, because it wouldn't be in sync.  It would always be N minutes behind,
 and
 when the primary master fails, the secondary would not have all the docs -
 data
 loss.
 
 master02 would start receiving the new documents with an  indexes
 complete up to the last replication at least and the other slaves  would
 be directed by LB to poll master02 also...
 
 Yeah, complete up to the last replication is the problem.  It's a data
 gap
 that now needs to be filled somehow.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 -Original  Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Sent: Wednesday, March 09, 2011 9:47 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: True master-master fail-over without data gaps (choosing CA
 in  CAP)
 
 Hi,
 
 
 - Original Message 
 From: Walter  Underwood wun...@wunderwood.org
 
 On  Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:
 
 You mean  it's  not possible to have 2 masters that are in nearly
 real-time
 sync?
 How  about with DRBD?  I know people use  DRBD to keep 2 Hadoop NNs
 (their
 edit
 
 logs) in  sync to avoid the current NN SPOF, for example, so I'm
 thinking
 this
 
 could be doable with Solr masters, too, no?
 
 If you add fault-tolerant, you run into the CAP  Theorem.  Consistency,
 
 availability, partition: choose two. You cannot have  it  all.
 
 Right, so I'll take Consistency and Availability, and I'll  put my 2
 masters in
 the same rack (which has redundant switches, power  supply, etc.) and
 thus
 minimize/avoid partitioning.
 Assuming the above  actually works, I think my Q remains:
 
 How do you set up 2 Solr masters so  they are in near real-time sync?
 DRBD?
 
 But here is maybe a simpler  scenario that more people may be
 considering:
 
 Imagine 2 masters on 2  different servers in 1 rack, pointing to the same
 index
 on the shared  storage (SAN) that also happens to live in the same rack.
 2 Solr masters are  behind 1 LB VIP that indexer talks to.
 The VIP is configured so that all  requests always get routed to the
 primary
 master (because only 1 master  can be modifying an index at a time),
 except when
 this primary is down,  in which case the requests are sent to the
 secondary
 master.
 
 So in  this case my Q is around automation of this, around Lucene index
 locks,
 around the need for manual intervention, and such.
 Concretely, if you  have these 2 master instances, the primary master has
 the
 Lucene index  lock in the index dir.  When the secondary master needs to
 take
 over  (i.e., when it starts receiving documents via LB), it needs to be
 able to
 write to that same index.  But what if that lock is still around?   One
 could use
 the Native lock to make the lock disappear if the primary  master's JVM
 exited
 unexpectedly, and in that case everything *should*  work and be
 completely
 transparent, right?  That is, the secondary  will start getting new docs,
 it will
 use its IndexWriter to write to that  same shared index, which won't be
 locked
 for writes because the lock is  gone, and everyone will be happy.  Did I
 miss
 something important  here?
 
 Assuming the above is correct, what if the lock is *not* gone  because
 the
 primary master's JVM is actually not dead, although maybe  unresponsive,
 so LB
 thinks the primary master is dead.  Then the LB  will route indexing
 

Re: NRT in Solr

2011-03-09 Thread Otis Gospodnetic
Jonathan, they have a wiki up there somewhere, including pretty diagrams.  If 
you have Lucene in Action, Zoie is one of the case studies and is described in a 
lot of detail.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Jonathan Rochkind rochk...@jhu.edu
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Cc: Smiley, David W. dsmi...@mitre.org
 Sent: Wed, March 9, 2011 5:34:01 PM
 Subject: Re: NRT in Solr
 
 Interesting, does anyone have a summary of what techniques zoie uses to 
 do  this?  I don't see any docs on the technical details.
 
 On 3/9/2011  5:29 PM, Smiley, David W. wrote:
  Zoie adds NRT to Solr:
  http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin
 
   I haven't tried it yet but looks cool.
 
  ~ David Smiley
   Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
 
   On Mar 9, 2011, at 9:01 AM, Jason Rutherglen wrote:
 
   Jae,
 
  NRT hasn't been implemented NRT as of yet in Solr,  I think partially
  because major features such as replication,  caching, and uninverted
  faceting suddenly are no longer viable, eg,  it's another round of
  testing etc.  It's doable, however I  think the best approach is a
  separate request call path, to avoid  altering to current [working]
  API.
 
  On Tue,  Mar 8, 2011 at 1:27 PM, Jae Joojaejo...@gmail.com   wrote:
  Hi,
  Is NRT in Solr 4.0 from trunk? I have  checkouted from Trunk, but could 
not
  find the configuration for  NRT.
 
  Regards
 
   Jae
 
 
 
 
 
 
 


Re: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Otis Gospodnetic
Jake,

Maybe it's time to come up with the Solandra/Solr matrix so we can see 
Solandra's strengths (e.g. RT, no replication) and weaknesses (e.g. I think I 
saw a mention of some big indices?) or missing feature (e.g. no delete by 
query), etc.

Thanks!
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Jake Luciani jak...@gmail.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 6:04:13 PM
 Subject: Re: True master-master fail-over without data gaps (choosing CA in 
CAP)
 
 Jason,
 
 It's predecessor did, Lucandra. But Solandra is a new approach  that manages 
shards of documents across the cluster for you and uses solrs  distributed 
search to query indexes. 

 
 Jake
 
 On Mar 9, 2011, at 5:15  PM, Jason Rutherglen jason.rutherg...@gmail.com  
wrote:
 
  Doesn't Solandra partition by term instead of  document?
  
  On Wed, Mar 9, 2011 at 2:13 PM, Smiley, David W.  dsmi...@mitre.org wrote:
  I  was just about to jump in this conversation to mention Solandra and go 
fig,  Solandra's committer comes in. :-)   It was nice to meet you at Strata,  
Jake.
  
  I haven't dug into the code yet but Solandra  strikes me as a killer way 
  to 
scale Solr. I'm looking forward to playing with  it; particularly looking at 
disk requirements and performance  measurements.
  
  ~ David Smiley
  
   On Mar 9, 2011, at 3:14 PM, Jake Luciani wrote:
  
  Hi  Otis,
  
  Have you considered using Solandra with  Quorum writes
  to achieve master/master with CA  semantics?
  
  -Jake
  
  
  On Wed, Mar 9, 2011 at 2:48 PM, Otis  Gospodnetic 
otis_gospodne...@yahoo.com
   wrote:
  
  Hi,
  
   Original Message 
  
  From: Robert Petersen rober...@buy.com
  
  Can't you skip the SAN and keep the indexes  locally?  Then you  would
  have two redundant  copies of the index and no lock issues.
  
   I could, but then I'd have the issue of keeping them in sync, which  
seems
  more
  fragile.  I think SAN  makes things simpler overall.
  
  Also,  Can't master02 just be a slave to master01 (in the master farm   
and
  separate from the slave farm) until such time as  master01 fails?   Then
  
  No, because  it wouldn't be in sync.  It would always be N minutes  
behind,
  and
  when the primary master  fails, the secondary would not have all the 
  docs 
-
   data
  loss.
  
   master02 would start receiving the new documents with an   indexes
  complete up to the last replication at least and  the other slaves  
would
  be directed by LB to poll  master02 also...
  
  Yeah, complete up to  the last replication is the problem.  It's a data
   gap
  that now needs to be filled somehow.
  
  Otis
  
  Sematext  :: http://sematext.com/ ::  Solr - Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/
  
  
  -Original   Message-
  From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
   Sent: Wednesday, March 09, 2011 9:47 AM
  To: solr-user@lucene.apache.org
   Subject:  Re: True master-master fail-over without data gaps (choosing 
   
CA
  in  CAP)
  
  Hi,
  
  
  - Original Message 
   From: Walter  Underwood wun...@wunderwood.org
  
  On  Mar 9, 2011, at 9:02 AM, Otis Gospodnetic  wrote:
  
  You  mean  it's  not possible to have 2 masters that are in  nearly
  real-time
   sync?
  How  about with DRBD?  I know  people use  DRBD to keep 2 Hadoop NNs
   (their
  edit
  
  logs) in  sync to avoid the current NN  SPOF, for example, so I'm
   thinking
  this
  
  could be doable with Solr masters, too,  no?
  
  If you add  fault-tolerant, you run into the CAP  Theorem.   
Consistency,
  
  availability,  partition: choose two. You cannot have  it   all.
  
  Right, so I'll take  Consistency and Availability, and I'll  put my 2
   masters in
  the same rack (which has redundant switches,  power  supply, etc.) and
   thus
  minimize/avoid  partitioning.
  Assuming the above  actually works, I  think my Q remains:
  
  How do you  set up 2 Solr masters so  they are in near real-time  sync?
  DRBD?
  
  But here is maybe a simpler  scenario that more  people may be
  considering:
  
  Imagine 2 masters on 2  different servers in 1  rack, pointing to the 
same
  index
   on the shared  storage (SAN) that also happens to live in the same  
rack.
  2 Solr masters are  behind 1 LB VIP that  indexer talks to.
  The VIP is configured so that  all  requests always get routed to the
   primary
  master (because only 1 master  can be  modifying an index at a time),
  except  when
  this primary is down,  in which case the  requests are sent to the
   secondary
  master.
  
  So in  this case my Q is around automation of  this, around Lucene index
  locks,
   around the need for manual intervention, and such.
   Concretely, if you  have these 2 master instances, the primary master  
has
  the
  Lucene index  lock  in the index dir.  When the secondary 

Re: Fwd: some relational-type grouping with search

2011-03-09 Thread Michael Sokolov
Probably you can just sort by date (one way and then the other) and 
limit your result set to a single document.  That should free up enough 
budget for the bonuses of the highly-placed people, I think :)


On 3/9/2011 4:05 PM, l.blev...@comcast.net wrote:

- Forwarded Message -
From: l blevinsl.blev...@comcast.net
To: solr user mailsolr-user-h...@lucene.apache.org
Sent: Wednesday, March 9, 2011 4:03:06 PM
Subject: some relational-type grouping with search




I have a large database for which we have some good search capabilities now, but 
am interested to see if SOLR might be usable instead.  That would gain us the 
additional text-search features and eliminate the high fees for some of the 
database features.



If I have fields such as person_id, document_date, and measurement_value, I 
need to be able to fulfill the following types of searches that I cannot figure 
out how to do now:



* limit search to only the most recent (or earliest) document per person 
along with whatever other criteria is present (each person's LAST or FIRST 
document),

* search and only return the most recent document per person (LAST or FIRST 
meeting the other criteria),

* limit search to only the documents with the max or min measurement_value 
per person,
* search and return only the max or min measurement_value per person



All of these boil down to limiting by the max or min of either a date or 
numeric field within a group (by person in this case).  I know these features 
are considered relational and that SOLR has declared that it is not really a 
relational search engine, but a number of highly placed persons that I work for 
are very interested in using SOLR.  If we could satisfy this type of query, 
SOLR could fit our needs so I feel compelled to ask this group if these 
searches are possible.
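For what it's worth, the result grouping / field collapsing work on Solr's trunk is aimed at exactly this top-1-per-group pattern. A sketch of the request parameters, under the assumption that the trunk grouping syntax (group.field, group.sort, group.limit) applies to your fields and may still change between versions:

```
# most recent document per person; other criteria go in q/fq as usual:
/select?q=...&group=true&group.field=person_id
        &group.sort=document_date desc&group.limit=1

# max measurement_value per person:
/select?q=...&group=true&group.field=person_id
        &group.sort=measurement_value desc&group.limit=1
```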




Re: some relational-type grouping with search

2011-03-09 Thread l . blevins


It is not just one document that would be returned; it is one document per 
person.  That is a little trickier. 


- Original Message - 
From: Michael Sokolov soko...@ifactory.com 
To: solr-user@lucene.apache.org 
Cc: l blevins l.blev...@comcast.net 
Sent: Wednesday, March 9, 2011 7:46:10 PM 
Subject: Re: Fwd: some relational-type grouping with search 

Probably you can just sort by date (one way and then the other) and 
limit your result set to a single document.  That should free up enough 
budget for the bonuses of the highly-placed people, I think :) 

On 3/9/2011 4:05 PM, l.blev...@comcast.net wrote: 
 - Forwarded Message - 
 From: l blevinsl.blev...@comcast.net 
 To: solr user mailsolr-user-h...@lucene.apache.org 
 Sent: Wednesday, March 9, 2011 4:03:06 PM 
 Subject: some relational-type grouping with search 
 
 
 
 
 I have a large database for which we have some good search capabilties now, 
 but am interested to see if SOLR might be usable instead.  That would gain us 
 the additional text-search features and eliminate the high fees for some of 
 the database features. 
 
 
 
 If I have fields such asperson_id,document_date, andmeasurement_value.  
 I need to be able to fullfil the following types of searches that I cannot 
 figure out how to do now: 
 
 
 
     * limit search to only the most recent (or earliest) document per person 
 along with whatever other criteria is present (each person's LAST or FIRST 
 document), 
 
     * search and only return the most recent document per person (LASTor 
 FIRST meeting the other criteria), 
 
     * limit search to only the documents with the max or 
 minmeasurement_value  per person, 
     * search and return only the max or minmeasurement_value  per person 
 
 
 
 All of these boil down to limiting by the max or min of either a date or 
 numeric field within a group (by person in this case).  I know these features 
 are considered relational and that SOLR has declared that it is not really a 
 relational search engine, but a number of highly placed persons that I work 
 for are very interested in using SOLR.  If we could satisfy this type of 
 query, SOLR could fit our needs so I feel compelled to ask this group if 
 these searches are possible. 



Re: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Jake Luciani
Yeah sure.  Let me update this on the Solandra wiki. I'll send across the
link

I think you hit the main two shortcomings atm.

-Jake

On Wed, Mar 9, 2011 at 6:17 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Jake,

 Maybe it's time to come up with the Solandra/Solr matrix so we can see
 Solandra's strengths (e.g. RT, no replication) and weaknesses (e.g. I think
 I
 saw a mention of some big indices?) or missing feature (e.g. no delete by
 query), etc.

 Thanks!
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
  From: Jake Luciani jak...@gmail.com
  To: solr-user@lucene.apache.org solr-user@lucene.apache.org
  Sent: Wed, March 9, 2011 6:04:13 PM
  Subject: Re: True master-master fail-over without data gaps (choosing CA
 in
 CAP)
 
  Jason,
 
  It's predecessor did, Lucandra. But Solandra is a new approach  that
 manages
 shards of documents across the cluster for you and uses solrs  distributed
 search to query indexes.
 
 
  Jake
 
  On Mar 9, 2011, at 5:15  PM, Jason Rutherglen 
 jason.rutherg...@gmail.com
 wrote:
 
   Doesn't Solandra partition by term instead of  document?
  
   On Wed, Mar 9, 2011 at 2:13 PM, Smiley, David W.  dsmi...@mitre.org
 wrote:
   I  was just about to jump in this conversation to mention Solandra and
 go
 fig,  Solandra's committer comes in. :-)   It was nice to meet you at
 Strata,
 Jake.
  
   I haven't dug into the code yet but Solandra  strikes me as a killer
 way to
 scale Solr. I'm looking forward to playing with  it; particularly looking
 at
 disk requirements and performance  measurements.
  
   ~ David Smiley
  
On Mar 9, 2011, at 3:14 PM, Jake Luciani wrote:
  
   Hi  Otis,
  
   Have you considered using Solandra with  Quorum writes
   to achieve master/master with CA  semantics?
  
   -Jake
  
  
   On Wed, Mar 9, 2011 at 2:48 PM, Otis  Gospodnetic
 otis_gospodne...@yahoo.com
wrote:
  
   Hi,
  
    Original Message 
  
   From: Robert Petersen rober...@buy.com
  
   Can't you skip the SAN and keep the indexes  locally?  Then you
  would
   have two redundant  copies of the index and no lock issues.
  
I could, but then I'd have the issue of keeping them in sync, which
 seems
   more
   fragile.  I think SAN  makes things simpler overall.
  
   Also,  Can't master02 just be a slave to master01 (in the master
 farm
 and
   separate from the slave farm) until such time as  master01 fails?
 Then
  
   No, because  it wouldn't be in sync.  It would always be N minutes
 behind,
   and
   when the primary master  fails, the secondary would not have all the
 docs
 -
data
   loss.
  
master02 would start receiving the new documents with an   indexes
   complete up to the last replication at least and  the other slaves
 would
   be directed by LB to poll  master02 also...
  
   Yeah, complete up to  the last replication is the problem.  It's a
 data
gap
   that now needs to be filled somehow.
  
   Otis
   
   Sematext  :: http://sematext.com/ ::  Solr - Lucene - Nutch
   Lucene ecosystem search :: http://search-lucene.com/
  
  
   -Original   Message-
   From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Wednesday, March 09, 2011 9:47 AM
   To: solr-user@lucene.apache.org
Subject:  Re: True master-master fail-over without data gaps
 (choosing
 CA
   in  CAP)
  
   Hi,
  
  
   - Original Message 
From: Walter  Underwood wun...@wunderwood.org
  
   On  Mar 9, 2011, at 9:02 AM, Otis Gospodnetic  wrote:
  
   You  mean  it's  not possible to have 2 masters that are in
  nearly
   real-time
sync?
   How  about with DRBD?  I know  people use  DRBD to keep 2 Hadoop
 NNs
(their
   edit
  
   logs) in  sync to avoid the current NN  SPOF, for example, so I'm
thinking
   this
  
   could be doable with Solr masters, too,  no?
  
   If you add  fault-tolerant, you run into the CAP  Theorem.
 Consistency,
  
   availability,  partition: choose two. You cannot have  it   all.
  
   Right, so I'll take  Consistency and Availability, and I'll  put my
 2
masters in
   the same rack (which has redundant switches,  power  supply, etc.)
 and
thus
   minimize/avoid  partitioning.
   Assuming the above  actually works, I  think my Q remains:
  
   How do you  set up 2 Solr masters so  they are in near real-time
  sync?
   DRBD?
  
   But here is maybe a simpler  scenario that more  people may be
   considering:
  
   Imagine 2 masters on 2  different servers in 1  rack, pointing to
 the
 same
   index
on the shared  storage (SAN) that also happens to live in the same
 rack.
   2 Solr masters are  behind 1 LB VIP that  indexer talks to.
   The VIP is configured so that  all  requests always get routed to
 the
primary
   master (because only 1 master  can be  modifying an index at a
 time),
   except  when
   this primary is down,  in which case the  requests are sent to 

java.lang.ClassCastException being thrown seemingly at random

2011-03-09 Thread harish.agarwal
Hello, 

I'm using a recent build of the trunk (from 3/1).  I've noticed that after
the index is up and running for some time I start to get intermittent errors
that look like this: 


Mar 2, 2011 9:26:01 AM org.apache.solr.common.SolrException log 
SEVERE: java.lang.ClassCastException 


The queries I get the error against are seemingly random and do not
consistently throw the error - in fact, every time I re-test a query that
produced this error, it completes successfully.  This is also the total
extent of the error recorded in the logs; there is no traceback.

I'm not even sure how to begin debugging the problem, any suggestions or
pointers as to what may be going wrong would be greatly appreciated. 

-Harish

--
View this message in context: 
http://lucene.472066.n3.nabble.com/java-lang-ClassCastException-being-thrown-seemingly-at-random-tp2658331p2658331.html
Sent from the Solr - User mailing list archive at Nabble.com.


Caching filter question / code review

2011-03-09 Thread Mark
I created the following SearchComponent that wraps a deduplicate filter 
around the current query and added it to last-components. It appears to 
be working, but is there any way I can improve the performance? Would 
this be considered and added to the filtercache? Am I even caching 
correctly?


Thanks for any input/suggestions

...
  private Map<String, Filter> filtersByField = new HashMap<String, Filter>();


  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
SolrParams params = rb.req.getParams();

if (params.getBool(DuplicateParams.DEDUPLICATE, false)) {
  String field = params.get(DuplicateParams.DUPLICATE_FIELD);

  if (field == null) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, 
Deduplicate field is required);

  }

  Filter filter = filtersByField.get(field);

  if (filter == null) {
filter = new CachingWrapperFilter(new DuplicateFilter(field, 
DuplicateFilter.KM_USE_FIRST_OCCURRENCE, 
DuplicateFilter.PM_FAST_INVALIDATION));

filtersByField.put(field, filter);
  }

  rb.getFilters().add(new FilteredQuery(rb.getQuery(), filter));
}
  }
...


Re: java.lang.ClassCastException being thrown seemingly at random

2011-03-09 Thread Yonik Seeley
On Wed, Mar 9, 2011 at 8:34 PM, harish.agarwal harish.agar...@gmail.com wrote:
 I'm using a recent build of the trunk (from 3/1).  I've noticed that after
 the index is up and running for some time I start to get intermittent errors
 that look like this:

 Mar 2, 2011 9:26:01 AM org.apache.solr.common.SolrException log
 SEVERE: java.lang.ClassCastException

This was probably fixed today:
https://issues.apache.org/jira/browse/LUCENE-2953

-Yonik
http://lucidimagination.com


Re: NRT in Solr

2011-03-09 Thread Bill Bell
So it looks like it can handle adding new documents and expiring old
documents. Updating a document is not part of the game.
This would work well for message boards or tweet-type solutions.

Solr can do this as well directly. Why wouldn't you just improve the
document and facet caching so that when you append there is not a huge hit
to Solr? Also we could add an expiration to documents as well.

The big issue for me is that when I update Solr I need to replicate that
change quickly to all slaves. If we changed replication to stream to the
slaves in Near Real Time and not have to create a whole new index version,
warming, etc, that would be awesome. That combined with better caching
smarts and we have a near perfect solution.

Thanks.

On 3/9/11 3:29 PM, Smiley, David W. dsmi...@mitre.org wrote:

Zoie adds NRT to Solr:
http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin

I haven't tried it yet but looks cool.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Mar 9, 2011, at 9:01 AM, Jason Rutherglen wrote:

 Jae,
 
 NRT hasn't been implemented as of yet in Solr, I think partially
 because major features such as replication, caching, and uninverted
 faceting suddenly are no longer viable, eg, it's another round of
 testing etc.  It's doable, however I think the best approach is a
 separate request call path, to avoid altering the current [working]
 API.
 
 On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo jaejo...@gmail.com wrote:
 Hi,
 Is NRT in Solr 4.0 from trunk? I have checked out from trunk, but could not
 find the configuration for NRT.
 
 Regards
 
 Jae
 









Re: docBoost

2011-03-09 Thread Bill Bell
Yes, just add an if statement based on a field value and do a row.put() only if
that other field holds a certain value.
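A sketch of what that conditional could look like inside the ScriptTransformer from the earlier reply (the field name other_field and the trigger value are made up for illustration):

```xml
<script><![CDATA[
    function f1(row) {
        // hypothetical condition: only boost rows whose "other_field"
        // column contains a particular value
        if (row.get('other_field') == 'featured') {
            row.put('$docBoost', 1.5);
        }
        return row;
    }
]]></script>
```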



On 3/9/11 1:39 PM, Brian Lamb brian.l...@journalexperts.com wrote:

That makes sense. As a follow up, is there a way to only conditionally use
the boost score? For example, in some cases I want to use the boost score
and in other cases I want all documents to be treated equally.

On Wed, Mar 9, 2011 at 2:42 PM, Jayendra Patil
jayendra.patil@gmail.com
 wrote:

 you can use the ScriptTransformer to perform the boost calcualtion and
 addition.
 http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

 <dataConfig>
    <script><![CDATA[
        function f1(row) {
            // Add boost
            row.put('$docBoost', 1.5);
            return row;
        }
    ]]></script>
    <document>
        <entity name="e" pk="id" transformer="script:f1"
                query="select * from X">
        </entity>
    </document>
 </dataConfig>

 Regards,
 Jayendra


 On Wed, Mar 9, 2011 at 2:01 PM, Brian Lamb
 brian.l...@journalexperts.com wrote:
  Anyone have any clue on this on?
 
  On Tue, Mar 8, 2011 at 2:11 PM, Brian Lamb 
 brian.l...@journalexperts.comwrote:
 
  Hi all,
 
  I am using dataimport to create my index and I want to use docBoost
to
  assign some higher weights to certain docs. I understand the concept
 behind
  docBoost but I haven't been able to find an example anywhere that
shows
 how
  to implement it. Assuming the following config file:
 
   <document>
      <entity name="animal"
              dataSource="animals"
              pk="id"
              query="SELECT * FROM animals">
          <field column="id" name="id" />
          <field column="genus" name="genus" />
          <field column="species" name="species" />
          <entity name="boosters"
                  dataSource="boosts"
                  query="SELECT boost_score FROM boosts WHERE animal_id = ${animal.id}">
              <field column="boost_score" name="boost_score" />
          </entity>
      </entity>
   </document>
 
  How do I add in a docBoost score? The boost score is currently in a
  separate table as shown above.
 
 





Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

2011-03-09 Thread Karthik Shiraly
Hi,

I'm using Solr 1.4.1.
The scenario involves a user uploading multiple files. These have content
extracted using SolrCell, and are then indexed by Solr along with other
information about the user.

ContentStreamUpdateRequest seemed like the right choice for this - use
addFile() to send file data, and use setParam() to add normal data fields.

However, when I do multiple addFile() to ContentStreamUpdateRequest, I
observed that at the server side, even the file parts of this multipart post
are interpreted as regular form fields by the FileUpload component.
I found that FileUpload does so because the filename value in the
Content-Disposition header of each part is not being set.
Digging a bit further, it seems the actual root cause is in the client-side
SolrJ API: the CommonsHttpSolrServer class is not setting the filename
value in the Content-Disposition header while creating multipart Part
instances (from the HttpClient framework).

I solved this problem with a hack: in the CommonsHttpSolrServer.request()
method where the PartBase instances are created, I overrode
sendDispositionHeader() and added the filename value. That solved the
problem.
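For illustration, a sketch of the Content-Disposition header that Commons FileUpload inspects (the helper function is made up purely to show the header shape; the real fix lives in the Java PartBase subclass):

```javascript
// Builds the Content-Disposition header value for one multipart part.
// Commons FileUpload treats a part as an uploaded file only when the
// header carries a filename parameter; without it, the part is parsed
// as a plain form field -- which is exactly the bug described above.
function dispositionHeader(fieldName, fileName) {
  var header = 'form-data; name="' + fieldName + '"';
  if (fileName) {
    header += '; filename="' + fileName + '"';
  }
  return header;
}
```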

However, my questions are:
1. Am I using ContentStreamUpdateRequest wrong, or is this actually a bug?
Should I be using something else?

2. My end goal is to map contents of each file to *separate* fields, not a
common field. Since the regular ExtractingRequestHandler maps all content to
just one field, I believe I've to create a custom RequestHandler (possibly
reusing existing SolrCell classes).
Is this approach right?

Thanks
Karthik


Re: NRT in Solr

2011-03-09 Thread Lance Norskog
Please start new threads for new conversations.

On Wed, Mar 9, 2011 at 2:27 AM, stockii stock.jo...@googlemail.com wrote:
 question: http://wiki.apache.org/solr/NearRealtimeSearchTuning


 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x'

 I got this message.
 In my solrconfig.xml, maxWarmingSearchers=4. If I set it to 1 or 2 I get an
 exception; with 4 I get no exception, only the performance warning. The
 wiki article says the best solution is to set maxWarmingSearchers to 1.
 How can this work?

 -
 --- System 
 

 One Server, 12 GB RAM, 2 Solr Instances, 7 Cores,
 1 Core with 31 Million Documents, other Cores < 100.000

 - Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
 - Solr2 for Update-Request  - delta every Minute - 4GB Xmx
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/NRT-in-Solr-tp2652689p2654696.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

2011-03-09 Thread Karthik Shiraly
In case the exact problem was not clear to somebody:
The problem with FileUpload interpreting file data as regular form fields is
that Solr then thinks there are no content streams in the request and throws
a missing_content_stream exception.

On Thu, Mar 10, 2011 at 10:59 AM, Karthik Shiraly 
karthikshiral...@gmail.com wrote:

 Hi,

 I'm using Solr 1.4.1.
 The scenario involves user uploading multiple files. These have content
 extracted using SolrCell, then indexed by Solr along with other information
 about the user.

 ContentStreamUpdateRequest seemed like the right choice for this - use
 addFile() to send file data, and use setParam() to add normal data fields.

 However, when I do multiple addFile() to ContentStreamUpdateRequest, I
 observed that at the server side, even the file parts of this multipart post
 are interpreted as regular form fields by the FileUpload component.
 I found that FileUpload does so because the filename value in the
 Content-Disposition header of each part is not being set.
 Digging a bit further, it seems the actual root cause is in the client-side
 SolrJ API: the CommonsHttpSolrServer class is not setting the filename
 value in the Content-Disposition header while creating multipart Part
 instances (from the HttpClient framework).

 I solved this problem by a hack - in CommonsHttpSolrServer.request() method
 where the PartBase instances are created, I overrode
 sendDispositionHeader() and added filename value. That solved the
 problem.

 However, my questions are:
 1. Am I using ContentStreamUpdateRequest wrong, or is this actually a bug?
 Should I be using something else?

 2. My end goal is to map contents of each file to *separate* fields, not a
 common field. Since the regular ExtractingRequestHandler maps all content to
 just one field, I believe I've to create a custom RequestHandler (possibly
 reusing existing SolrCell classes).
 Is this approach right?

 Thanks
 Karthik