Re: when to change rows param?

2011-04-12 Thread Paul Libbrecht
Hoss,

As of now I have managed to adjust this in the client code before it reaches
the server, so it is not urgent at all anymore.

I wanted to avoid touching the client code (which is giving me, oh great fun, MSIE
concurrency miseries), hence I wanted a server-side rewrite of the maximum
number of hits returned. Thus far my server customizations, apart from a custom
solrconfig and schema, are a query component and a response handler.

I thought that injecting the rows param in the query component (derived from
the limit param my client sends) would have been enough. But it seems not to be
the case.
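
For reference, this is roughly the shape of the rewrite I mean, sketched against
the Solr 1.4-style component API (the "limit" parameter name comes from my ExtJS
client; the cap of 50 and the class name are only illustrative, and I have not
verified this exact code):

import java.io.IOException;

import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class RowsRewritingQueryComponent extends QueryComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // copy the incoming params so they can be modified
    ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
    String limit = params.get("limit");            // name sent by the ExtJS client
    if (limit != null) {
      // cap the requested page size server-side (50 is just an example)
      params.set(CommonParams.ROWS, Math.min(Integer.parseInt(limit), 50));
    }
    // replace the request params before the standard prepare/process read them
    rb.req.setParams(params);
    super.prepare(rb);
  }
}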

paul


Le 12 avr. 2011 à 02:07, Chris Hostetter a écrit :

 
 Paul: can you elaborate a little bit on what exactly your problem is?
 
 - what is the full component list you are using?
 - how are you changing the param value (ie: what does the code look like)
 - what isn't working the way you expect?
 
 : I've been using my own QueryComponent (that extends the search one) 
 : successfully to rewrite web-received parameters that are sent from the 
 : (ExtJS-based) javascript client. This allows an amount of 
 : query-rewriting, that's good. I tried to change the rows parameter there 
 : (which is limit in the query, as per the underpinnings of ExtJS) but 
 : it seems that this is not enough.
 : 
 : Which component should I subclass to change the rows parameter?
 
 -Hoss



Re: Can I set up a config-based distributed search

2011-04-12 Thread Ran Peled
Thanks, Ludovic and Jonathan.  Yes, this configuration default is exactly
what I was looking for.
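
For the archives, the kind of configuration default meant here is roughly the
following in solrconfig.xml (handler name and shard addresses are placeholders):

<requestHandler name="/distributed" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">shard1.example.com:8983/solr,shard2.example.com:8983/solr</str>
  </lst>
  <!-- use <lst name="invariants"> instead if the value must not be overridable per request -->
</requestHandler>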

Ran


On Mon, Apr 11, 2011 at 7:12 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 I have not worked with shards/distributed, but I think you can probably
 specify them as defaults in your requesthandler in your solrconfig.xml
 instead.

 Somewhere there is (or was) a wiki page on this I can't find right now.
 There's a way to specify (for a particular request handler) a default
 parameter value, such as for 'shards', that will be used if none were given
 with the request. There's also a way to specify an invariant that will
 always be used even if something else is passed in on the request.

 Ah, found it: http://wiki.apache.org/solr/SearchHandler#Configuration


 On 4/11/2011 8:31 AM, Ran Peled wrote:

 In the Distributed Search page (
 http://wiki.apache.org/solr/DistributedSearch), it is documented that in
 order to perform a distributed search over a sharded index, I should use
 the
 shards request parameter, listing the shards to participate in the
 search
 (e.g. ?shards=localhost:8983/solr,localhost:7574/solr).   I am planning a
 new pretty large index (1B+ items).  Say I have a 100 shards, specifying
 the
 shards on the request URL becomes unrealistic due to length of URL.  It is
 also redundant to do that on every request.

 Is there a way to specify the list of shards in a configuration file,
 instead of on the query URL?  I have seen references to relevant config in
 SolrCloud, but as I understand it planned to be released only in Solr 4.0.

 Thanks,
 Ran




exceeded limit of maxWarmingSearchers = 4 =(

2011-04-12 Thread stockii
hello.

My NRT search is not configured correctly =(

Two Solr instances: one searcher and one updater.

The updater starts an update of around 3000 documents every minute, and the
searcher starts a commit every minute to refresh the index and read the new
docs.

These are my cache values for a 36 million document index:




After a restart my warmup time is about 1700 ms.
Do you think I need to set the autowarmCount of every cache to near zero?
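
For reference, this is the kind of change I mean in solrconfig.xml (sizes are
just my current values; autowarmCount=0 disables autowarming for that cache):

<filterCache class="solr.LRUCache"
             size="3000"
             initialSize="50"
             autowarmCount="0"/>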




-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/exceeded-limit-of-maxWarmingSearchers-4-tp2810380p2810380.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 3.1 performance compared to 1.4.1

2011-04-12 Thread Marius van Zwijndregt
Hi Lance,

Well, I did not actually copy over the whole configuration files; instead I just
added the missing configuration (into a fresh copy of the example directory).

By the Directory implementation, do you mean the readers used by
SolrIndexSearcher?

These are:
reader : SolrIndexReader{this=1cb04a0,r=ReadOnlyDirectoryReader@1cb04a0
,refCnt=1,segments=1}
readerDir : 
org.apache.lucene.store.NIOFSDirectory@/opt/solr3/example/solr/data/index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@1efc208

But it seems the performance is actually still improving; at the moment the
average has dropped even lower, to 28 ms (in comparison to 43 ms in 1.4.1).

Cheers!

Marius

2011/4/12 Lance Norskog goks...@gmail.com

 Marius: I have copied the configuration from 1.4.1 to the 3.1.

 Does the Directory implementation show up in the JMX beans? In
 admin/statistics.jsp ? Or the Solr startup logs? (Sorry, don't have a
 Solr available.)

 Yonik:
  What platform are you on?  I believe the Lucene Directory
  implementation now tries to be smarter (compared to lucene 2.9) about
  picking the best default (but it may not be working out for you for
  some reason)

 Lance

 On Sun, Apr 10, 2011 at 12:46 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
  On Fri, Apr 8, 2011 at 9:53 AM, Marius van Zwijndregt
  pionw...@gmail.com wrote:
  Hello !
 
  I'm new to the list, have been using SOLR for roughly 6 months and love
 it.
 
  Currently i'm setting up a 3.1 installation, next to a 1.4.1
 installation
  (Ubuntu server, same JVM params). I have copied the configuration from
 1.4.1
  to the 3.1.
  Both version are running fine, but one thing ive noticed, is that the
 QTime
  on 3.1, is much slower for initial searches than on the (currently
  production) 1.4.1 installation.
 
  For example:
 
  Searching with 3.1; http://mysite:9983/solr/select?q=grasmaaier: QTime
  returns 371
  Searching with 1.4.1: http://mysite:8983/solr/select?q=grasmaaier:
 QTime
  returns 59
 
  Using debugQuery=true, i can see that the main time is spend in the
 query
  component itself (org.apache.solr.handler.component.QueryComponent).
 
  Can someone explain this, and how can i analyze this further ? Does it
 take
  time to build up a decent query, so could i switch to 3.1 without having
 to
  worry ?
 
  Thanks for the report... there's no reason that anything should really
  be much slower, so it would be great to get to the bottom of this!
 
  Is this using the same index as the 1.4.1 server, or did you rebuild it?
 
  Are there any other query parameters (that are perhaps added by
  default, like faceting or anything else that could take up time) or is
  this truly just a term query?
 
  What platform are you on?  I believe the Lucene Directory
  implementation now tries to be smarter (compared to lucene 2.9) about
  picking the best default (but it may not be working out for you for
  some reason).
 
  -Yonik
  http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
  25-26, San Francisco
 



 --
 Lance Norskog
 goks...@gmail.com



High (io) load and org.mortbay.jetty.EofException

2011-04-12 Thread Marius van Zwijndregt
Hello !

Every night within my maintenance window, during high load caused by
PostgreSQL (vacuum analyze), I see a few (10-30) of the following messages
showing up in the Solr 3.1 logfile.

SEVERE: org.mortbay.jetty.EofException
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791)
at
org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
at
org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:278)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
at org.apache.solr.common.util.FastWriter.flush(FastWriter.java:115)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:344)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at org.mortbay.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:368)
at org.mortbay.io.bio.StreamEndPoint.flush(StreamEndPoint.java:129)
at org.mortbay.io.bio.StreamEndPoint.flush(StreamEndPoint.java:161)
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:714)
... 25 more


The client application receives a 408 Request Timeout, and the search is
stopped.

Does anyone know what might cause this, and how I can prevent it from
happening? I think this might be related to the time Jetty is willing to wait
before my client starts sending the HTTP request, or to the client aborting
the request prematurely.

Cheers!
Marius


Re: exceeded limit of maxWarmingSearchers = 4 =(

2011-04-12 Thread stockii
I start a commit on the searcher core with:
.../core/update?commit=true&waitFlush=false

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/exceeded-limit-of-maxWarmingSearchers-4-tp2810380p2810458.html
Sent from the Solr - User mailing list archive at Nabble.com.


Berlin Buzzwords - conference schedule released

2011-04-12 Thread Simon Willnauer
Hey folks,

The Berlin Buzzwords team recently released the schedule for the conference on
high scalability. The conference focuses on the topics of search, data analysis
and NoSQL. It takes place on June 6/7th 2011 in Berlin.

We are looking forward to two awesome keynote speakers who shaped the world of
open source data analysis: Doug Cutting (founder of Apache Lucene and Hadoop)
as well as Ted Dunning (Chief Application Architect at MapR Technologies and
active developer at Apache Hadoop and Mahout).

This year the program has been extended by one additional track. The first
conference day focuses on the topics Apache Lucene, NoSQL, messaging and data
mining. Speakers include Jakob Homan from Yahoo!, who will give an introduction
to the new Hadoop security features, and Daniel Einspanjer, who is going to
show how NoSQL and Hadoop are being used at Mozilla Socorro. In addition,
Chris Male gives a presentation on how to integrate Solr with J2EE
applications.

The second day features presentations by Jonathan Gray on Facebook's use of
HBase in their Messaging architecture; Dawid Weiss, Simon Willnauer and Uwe
Schindler showing the latest Apache Lucene developments; Mark Miller providing
insights into Solr performance; and Mathias Stearn discussing MongoDB
scalability questions.

"For our developers Berlin Buzzwords is a great chance to introduce our open
source project Couchbase (based on Apache CouchDB and Memcached), get in touch
with interested users and discuss their technical questions on site," says Jan
Lehnardt, Co-Founder of Couchbase (the merger of CouchOne and Membase, formerly
NorthScale) [1].

Registration is open; regular tickets are available for 440,- Euro. There is a
group discount. Prices include coffee break and lunch catering.

After the conference there will be trainings on topics related to Berlin
Buzzwords, such as Enterprise Search with Apache Lucene and Solr [2]. For the
very first time we will also have community-organised hackathons that give
Berlin Buzzwords visitors the opportunity to work together with the projects'
developers on interesting tasks.

Berlin Buzzwords is produced by newthinking communications in
collaboration with Isabel Drost (Member of the Apache Software
Foundation, PMC member Apache community development and co-founder of
Apache Mahout), Jan Lehnardt (PMC Chair Apache CouchDB) and Simon
Willnauer (PMC member Apache Lucene).

[1] http://www.heise.de/open/meldung/NoSQL-CouchOne-und-Membase-fusionieren-zu-Couchbase-1185227.html
[2] http://www.jteam.nl/training/2-day-training-Lucene-Solr.html


Re: exceeded limit of maxWarmingSearchers = 4 =(

2011-04-12 Thread stockii
My filterCache has a warmupTime of ~6000 ms, but my config is like this:
LRU Cache(maxSize=3000, initialSize=50, autowarmCount=50 ...)

Should I set maxSize to 50 or a similar value?

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/exceeded-limit-of-maxWarmingSearchers-4-tp2810380p2810561.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: exceeded limit of maxWarmingSearchers = 4 =(

2011-04-12 Thread stockii
Oooh, my queryResultCache has a warmupTime of 54000 ms = ~1 minute.

Any suggestions?


-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/exceeded-limit-of-maxWarmingSearchers-4-tp2810380p2810572.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Decrease warmupTime

2011-04-12 Thread stockii
I'm fighting with the same problem, but with Jetty.

Is it necessary in this case to also delete the Jetty work dir?

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Decrease-warmupTime-tp494023p2810607.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Best Practice

2011-04-12 Thread Darx Oman
Hi Lance

Thanks for your reply, but I have a question:
is this patch committed to trunk?


AbstractSolrTestCase and Solr 3.1.0

2011-04-12 Thread Tommaso Teofili
Hi all,
I am porting a series of Solr plugins previously developed for version 1.4.1
to 3.1.0. I've written some integration tests extending the
AbstractSolrTestCase [1] utility class, but now it seems that it wasn't included
in the solr-core 3.1.0 artifact, as it's in the solr/src/test directory. Was
that a choice for the release, or am I missing something (or both)? Should
I replace it with a different class with the same scope, or should I refactor
my integration tests in a different way?
Thanks in advance for any feedback.
Regards,
Tommaso

[1] :
http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_1/solr/src/test/org/apache/solr/util/AbstractSolrTestCase.java


function query apply only in the subset of the query

2011-04-12 Thread Marco Martinez
Hi everyone,

My situation is the following: I need to add the value of a field to the score
of the docs returned by the query, but not to all docs. Example:

q=car returns 3 docs

1-
name=car ford
marketValue=1
score=1.3

2-
name=car citroen
marketValue=2
score=1.3

3-
name=car mercedes
marketValue=0.5
score=1.3

but if I want to add the marketValue to the score, my returned list is the
following:

q=car+_val_:marketValue

1-
name=bus
marketValue=5
score=5

2-
name=car citroen
marketValue=2
score=3.3

3-
name=car ford
marketValue=1
score=2.3

4-
name=car mercedes
marketValue=0.5
score=1.8


Is it possible to apply the function query only to the documents returned by
the first query?


Thanks in advance,

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


Help with Nested Query

2011-04-12 Thread Hasnain
Hi,

I'm trying to do something like this in Solr 1.4.1:
fq=category_id:(24 79)

However, the values inside the parentheses will be fetched through another
query. So far I've tried using _query_, but it doesn't work the way I want it
to. Here is what I'm trying:

fq=category_id:(_query_:"{!lucene fl=category_id} video")

any suggestions on this?
thanks in advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-with-Nested-Query-tp2811038p2811038.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solrj retry handling - prevent ProtocolException: Unbuffered entity enclosing request can not be repeated

2011-04-12 Thread Martin Grotzke
Hi,

from time to time we're seeing a "ProtocolException: Unbuffered entity
enclosing request can not be repeated." in the logs when sending ~500
docs to Solr (the stack trace is at the end of the email).

I'm aware that this was discussed before (e.g. [1]) and our solution was
already to reduce the number of docs that are sent to solr.

However, I think that the issue might be solved in solrj. This
discussion on the httpclient-dev mailing list [2] points out the
solution under option 3): re-instantiate the input stream and retry the
request manually.

AFAICS CommonsHttpSolrServer.request, when _maxRetries is set to s.th. > 0
(see [3]), already does some retry stuff, but not around the actual
http method execution (_httpClient.executeMethod(method)). I'm not sure what
the several tries are implemented for, but I'd say that if the user
sets maxRetries to s.th. > 0, the http method execution should also be retried.

Another thing is the ProtocolException actually seen: AFAICS this is
thrown because httpclient (HttpMethodDirector.executeWithRetry) performs a
retry itself (see [4]) while the HttpMethod actually being processed does not
support this.

As HttpMethodDirector.executeWithRetry already checks for a
HttpMethodRetryHandler (under param HttpMethodParams.RETRY_HANDLER,
[5]), it seems as if it would be enough to add such a handler for the
update/POST requests to prevent the ProtocolException.

So in summary I suggest two things:
1) Retry http method execution when maxRetries is > 0
2) Prevent HttpClient from doing retries (by adding a HttpMethodRetryHandler;
sketch below)
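
To illustrate 2), this is roughly what I have in mind (commons-httpclient 3.x;
the retry count of 0, the class name and the URL are only for illustration):

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class NoAutomaticRetryExample {
  public static void main(String[] args) {
    PostMethod post = new PostMethod("http://localhost:8983/solr/update");
    // a per-method retry handler with retryCount=0 keeps HttpMethodDirector
    // from replaying the unbuffered POST entity; retries would then be done
    // explicitly by solrj around executeMethod()
    post.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
        new DefaultHttpMethodRetryHandler(0, false));
  }
}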

I first wanted to post it here on the list to see if there are
objections or other solutions. Or if there are plans to replace commons
httpclient (3.x) by s.th. like apache httpclient 4.x or async-http-client.

If there's an agreement that the proposed solution is the way to go ATM
I'd submit an appropriate issue for this.

Any comments?

Cheers,
Martin



[1]
http://lucene.472066.n3.nabble.com/Unbuffered-entity-enclosing-request-can-not-be-repeated-tt788186.html

[2]
http://www.mail-archive.com/commons-httpclient-dev@jakarta.apache.org/msg06723.html

[3]
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/src/solrj/org/apache/solr/client/solrj/impl/CommonsHttpSolrServer.java?view=markup#l281

[4]
http://svn.apache.org/viewvc/httpcomponents/oac.hc3x/trunk/src/java/org/apache/commons/httpclient/HttpMethodDirector.java?view=markup#l366

[5]
http://svn.apache.org/viewvc/httpcomponents/oac.hc3x/trunk/src/java/org/apache/commons/httpclient/HttpMethodDirector.java?view=markup#l426


Stack trace:

Caused by: org.apache.commons.httpclient.ProtocolException: Unbuffered
entity enclosing request can not be repeated.
at
org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:487)
at
org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2110)
at
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1088)
at
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)


-- 
Martin Grotzke
http://twitter.com/martin_grotzke





Updates during Optimize

2011-04-12 Thread stockii
Hello.

When I start an optimize (which takes more than 4 hours), no updates from
DIH are possible.
I thought Solr copies the whole index, starts an optimize on the copy, and
does not lock the index and optimize it in place ... =(

Any way to do both at the same time?




-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Updates-during-Optimize-tp2811183p2811183.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: AbstractSolrTestCase and Solr 3.1.0

2011-04-12 Thread Robert Muir
On Tue, Apr 12, 2011 at 6:44 AM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:
 Hi all,
 I am porting a previously series of Solr plugins developed for 1.4.1 version
 to 3.1.0, I've written some integration tests extending the
 AbstractSolrTestCase [1] utility class but now it seems that wasn't included
 in the solr-core 3.1.0 artifact as it's in the solr/src/test directory. Was
 that a choice for the release or it's me missing something (or both)? Should
 I replace it with a different class with same scope or should I refactor
 my integration tests in a different way?
 Thanks in advance for any feedback.

Hi Tommaso:

this class (and other test code) was changed to depend upon lucene's
test code... due to this it moved to src/test. The issue to make a
solr test-framework jar file didn't make 3.1
(https://issues.apache.org/jira/browse/SOLR-2061); however, it's committed
to the 3.x branch.

note: the class as it is in solr 3.1 is un-extendable by an outside
project, I think. I cleaned up these classes some in SOLR-2061 and
tested it all with an external project, so it should be OK now in the
branch.


Re: XML not coming through from nabble to Gmail

2011-04-12 Thread Erick Erickson
Chris:

Here's the nabble URL:

http://lucene.472066.n3.nabble.com/Strip-spaces-and-new-line-characters-from-data-tp2795453p2795453.html

The message in the Solr list is from alexei on 8-April. Strip spaces and
newline characters from data.

This started happening a couple (?) of weeks ago and I don't remember
changing anything. Yeah, sure, they all say that

This bit of XML that alexei included just doesn't come through to my gmail
account, it'll be interesting to see if it makes it out

<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true"
omitNorms="true">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.TrimFilterFactory" />
</fieldType>


Thanks,
Erick

On Mon, Apr 11, 2011 at 9:06 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : I see the same problem (missing markup) in Thunderbird. Seems like Nabble
 : might be the culprit?

 if someone can cite some specific examples (by email message-id, or
 subject, or date+sender, or url from nabble, or url from any public
 archive, or anything more specific then posts from nabble containing
 xml) we can check the official apache mail archive which contains the
 raw message as recieved by ezmlm., such as..


 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201104.mbox/raw/%3cbanlktimcpthzalstrwhn3rtzpxdzkbo...@mail.gmail.com%3E



 -Hoss



Re: XML not coming through from nabble to Gmail

2011-04-12 Thread Erick Erickson
FWIW, I see the XML I just sent in Gmail, so I'm guessing the problem is over
on the Nabble side, but I have very little evidence...

Erick

P.S. It's not a huge deal, getting to the correct message on nabble is just
a click away. But it is a bit annoying.

On Tue, Apr 12, 2011 at 8:38 AM, Erick Erickson erickerick...@gmail.comwrote:

 Chris:

 Here's the nabble URL:


 http://lucene.472066.n3.nabble.com/Strip-spaces-and-new-line-characters-from-data-tp2795453p2795453.html

 The message in the Solr list is from alexei on 8-April. Strip spaces and
 newline characters from data.

 This started happening a couple (?) of weeks ago and I don't remember
 changing anything. Yeah, sure, they all say that

 This bit of XML that alexei included just doesn't come through to my gmail
 account, it'll be interesting to see if it makes it out

 <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true"
 omitNorms="true">
 <tokenizer class="solr.KeywordTokenizerFactory"/>
 <filter class="solr.TrimFilterFactory" />
 </fieldType>


 Thanks,
 Erick

 On Mon, Apr 11, 2011 at 9:06 PM, Chris Hostetter hossman_luc...@fucit.org
  wrote:


 : I see the same problem (missing markup) in Thunderbird. Seems like
 Nabble
 : might be the culprit?

 if someone can cite some specific examples (by email message-id, or
 subject, or date+sender, or url from nabble, or url from any public
 archive, or anything more specific then posts from nabble containing
 xml) we can check the official apache mail archive which contains the
 raw message as recieved by ezmlm., such as..


 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201104.mbox/raw/%3cbanlktimcpthzalstrwhn3rtzpxdzkbo...@mail.gmail.com%3E



 -Hoss





Re: DIH OutOfMemoryError?

2011-04-12 Thread stockii
"Make sure streaming is on."
-- how do I check that?
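
(For reference, with a MySQL JdbcDataSource streaming is usually switched on
via batchSize="-1" in data-config.xml, so I guess the dataSource definition is
the place to look; the connection details below are placeholders:)

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/mydb"
            user="user" password="password"
            batchSize="-1"/>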

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-OutOfMemoryError-tp2759013p2811270.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrException: Unavailable Service

2011-04-12 Thread Phong Dais
Hi,

I did not want to hijack this thread (
http://www.mail-archive.com/solr-user@lucene.apache.org/msg34181.html)
but I am experiencing the same exact problem mentioned here.

To sum up the issue, I am getting an intermittent Unavailable Service
exception during the indexing commit phase.
I know that I am calling commit very often, but I do not see any way around
this. This is my situation: I am
indexing a huge number of documents using multiple instances of a SolrJ client
running on multiple servers. There is no way
for me to control when commit is called from these clients, so two different
clients can call commit at the same time.
I am not sure if I can/should use auto/timed commit, because I need to know
if a commit failed so I can roll back the batch that failed.

What kind of options do I have?
Should I try to catch the exception and keep trying to re-commit until it
goes through? I can see some potential problems with this approach.
Do I need to write a request broker to queue up all these commits and send
them to Solr one by one in a timely manner?

Just wanted to know if anyone has a solution for this problem before I dive
off the deep end.

Thanks,
Phong


RE: XML not coming through from nabble to Gmail

2011-04-12 Thread Steven A Rowe
I've asked on Nabble if they know of a fix for the problem:

http://nabble-support.1.n2.nabble.com/solr-dev-mailing-list-tp6023495p6264955.html

Steve

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Tuesday, April 12, 2011 8:43 AM
 To: Chris Hostetter
 Cc: solr-user@lucene.apache.org
 Subject: Re: XML not coming through from nabble to Gmail
 
 FWIW, I see the xml I just sent in gMail, so I'm guessing things are over
 on
 the nabble side, but I have very little evidence..
 
 Erick
 
 P.S. It's not a huge deal, getting to the correct message on nabble is
 just
 a click away. But it is a bit annoying.
 
 On Tue, Apr 12, 2011 at 8:38 AM, Erick Erickson
 erickerick...@gmail.comwrote:
 
  Chris:
 
  Here's the nabble URL:
 
 
  http://lucene.472066.n3.nabble.com/Strip-spaces-and-new-line-characters-
 from-data-tp2795453p2795453.html
 
  The message in the Solr list is from alexei on 8-April. Strip spaces
 and
  newline characters from data.
 
  This started happening a couple (?) of weeks ago and I don't remember
  changing anything. Yeah, sure, they all say that
 
  This bit of XML that alexei included just doesn't come through to my
 gmail
  account, it'll be interesting to see if it makes it out
 
  <fieldType name="sint" class="solr.SortableIntField"
 sortMissingLast="true"
  omitNorms="true">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.TrimFilterFactory" />
  </fieldType>
 
 
  Thanks,
  Erick
 
  On Mon, Apr 11, 2011 at 9:06 PM, Chris Hostetter
 hossman_luc...@fucit.org
   wrote:
 
 
  : I see the same problem (missing markup) in Thunderbird. Seems like
  Nabble
  : might be the culprit?
 
  if someone can cite some specific examples (by email message-id, or
  subject, or date+sender, or url from nabble, or url from any public
  archive, or anything more specific then posts from nabble containing
  xml) we can check the official apache mail archive which contains the
  raw message as recieved by ezmlm., such as..
 
 
  http://mail-archives.apache.org/mod_mbox/lucene-solr-
 user/201104.mbox/raw/%3cbanlktimcpthzalstrwhn3rtzpxdzkbo...@mail.gmail.com
 %3E
 
 
 
  -Hoss
 
 
 


Re: SolrException: Unavailable Service

2011-04-12 Thread Erick Erickson
If your commit from the client fails, you don't really know the
state of your index anyway. All the threads you have sending
documents to Solr are adding them to a single internal buffer.
Committing flushes that buffer.

So if thread 1 gets an error on commit, it will presumably
have some documents from thread 2 in the commit. But
thread 2 won't necessarily see the results. So I don't think
your statement about needing to know if a commit fails
is really

On Tue, Apr 12, 2011 at 8:50 AM, Phong Dais phong.gd...@gmail.com wrote:

 Hi,

 I did not want to hijack this thread (
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg34181.html)
 but I am experiencing the same exact problem mentioned here.

 To sum up the issue, I am getting intermittent Unavailable Service
 exception during indexing commit phase.
 I know that I am calling commit very often but I do not see any way
 around
 this.  This is my situation, I am
 indexing a huge amount of documents using multiple instance of SolrJ client
 running on multiple servers.  There is no way
 for me control when commit is called from these clients, so two different
 clients can call commit at the same time.
 I am not sure if I can/should use auto/timed commit because I need to know
 if a commit failed so I can rollback the batch that failed.

 What kind of options do I have?
 Should I try to catch the exception and keep trying to recommit until it
 goes through?  I can see some potential of problems with this approach.
 Do I need to write a request broker to queue up all these commit and send
 them to solr one by one in a timely manner?

 Just wanted to know if anyone has a solution for this problem before I dive
 off the deep end.

 Thanks,
 Phong



Re: SolrException: Unavailable Service

2011-04-12 Thread Erick Erickson
Sorry, fat fingers. Sent that last e-mail inadvertently.

Anyway, if I have this correct, I'd recommend going to
autocommit and NOT committing from the clients. That's
usually the recommended procedure.
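
Something like this in solrconfig.xml is what I mean (the thresholds are only
illustrative, tune them to your update rate):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>   <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>   <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>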

This is especially true if you have a master/slave setup,
because each commit from each client will trigger
(potentially) a replication.

Best
Erick

On Tue, Apr 12, 2011 at 9:07 AM, Erick Erickson erickerick...@gmail.comwrote:

 If your commit from the client fails, you don't really know the
 state of your index anyway. All the threads you have sending
 documents to Solr are adding them to a single internal buffer.
 Committing flushes that buffer.

 So if thread 1 gets an error on commit, it will presumably
 have some documents from thread 2 in the commit. But
 thread 2 won't necessarily see the results. So I don't think
 your statement about needing to know if a commit fails
 is really


 On Tue, Apr 12, 2011 at 8:50 AM, Phong Dais phong.gd...@gmail.com wrote:

 Hi,

 I did not want to hijack this thread (
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg34181.html)
 but I am experiencing the same exact problem mentioned here.

 To sum up the issue, I am getting intermittent Unavailable Service
 exception during indexing commit phase.
 I know that I am calling commit very often but I do not see any way
 around
 this.  This is my situation, I am
 indexing a huge amount of documents using multiple instance of SolrJ
 client
 running on multiple servers.  There is no way
 for me control when commit is called from these clients, so two
 different
 clients can call commit at the same time.
 I am not sure if I can/should use auto/timed commit because I need to know
 if a commit failed so I can rollback the batch that failed.

 What kind of options do I have?
 Should I try to catch the exception and keep trying to recommit until it
 goes through?  I can see some potential of problems with this approach.
 Do I need to write a request broker to queue up all these commit and send
 them to solr one by one in a timely manner?

 Just wanted to know if anyone has a solution for this problem before I
 dive
 off the deep end.

 Thanks,
 Phong





Searching during postcommit

2011-04-12 Thread Reeza Edah Tally
Hi,

 

I have been trying to perform a search using a CommonsHttpSolrServer when my
postCommit event listener is called.

I am not able to find the documents just committed; the "post" in postCommit
made me assume that I would. It seems that the commit only takes effect once
all postCommit listeners have returned.

 

Am I missing something or is there another way I can do this?

 

Thanks,

Reeza



Re: function query apply only in the subset of the query

2011-04-12 Thread Erik Hatcher
Try using AND (or set q.op):

   q=car+AND+_val_:marketValue
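
A dismax boost function is another way to add a field value to the score of
only those docs that match the main query (the qf field here is just an
example):

   q=car&defType=dismax&qf=name&bf=marketValue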

On Apr 12, 2011, at 07:11 , Marco Martinez wrote:

 Hi everyone,
 
 My situation is the next, I need to sum the value of a field to the score to
 the docs returned in the query, but not to all the docs, example:
 
 q=car returns 3 docs
 
 1-
 name=car ford
 marketValue=1
 score=1.3
 
 2-
 name=car citroen
 marketValue=2
 score=1.3
 
 3-
 name=car mercedes
 marketValue=0.5
 score=1.3
 
 but if want to sum the marketValue to the score, my returned list is the
 next:
 
 q=car+_val_:marketValue
 
 1-
 name=bus
 marketValue=5
 score=5
 
 2-
 name=car citroen
 marketValue=2
 score=3.3
 
 3-
 name=car ford
 marketValue=1
 score=2.3
 
 4-
 name=car mercedes
 marketValue=0.5
 score=1.8
 
 
 Its possible to apply the function query only to the documents returned in
 the first query?
 
 
 Thanks in advance,
 
 Marco Martínez Bautista
 http://www.paradigmatecnologico.com
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón
 Tel.: 91 352 59 42



Analysing all tokens in a stream

2011-04-12 Thread bjornbear
Hi

I would like to build a component that, during indexing, analyses all tokens
in a stream and adds metadata to a new field based on my analysis. I have
different tasks that I would like to perform, like basic classification and
certain more advanced phrase detections. How would I do this? A normal
TokenFilter can only look at one token at a time, but I need access to a
larger context.

I've noticed that there is a TeeSinkTokenFilter that might be useful in some
way, since it "is also useful for doing things like entity extraction or
proper noun analysis", but I don't understand how.

Can someone help me with some super-simple stub or similar? What I'm looking
for is something like:

class MySmartFilter  {

  public AnalyzeTokens(tokenList)
 {
   metadataTokens = DoTheAnalysis(tokenList);
   AddToField(metadata, metadataTokens);
 }
}
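
To be concrete, I imagine something along these lines (a rough, untested sketch
against the Lucene 3.x analysis API; doTheAnalysis is the placeholder for my
classification/phrase logic, and writing results to a separate field would
probably need an UpdateRequestProcessor rather than the filter itself):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource;

public final class BufferingAnalysisFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private List<AttributeSource.State> buffered;   // captured token states
  private Iterator<AttributeSource.State> replay;

  public BufferingAnalysisFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (buffered == null) {
      // first call: pull the whole stream into memory so the analysis
      // can see all tokens at once
      buffered = new ArrayList<AttributeSource.State>();
      List<String> terms = new ArrayList<String>();
      while (input.incrementToken()) {
        buffered.add(captureState());
        terms.add(termAtt.toString());
      }
      doTheAnalysis(terms);                        // whole-stream analysis hook
      replay = buffered.iterator();
    }
    if (replay.hasNext()) {
      restoreState(replay.next());                 // replay tokens unchanged
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    buffered = null;
    replay = null;
  }

  private void doTheAnalysis(List<String> terms) {
    // classification / phrase detection over the full token list goes here
  }
}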

Any help is much appreciated!
Thanks
/Bjorn

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Analysing-all-tokens-in-a-stream-tp2811516p2811516.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: AbstractSolrTestCase and Solr 3.1.0

2011-04-12 Thread Tommaso Teofili
Thanks Robert, that was very useful :)
Tommaso

2011/4/12 Robert Muir rcm...@gmail.com

 On Tue, Apr 12, 2011 at 6:44 AM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  Hi all,
  I am porting a previously series of Solr plugins developed for 1.4.1
 version
  to 3.1.0, I've written some integration tests extending the
  AbstractSolrTestCase [1] utility class but now it seems that wasn't
 included
  in the solr-core 3.1.0 artifact as it's in the solr/src/test directory.
 Was
  that a choice for the release or it's me missing something (or both)?
 Should
  I replace it with a different class with same scope or should I
 refactor
  my integration tests in a different way?
  Thanks in advance for any feedback.

 Hi Tommaso:

 this class (and other test code) was changed to depend upon lucene's
 test code... due to this it moved to src/test. The issue to make a
 solr test-framework jar file didnt make 3.1,
 https://issues.apache.org/jira/browse/SOLR-2061, however its committed
 to the 3.x branch.

 note: the class as is in solr 3.1 is un-extendable by an outside
 project I think, I cleaned up these classes some in SOLR-2061 and
 tested it all with an external project, so it should be ok now in the
 branch.



Re: function query apply only in the subset of the query

2011-04-12 Thread Marco Martinez
Thanks, but I tried this and saw that it works in a standard scenario. In my
query, however, I use my own query parser, and it seems the AND is not applied
and all the docs in the index are returned:

My query:
_query_:"{!bm25}car" AND _val_:marketValue - 67000 docs returned


Solr query parser:
car AND _val_:marketValue - 300 docs returned


Thanks,


Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2011/4/12 Erik Hatcher erik.hatc...@gmail.com

 Try using AND (or set q.op):

   q=car+AND+_val_:marketValue

 On Apr 12, 2011, at 07:11 , Marco Martinez wrote:

  Hi everyone,
 
  My situation is the next, I need to sum the value of a field to the score
 to
  the docs returned in the query, but not to all the docs, example:
 
  q=car returns 3 docs
 
  1-
  name=car ford
  marketValue=1
  score=1.3
 
  2-
  name=car citroen
  marketValue=2
  score=1.3
 
  3-
  name=car mercedes
  marketValue=0.5
  score=1.3
 
  but if want to sum the marketValue to the score, my returned list is the
  next:
 
  q=car+_val_:marketValue
 
  1-
  name=bus
  marketValue=5
  score=5
 
  2-
  name=car citroen
  marketValue=2
  score=3.3
 
  3-
  name=car ford
  marketValue=1
  score=2.3
 
  4-
  name=car mercedes
  marketValue=0.5
  score=1.8
 
 
  Its possible to apply the function query only to the documents returned
 in
  the first query?
 
 
  Thanks in advance,
 
  Marco Martínez Bautista
  http://www.paradigmatecnologico.com
  Avenida de Europa, 26. Ática 5. 3ª Planta
  28224 Pozuelo de Alarcón
  Tel.: 91 352 59 42




Re: XML not coming through from nabble to Gmail

2011-04-12 Thread Chris Hostetter
: 
: Here's the nabble URL:
: 
: 
http://lucene.472066.n3.nabble.com/Strip-spaces-and-new-line-characters-from-data-tp2795453p2795453.html
: 
: The message in the Solr list is from alexei on 8-April. Strip spaces and
: newline characters from data.

And the raw message as received by apache...

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201104.mbox/raw/%3c1302272763508-2795453.p...@n3.nabble.com%3E

...no XML.

So whatever the problem is, it's the mail client (ie: nabble)


-Hoss


Re: Updates during Optimize

2011-04-12 Thread Shawn Heisey

On 4/12/2011 6:21 AM, stockii wrote:

Hello.

When is start an optimize (which takes more than 4 hours) no updates from
DIH are possible.
i thougt solr is copy the hole index and then start an optimize from the
copy and not lock the index and optimize this ... =(

any way to do both in the same time ?


You can't index and optimize at the same time, and I'm pretty sure that 
there isn't any way to make it possible that wouldn't involve a major 
rewrite of Lucene, and possibly Solr.  The devs would have to say 
differently if my understanding is wrong.


The optimize takes place at the Lucene level.  I can't give you much 
in-depth information, but I can give you some high level stuff.  What 
it's doing is equivalent to a merge, down to one segment.  This is not 
the same as a straight file copy.  It must read the entire Lucene data 
structure and build a new one from scratch.  The process removes deleted 
documents and will also upgrade the version number of the index if it 
was written with an older version of Lucene.  It's very likely that the 
reading side of the process is nearly as comprehensive as the CheckIndex 
program, but it also has to write out a new index segment.


The net result -- the process gives your CPU and especially your I/O 
subsystem a workout, simultaneously.  If you were to make your I/O 
subsystem faster, you would probably see a major improvement in your 
optimize times.


On my installation, it takes about 11 minutes to optimize one of my 16GB 
shards, each with 9 million docs.  These live in virtual machines that 
are stored on a six-drive RAID10 array using 7200RPM SATA disks.  One of 
my pie-in-the-sky upgrade dreams is to replace that with a four-drive 
RAID10 array using SSD, the other two drives would be regular SATA -- a 
mirrored OS partition.


Thanks,
Shawn



Re: Updates during Optimize

2011-04-12 Thread Jason Rutherglen
You can index and optimize at the same time.  The current limitation
or pause is when the ram buffer is flushing to disk, however that's
changing with the DocumentsWriterPerThread implementation, eg,
LUCENE-2324.

On Tue, Apr 12, 2011 at 8:34 AM, Shawn Heisey s...@elyograg.org wrote:
 On 4/12/2011 6:21 AM, stockii wrote:

 Hello.

 When is start an optimize (which takes more than 4 hours) no updates from
 DIH are possible.
 i thougt solr is copy the hole index and then start an optimize from the
 copy and not lock the index and optimize this ... =(

 any way to do both in the same time ?

 You can't index and optimize at the same time, and I'm pretty sure that
 there isn't any way to make it possible that wouldn't involve a major
 rewrite of Lucene, and possibly Solr.  The devs would have to say
 differently if my understanding is wrong.

 The optimize takes place at the Lucene level.  I can't give you much
 in-depth information, but I can give you some high level stuff.  What it's
 doing is equivalent to a merge, down to one segment.  This is not the same
 as a straight file copy.  It must read the entire Lucene data structure and
 build a new one from scratch.  The process removes deleted documents and
 will also upgrade the version number of the index if it was written with an
 older version of Lucene.  It's very likely that the reading side of the
 process is nearly as comprehensive as the CheckIndex program, but it also
 has to write out a new index segment.

 The net result -- the process gives your CPU and especially your I/O
 subsystem a workout, simultaneously.  If you were to make your I/O subsystem
 faster, you would probably see a major improvement in your optimize times.

 On my installation, it takes about 11 minutes to optimize one my 16GB
 shards, each with 9 million docs.  These live in virtual machines that are
 stored on a six-drive RAID10 array using 7200RPM SATA disks.  One of my
 pie-in-the-sky upgrade dreams is to replace that with a four-drive RAID10
 array using SSD, the other two drives would be regular SATA -- a mirrored OS
 partition.

 Thanks,
 Shawn




Re: Fwd: machine tags, copy fields and pattern tokenizers

2011-04-12 Thread straup
I'm not sure it's a 100% solution, but the new path hierarchy tokenizer 
seems promising. I've only played with it a little bit, with a little too much 
booze and not enough sleep (in the sky), so apologies for the 
potty-mouth-ness of this blog post.


http://www.aaronland.info/weblog/2011/04/02/status/#sky
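
Something along these lines in schema.xml is what I was poking at (the type
name and the ":" delimiter are just my guesses at machine-tag style
namespace:predicate=value, not a tested config), so a tag like geo:lat=57.64
also indexes its "geo" prefix:

<fieldType name="machine_tag" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter=":"/>
  </analyzer>
</fieldType>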

Cheers,

On 3/29/11 6:00 PM, sukhdev wrote:

Hi,

Were you able to solve the machine tag problem in Solr? Actually, I am
also looking at whether machine tags can be indexed in Solr and
searched in an efficient way.

Regards


-- View this message in context:
http://lucene.472066.n3.nabble.com/Fwd-machine-tags-copy-fields-and-pattern-tokenizers-tp506491p2751745.html



Sent from the Solr - User mailing list archive at Nabble.com.






Solr 1.30 Collection Distribution Search

2011-04-12 Thread Li Tan
I have 1 master and 2 slaves set up with 1.30 collection distribution. My
frontend web application queries the master; do I need to change any code in
the web application to query the slaves, or does the master query the slaves
automatically? Please help, thx.


Re: SolrException: Unavailable Service

2011-04-12 Thread Phong Dais
Erick,

My setup is not quite the way you described. I have multiple threads indexing
simultaneously, but I only have 1 thread doing the commit after all indexing
threads have finished. I have multiple instances of this running, each in its
own Java VM. I'm OK with throwing out all the docs indexed so far if the
commit fails.

I did not know that the recommended procedure is to use autocommit. I will
explore this avenue. I was not aware of the master/slave setup either.

The first thing that comes to mind is: how do I know which docs did not get
committed if the autocommit ever fails? What is the recommended procedure for
handling failure? Any failed docs will need to be indexed at some point in
the future.

Thanks for the valuable inputs.

Phong


On Tue, Apr 12, 2011 at 9:09 AM, Erick Erickson erickerick...@gmail.comwrote:

 Sorry, fat fingers. Sent that last e-mail inadvertently.

 Anyway, if I have this correct, I'd recommend going to
 autocommit and NOT committing from the clients. That's
 usually the recommended procedure.

 This is especially true if you have a master/slave setup,
 because each commit from each client will trigger
 (potentially) a replication.

 Best
 Erick

 On Tue, Apr 12, 2011 at 9:07 AM, Erick Erickson erickerick...@gmail.com
 wrote:

  If your commit from the client fails, you don't really know the
  state of your index anyway. All the threads you have sending
  documents to Solr are adding them to a single internal buffer.
  Committing flushes that buffer.
 
  So if thread 1 gets an error on commit, it will presumably
  have some documents from thread 2 in the commit. But
  thread 2 won't necessarily see the results. So I don't think
  your statement about needing to know if a commit fails
  is really
 
 
  On Tue, Apr 12, 2011 at 8:50 AM, Phong Dais phong.gd...@gmail.com
 wrote:
 
  Hi,
 
  I did not want to hijack this thread (
  http://www.mail-archive.com/solr-user@lucene.apache.org/msg34181.html)
  but I am experiencing the same exact problem mentioned here.
 
  To sum up the issue, I am getting intermittent Unavailable Service
  exception during indexing commit phase.
  I know that I am calling commit very often but I do not see any way
  around
  this.  This is my situation, I am
  indexing a huge amount of documents using multiple instance of SolrJ
  client
  running on multiple servers.  There is no way
  for me control when commit is called from these clients, so two
  different
  clients can call commit at the same time.
  I am not sure if I can/should use auto/timed commit because I need to
 know
  if a commit failed so I can rollback the batch that failed.
 
  What kind of options do I have?
  Should I try to catch the exception and keep trying to recommit until
 it
  goes through?  I can see some potential of problems with this approach.
  Do I need to write a request broker to queue up all these commit and
 send
  them to solr one by one in a timely manner?
 
  Just wanted to know if anyone has a solution for this problem before I
  dive
  off the deep end.
 
  Thanks,
  Phong
 
 
 



Re: Solr 1.30 Collection Distribution Search

2011-04-12 Thread Erick Erickson
Yes. You need to put, say, a load balancer in front of your slaves
and distribute the requests to the slaves.

Best
Erick

On Tue, Apr 12, 2011 at 2:20 PM, Li Tan litan1...@gmail.com wrote:

 I have 1 master, and 2 slaves setup with 1.30 collection distribution. My
 frontwed web application does query to the master, do I need to change any
 code in the web application to query on the slaves? or does the master
 requests query from the slaves automatcially? Please help thx.



Re: SolrException: Unavailable Service

2011-04-12 Thread Erick Erickson
See below:

On Tue, Apr 12, 2011 at 2:21 PM, Phong Dais phong.gd...@gmail.com wrote:

 Erick,

 My setup is not quite the way you described.  I have multiple threads
 indexing simultaneously, but I only have 1 thread doing the commit after
 all
 indexing threads finished.  I have multiple instances of this running each
 in their own java vm.  I'm ok with throwing out all the docs indexed so far
 if the commit fail.


But this is really the same thing. On the back end, Solr is piping them all
into a common index, and that is where the autocommit happens.

The fact that it's happening in separate JVMs doesn't alter the concept; you
should let autocommit handle things. The problem here is knowing what hasn't
been indexed.


 I did not know that the recommended procedure is to use auto commit.  I
 will
 explore this avenue.  I was not aware of the master slave setup neither.

 The first thing that comes to mind is how do I know which docs did not get
 committed if the auto commit ever fails?  What is the recommended procedure
 for handling failure?  Any failed docs will need to be index at some point
 in the future.


Assuming that you have a uniqueKey defined, you can look at the logs to
see failures.
Then you can choose to re-index all the documents that have changed around
that time (backing up as far as you need to, to be safe). The key here is that
you can re-index and the old copy (if any) will be replaced by the re-indexed
copy.

There's nothing really built into Solr that does this for you; you really
have to build this part yourself.

Best
Erick



 Thanks for the valuable inputs.

 Phong


 On Tue, Apr 12, 2011 at 9:09 AM, Erick Erickson erickerick...@gmail.com
 wrote:

  Sorry, fat fingers. Sent that last e-mail inadvertently.
 
  Anyway, if I have this correct, I'd recommend going to
  autocommit and NOT committing from the clients. That's
  usually the recommended procedure.
 
  This is especially true if you have a master/slave setup,
  because each commit from each client will trigger
  (potentially) a replication.
 
  Best
  Erick
 
  On Tue, Apr 12, 2011 at 9:07 AM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   If your commit from the client fails, you don't really know the
   state of your index anyway. All the threads you have sending
   documents to Solr are adding them to a single internal buffer.
   Committing flushes that buffer.
  
   So if thread 1 gets an error on commit, it will presumably
   have some documents from thread 2 in the commit. But
   thread 2 won't necessarily see the results. So I don't think
   your statement about needing to know if a commit fails
   is really
  
  
   On Tue, Apr 12, 2011 at 8:50 AM, Phong Dais phong.gd...@gmail.com
  wrote:
  
   Hi,
  
   I did not want to hijack this thread (
   http://www.mail-archive.com/solr-user@lucene.apache.org/msg34181.html
 )
   but I am experiencing the same exact problem mentioned here.
  
   To sum up the issue, I am getting intermittent Unavailable Service
   exception during indexing commit phase.
   I know that I am calling commit very often but I do not see any way
   around
   this.  This is my situation, I am
   indexing a huge amount of documents using multiple instance of SolrJ
   client
   running on multiple servers.  There is no way
   for me control when commit is called from these clients, so two
   different
   clients can call commit at the same time.
   I am not sure if I can/should use auto/timed commit because I need to
  know
   if a commit failed so I can rollback the batch that failed.
  
   What kind of options do I have?
   Should I try to catch the exception and keep trying to recommit
 until
  it
   goes through?  I can see some potential of problems with this
 approach.
   Do I need to write a request broker to queue up all these commit and
  send
   them to solr one by one in a timely manner?
  
   Just wanted to know if anyone has a solution for this problem before I
   dive
   off the deep end.
  
   Thanks,
   Phong
  
  
  
 



Spellchecking in the Chinese Lanugage

2011-04-12 Thread alexw
Hi,

I have been trying to get spellcheck to work in the Chinese language. So far
I have not had any luck. Can someone shed some light here, as a general
guideline in terms of what needs to happen?

I am using the CJKAnalyzer in the text field type and searching works fine,
but spelling does not work. Here are the things I have tried:

1. Put CJKAnalyzer in the textSpell field type.
2. Set the characterEncoding param to utf-8 in the spellcheck search
component.
3. Using Luke, I can see the Chinese characters in the spell field in the
main index.
4. After building the spelling index, I don't see Chinese characters in the
spellchecker index, only terms in English.
5. Tried adding the NGramFilterFactory to the CJKAnalyzer with no luck
either.

Thanks!


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellchecking-in-the-Chinese-Lanugage-tp2812726p2812726.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Flickr and Panaramio

2011-04-12 Thread Estrada Groups
Did this go to the list? I think I may need to resubscribe...

Sent from my iPhone

On Apr 12, 2011, at 12:55 AM, Estrada Groups estrada.adam.gro...@gmail.com 
wrote:

 Has anyone tried doing this? Got any tips for someone getting started?
 
 Thanks,
 Adam
 
 Sent from my iPhone


Re: Solr 1.30 Collection Distribution Search

2011-04-12 Thread Li
Thanks Erick, I thought the master does this automatically when you set up
collection distribution. I wish there were more documentation for 1.3
collection distribution. Do you know how to show the slave stats on the master
admin page, in the distribution tab? Thanks in advance guys.

Sent from my iPhone

On Apr 12, 2011, at 11:47 AM, Erick Erickson erickerick...@gmail.com wrote:

 Yes. You need to put, say, a load balancer on front of your slaves
 and distribute the requests to the slave.
 
 Best
 Erick
 
 On Tue, Apr 12, 2011 at 2:20 PM, Li Tan litan1...@gmail.com wrote:
 
 I have 1 master, and 2 slaves setup with 1.30 collection distribution. My
 frontwed web application does query to the master, do I need to change any
 code in the web application to query on the slaves? or does the master
 requests query from the slaves automatcially? Please help thx.
 


Re: Indexing Flickr and Panaramio

2011-04-12 Thread Otis Gospodnetic
It did: http://search-lucene.com/?q=panaramio

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Estrada Groups estrada.adam.gro...@gmail.com
 To: Estrada Groups estrada.adam.gro...@gmail.com
 Cc: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Tue, April 12, 2011 3:14:56 PM
 Subject: Re: Indexing Flickr and Panaramio
 
 Did this go to the list? I think I may need to resubscribe...
 
 Sent from  my iPhone
 
 On Apr 12, 2011, at 12:55 AM, Estrada Groups estrada.adam.gro...@gmail.com  
wrote:
 
  Has anyone tried doing this? Got any tips for someone  getting started?
  
  Thanks,
  Adam
  
  Sent  from my iPhone
 


Re: Spellchecking in the Chinese Language

2011-04-12 Thread Otis Gospodnetic
Hi,

Does spellchecking in Chinese actually make sense?  I once asked a native 
Chinese speaker about that and the person told me it didn't really make sense.
Anyhow, with n-grams, I don't think this could technically work even if it made 
sense for Chinese, could it?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: alexw aw...@crossview.com
 To: solr-user@lucene.apache.org
 Sent: Tue, April 12, 2011 3:07:48 PM
 Subject: Spellchecking in the Chinese Language
 
 Hi,
 
 I have been trying to get spellcheck to work in the Chinese language.  So far
 I have not had any luck. Can someone shed some light here as a general  guide
 line in terms of what need to happen?
 
 I am using the CJKAnalyzer  in the text field type and searching works fine,
 but spelling does not work.  Here are the things I have tried:
 
 1. Put CJKAnalyzer in the textSpell  field type.
 2. Set the characterEncoding param to utf-8 in the spellcheck  search
 component.
 3. Using Luke, I can see the Chinese characters in the  spell field in the
 main index.
 4. After building the spelling index, I  don't see Chinese characters in the
 spellchecker index, only terms in  English.
 5. Tried adding the NGramFilterFactory to the CJKAnalyzer with no  luck
 either.
 
 Thanks!
 
 
 --
 View this message in context: 
http://lucene.472066.n3.nabble.com/Spellchecking-in-the-Chinese-Lanugage-tp2812726p2812726.html

 Sent  from the Solr - User mailing list archive at Nabble.com.
 


Re: Searching during postcommit

2011-04-12 Thread Otis Gospodnetic
If I follow things correctly, I think you should be seeing new documents only 
after the commit is done and the new index searcher is open and available for 
search.  If you are searching before the new searcher is available, you are 
probably still hitting the old searcher.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Reeza Edah Tally re...@nova-hub.com
 To: solr-user@lucene.apache.org
 Sent: Tue, April 12, 2011 9:25:59 AM
 Subject: Searching during postcommit
 
 Hi,
 
 
 
 I have been trying to perform a search using a  CommonsHttpSolrServer when my
 postCommit event listener is called.
 
  I am not able to find the documents just committed; the "post" in postCommit
  caused me to assume that I would; it seems that the commit only takes effect
  when all postCommit listeners have returned.
 
 
 
 Am I missing  something or is there another way I can do this?
 
 
 
 Thanks,
 
 Reeza
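
One thing worth noting here: a postCommit listener fires before the new searcher has been registered (which matches what Reeza is seeing), but queries configured under a newSearcher event run against the searcher that is being opened, so they do see the just-committed documents. A rough sketch for solrconfig.xml; the probe query is made up:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- these queries run against the NEW searcher, so they already
           see the documents from the commit that triggered it -->
      <lst>
        <str name="q">some_probe_query</str>
        <str name="start">0</str>
        <str name="rows">1</str>
      </lst>
    </arr>
  </listener>

If the goal is instead to search over HTTP right after an external commit, calling commit with waitSearcher=true and only searching after that call returns is the usual approach.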
 
 


Re: Indexing Flickr and Panaramio

2011-04-12 Thread Péter Király
Hi,

I indexed Flickr into Lucene about 3 years ago. There is a Flickr API,
which covers almost everything you need (as I remember, not every
Flickr feature was implemented in the API at that time; for example,
collections were not searchable). You can harvest by user ID or by
searching for a topic. You can use a language library (PHP, Java, etc.)
to wrap the details of the communication. You will probably want to
merge information into one entity before sending it to Solr (such as
merging the user, collection and set info into each picture). The
last step is to transform this information into a Solr document (again,
either directly or with a language library). I am not sure if this helps
you, but if you ask a more specific question, I will try to answer.

regards,
Péter

2011/4/12 Estrada Groups estrada.adam.gro...@gmail.com:
 Has anyone tried doing this? Got any tips for someone getting started?

 Thanks,
 Adam

 Sent from my iPhone
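
To make the flatten-and-index step Péter describes concrete, here is a rough SolrJ sketch. The Flickr-side harvesting is left out (use whichever Flickr API client you prefer), and every field name below (photo_id, title, owner, tags, lat, lon) is invented for illustration and would have to exist in your schema.xml:

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Sketch only: merge the harvested photo, owner and set metadata
  // into one flat Solr document and send it to the server.
  public class FlickrPhotoIndexer {
      private final CommonsHttpSolrServer solr;

      public FlickrPhotoIndexer(String solrUrl) throws Exception {
          this.solr = new CommonsHttpSolrServer(solrUrl);
      }

      public void index(String photoId, String title, String ownerName,
                        String[] tags, Double lat, Double lon) throws Exception {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("photo_id", photoId);     // uniqueKey
          doc.addField("title", title);
          doc.addField("owner", ownerName);      // merged-in user info
          for (String tag : tags) {
              doc.addField("tags", tag);         // multiValued field
          }
          if (lat != null && lon != null) {
              doc.addField("lat", lat);          // the geospatial data Adam asked about
              doc.addField("lon", lon);
          }
          solr.add(doc);
          // commit in batches elsewhere, not per document
      }
  }

Batching the adds and committing periodically keeps the indexing side cheap; the geospatial fields only help if the schema and query side are set up for them.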



Re: function query apply only in the subset of the query

2011-04-12 Thread Yonik Seeley
On Tue, Apr 12, 2011 at 10:25 AM, Marco Martinez
mmarti...@paradigmatecnologico.com wrote:
 Thanks, but I tried this and saw that it works in a standard scenario, but
 in my query I use my own query parser and it seems that the AND is not
 being applied and all the docs in the index are returned:

 My query:
 _query_:{!bm25}car AND _val_:marketValue - 67000 docs returned

This would seem to point to your generated query {!bm25}car
matching all docs for some reason?

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco
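
For anyone reproducing this: the nested-query and function hooks are normally written with quotes (which the mail archive may simply have stripped), and debugQuery shows what the custom parser actually generates, for example:

  q = _query_:"{!bm25}car" AND _val_:"marketValue"
  debugQuery = true

If the parsedquery section of the debug output shows {!bm25}car expanding to something that matches every document, that would confirm Yonik's suspicion and put the problem in the custom parser rather than in the AND.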


Re: Spellchecking in the Chinese Language

2011-04-12 Thread Luke Lu
It doesn't make sense to spell check individual character-sized words,
but it makes a lot of sense for phrases. Due to the pervasive use of
pinyin input methods, it's very easy to write phrases that are totally
wrong semantically but sound correct. n-grams should work if they don't
mangle the characters.

On Tue, Apr 12, 2011 at 12:47 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hi,

 Does spellchecking in Chinese actually make sense?  I once asked a native
 Chinese speaker about that and the person told me it didn't really make sense.
 Anyhow, with n-grams, I don't think this could technically work even if it 
 made
 sense for Chinese, could it?

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
 From: alexw aw...@crossview.com
 To: solr-user@lucene.apache.org
 Sent: Tue, April 12, 2011 3:07:48 PM
 Subject: Spellchecking in the Chinese Language

 Hi,

 I have been trying to get spellcheck to work in the Chinese language.  So far
 I have not had any luck. Can someone shed some light here as a general  guide
 line in terms of what need to happen?

 I am using the CJKAnalyzer  in the text field type and searching works fine,
 but spelling does not work.  Here are the things I have tried:

 1. Put CJKAnalyzer in the textSpell  field type.
 2. Set the characterEncoding param to utf-8 in the spellcheck  search
 component.
 3. Using Luke, I can see the Chinese characters in the  spell field in the
 main index.
 4. After building the spelling index, I  don't see Chinese characters in the
 spellchecker index, only terms in  English.
 5. Tried adding the NGramFilterFactory to the CJKAnalyzer with no  luck
 either.

 Thanks!


 --
 View this message in context:
http://lucene.472066.n3.nabble.com/Spellchecking-in-the-Chinese-Lanugage-tp2812726p2812726.html

 Sent  from the Solr - User mailing list archive at Nabble.com.




Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-12 Thread Renee Sun
Hi Hoss,
Thanks for your response...

You are right, I had a typo in my question, but I did use maxSegments, and
here is the exact URL I used:

 curl
'http://localhost:8080/solr/97/update?optimize=true&maxSegments=10&waitFlush=true'

I used jconsole and du -sk to monitor each partial optimize, and I am sure
the optimize was done. It always reduced the segment files from 130+ to 65+
when I started with maxSegments=10; when I ran it again with maxSegments=9,
it reduced them to somewhere around 50.

When I use maxSegments=2, it always reduces the segments to 18, and
maxSegments=1 (a full optimize) always reduces the core to 10 segment files.

This has been repeated about a dozen times.

I think the resulting file count depends on the size of the core. I have a
core that takes 10GB of disk space, and it has 4 million documents.

It perhaps also depends on other solr/lucene configurations? Let me know if
I should give you any data from our solr config.

Here is the actual data from the test I ran lately, for your reference. You
can see that it definitely finished each partial optimize, and the time spent
is also included (please note I am using a core id there which is different
from yours):

/tmp # ls /xxx/solr/data/32455077/index | wc   --- this is the
start point, 150 seg files
 150  150 946
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=10&waitFlush=true'
real    0m36.050s
user    0m0.002s
sys     0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc   --- after first
partial optimize (10), reduce to 82
 82  82 746
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=9&waitFlush=true'
real    1m54.364s
user    0m0.003s
sys     0m0.002s

/tmp # ls /xxx/solr/data/32455077/index | wc
 74  74 674
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=8&waitFlush=true'
real    2m0.443s
user    0m0.002s
sys     0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc
 66  66 602
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=7&waitFlush=true'
<?xml version="1.0" encoding="UTF-8"?>
real    3m22.201s
user    0m0.002s
sys     0m0ls

/tmp # ls /xxx/solr/data/32455077/index | wc
 58  58 530
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=6&w
real    3m29.277s
user    0m0.001s
sys     0m0.004s

/tmp # ls /xxx/solr/data/32455077/index | wc
 50  50 458
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=5&w
real    3m41.514s
user    0m0.003s
sys     0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc
 42  42 386
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=4&w
real    5m35.697s
user    0m0.003s
sys     0m0.004s

/tmp # ls /xxx/solr/data/32455077/index | wc
 34  34 314
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=3&wa
real    7m8.773s
user    0m0.003s
sys     0m0.002s

/tmp # ls /xxx/solr/data/32455077/index | wc
 26  26 242
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=2&w
real    9m18.814s
user    0m0.004s
sys     0m0.001s

/tmp # ls /xxx/solr/data/32455077/index | wc
 18  18 170
/tmp # time curl
'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=1&w
(full optimize)
real    16m6.599s
user    0m0.003s
sys     0m0.004s

Disk Space Usage:
first 3 runs took about 20% extra 
middle couple runs took about 50% extra 
last full optimize took 100% extra


--
View this message in context: 
http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2812415.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellchecking in the Chinese Language

2011-04-12 Thread alexw
Thanks Otis and Luke.

Yes it does make sense to spellcheck phrases in Chinese. Looks like the
default Solr spellCheck component is already doing some kind of NGram-ing.
When examining the spellCheck index, I did see gram1, gram2, gram3, gram4...
The problem is no Chinese terms were indexed into the spellChecker index,
only English terms.

Regards,

Alex

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellchecking-in-the-Chinese-Lanugage-tp2812726p2813149.html
Sent from the Solr - User mailing list archive at Nabble.com.
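
Since the spellchecker index only contains what the analyzer of its source field produces, one thing to try is a dedicated spelling field whose analysis keeps the CJK bigrams, with a spellchecker pointed at it. A rough sketch; all names (textSpellCJK, spell_zh, the copyField source, the cjk dictionary) are made up, and whether the Chinese grams then actually show up in the spelling index is exactly the open question in this thread:

  <!-- schema.xml: analysis for the spelling field keeps CJK bigrams -->
  <fieldType name="textSpellCJK" class="solr.TextField" positionIncrementGap="100">
    <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
  </fieldType>

  <field name="spell_zh" type="textSpellCJK" indexed="true" stored="false"/>
  <copyField source="title" dest="spell_zh"/>

  <!-- solrconfig.xml: point a spellchecker at that field -->
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpellCJK</str>
    <lst name="spellchecker">
      <str name="name">cjk</str>
      <str name="field">spell_zh</str>
      <str name="spellcheckIndexDir">./spellchecker_cjk</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

At query time, spellcheck.dictionary=cjk selects this dictionary and spellcheck.build=true (once) rebuilds it.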


Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-12 Thread Chris Hostetter

: /tmp # ls /xxx/solr/data/32455077/index | wc   --- this is the 
start point, 150 seg files
:  150  150 946
: /tmp # time curl


the number of files in the index directory is not the number of 
segments.

The number of segments is an internal lucene concept that impacts the 
number of files, but it is not an actual file count.  A segment can 
consist of multiple files depending on how your schema.xml is configured 
(and whether you are using the compound file format).

You can see the current number of segments by looking at the stats page...

http://localhost:8983/solr/admin/stats.jsp
SolrIndexReader{this=64a7c45e,r=ReadOnlyDirectoryReader@64a7c45e,refCnt=1,segments=10}
 

...that's from the solr example, where the index directory at the 
time of that request actually contained 93 files.


-Hoss
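
A quick way to watch the real segment count between the partial optimizes, rather than counting files, is to pull it out of the stats page Hoss mentions; the path depends on your core setup and -o is a GNU grep flag:

  curl -s 'http://localhost:8080/solr/32455077/admin/stats.jsp' | grep -o 'segments=[0-9]*'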


Re: Indexing Flickr and Panaramio

2011-04-12 Thread Estrada Groups
Thanks Peter! I am thinking that I may just use Nutch to do the crawl and index 
off of these sites. I need to check out the APIs for each to make sure I'm not 
missing anything related to the geospatial data for each image. Obviously both 
do the extraction when the images are uploaded so I'm guessing that it's also 
stored somewhere too ;-)

Adam 

Sent from my iPhone

On Apr 12, 2011, at 4:00 PM, Péter Király kirun...@gmail.com wrote:

 Hi,
 
 I did Flickr into Lucene about 3 years ago. There is a Flickr API,
 which covers almost everything you need (as I remember, not always
 Flickr feature was implemented at that time in the API, like the
 collection was not searchable). You can harvest by user ID or
 searching for a topic. You can use a language library (PHP, Java etc.)
 to wrap the details of communication. It is possible, that you would
 like to merge information into one entity before send to Solr (like
 merging the user, collection and set info into each pictures). The
 last step is to transform this information into a Solr document (again
 either directly or with a language library). I am not sure if it helps
 you, but if you ask more specific question, I try to answer.
 
 regards,
 Péter
 
 2011/4/12 Estrada Groups estrada.adam.gro...@gmail.com:
 Has anyone tried doing this? Got any tips for someone getting started?
 
 Thanks,
 Adam
 
 Sent from my iPhone
 


Vetting Our Architecture: 2 Repeaters and Slaves.

2011-04-12 Thread Parker Johnson


I am hoping to get some feedback on the architecture I've been planning
for a medium to high volume site.  This is my first time working
with Solr, so I want to be sure what I'm planning isn't totally weird,
unsupported, etc.

We've got a pair of F5 load balancers and 4 hosts.  2 of those hosts will
be repeaters (master+slave), and 2 of those hosts will be pure slaves. One
of the F5 vips, Index-vip will have members HOST1 and HOST2, but HOST2
will be downed and not taking traffic from that vip.  The second vip,
Search-vip will have 3 members: HOST2, HOST3, and HOST4.  The
Index-vip is intended to be used to post and commit index changes.  The
Search-vip is intended to be customer facing.

Here is some ASCII art.  The line with the X's thru it denotes a
downed member of a vip, one that isn't taking any traffic.  The M:
denotes the value in the solrconfig.xml that the host uses as the master.


  Index-vip Search-vip
 / \ /   |   \
/   X   /|\
   / \ / | \
  /   X   /  |  \
 / \ /   |   \
/   X   /|\
   / \ / | \
 HOST1  HOST2  HOST3  HOST4
    REPEATER     REPEATER     SLAVE     SLAVE
   M:Index-vip  M:Index-vip  M:Index-vip  M:Index-vip


I've been working through a couple failure scenarios.  Recovering from a
failure of HOST2, HOST3, or HOST4 is pretty straightforward.  Losing
HOST1 is my major concern.  My plan for recovering from a failure of HOST1
is as follows: Enable HOST2 as a member of the Index-vip, while disabling
member HOST1.  HOST2 effectively becomes the Master.  HOST2, 3, and 4
continue fielding customer requests and pulling indexes from Index-vip.
Since HOST2 is now in charge of crunching indexes and fielding customer
requests, I assume load will increase on that box.

When we recover HOST1, we will simply make sure it has replicated against
Index-vip and then re-enable HOST1 as a member of the Index-vip and
disable HOST2.

Hopefully this makes sense.  If all goes correctly, I've managed to keep
all services up and running without losing any index data.

So, I have a few questions:

1. Has anyone else tried this dual repeater approach?
2. Am I going to have any semaphore/blocking issues if a repeater is
pulling index data from itself?
3. Is there a better way to do this?


Thanks,
Parker








Re: Vetting Our Architecture: 2 Repeaters and Slaves.

2011-04-12 Thread Erick Erickson
I think the repeaters are misleading you a bit here. The purpose of a
repeater is
usually to replicate across a slow network, say in a remote data
center, so that slaves at that center can get more timely updates. I don't
think
they add anything to your disaster recovery scenario.

So I'll ignore repeaters for a bit here. The only difference between a
master
and a slave is a bit of configuration, and usually you'll allocate, say,
memory
differently on the two machines when you start the JVM. You might disable
caches on the master (since they're used for searching). You may..

Let's say
I have master M, and slaves S1, S2, S3. The slaves have an
up-to-date index as of the last replication (just like your repeater
would have). If any slave goes down, you can simply bring up another
machine as a slave, point it at your master, wait for replication on that
slave and then let your load balancer know it's there. This is the
HOST2-4 failure you outlined

Should the master fail you have two choices,
depending upon how long you can wait for *new* content to be searchable.
Let's say you can wait half a day in this situation. Spin up a new machine,
copy the index over from one of the slaves (via a simple copy or by
replicating). Point your indexing process at the master, point your slaves
at the master for replication and you're done.

Let's say you can't wait very long at all (and remember this better be quite
a rare
event). Then you could take a slave (let's say S1) out of the loop that
serves
searches. Copy in the configuration files you use for your
masters to it, point the indexer and searchers at it and you're done.
Now spin up a new slave as above and your old configuration is back.

Note that in two of these cases, you temporarily have 2 slaves doing the
work
that 3 used to, so a bit of over-capacity may be in order.

But a really good question here is how to be sure all your data is in your
index.
After all, the slaves (and repeater for that matter) are only current up to
the last
replication. The simplest thing to do is simply re-index everything from the
last
known commit point. Assuming you have a uniqueKey defined, if you index
documents that are already in the index, they'll just be replaced, no harm
done.
So let's say your replication interval is 10 minutes (picking a number from
thin
air). When your system is back and you restart your indexer, restart
indexing from,
say, the time you noticed your master went down minus 1 hour as the restart
point for
your indexer. You can be more deterministic than this by examining the log
on
the machine you're using to replace the master with and noting the last
replication
time and subtracting your hour (or whatever) from that.

Anyway, hope I haven't confused you unduly! The take-away is that a
slave can be made into a master as fast as a repeater can, the replication
process is the same and I just don't see what a repeater buys you in the
scenario you described.

Best
Erick
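
To make the "only a bit of configuration" point concrete, here is a hedged sketch of the Java-based ReplicationHandler (Solr 1.4+) with both halves in one solrconfig.xml and every box pointing its masterUrl at the Index VIP, as in Parker's diagram; the host name is a placeholder. Promoting a box is then mostly a question of which half is active and where the VIP points:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <!-- only meaningful on the box currently behind the Index VIP -->
      <str name="replicateAfter">commit</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
    <lst name="slave">
      <!-- every box pulls from the VIP, not from a named host -->
      <str name="masterUrl">http://index-vip.example.com/solr/replication</str>
      <str name="pollInterval">00:10:00</str>
    </lst>
  </requestHandler>

Some Solr versions also support an enable flag inside the master and slave sections, driven by a system property, which makes the role switch a restart rather than a config edit; check the SolrReplication wiki page for your version.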


On Tue, Apr 12, 2011 at 6:33 PM, Parker Johnson parker_john...@gap.comwrote:



 I am hoping to get some feedback on the architecture I've been planning
 for a medium to high volume site.  This is my first time working
 with Solr, so I want to be sure what I'm planning isn't totally weird,
 unsupported, etc.

 We've got a a pair of F5 loadbalancers and 4 hosts.  2 of those hosts will
 be repeaters (master+slave), and 2 of those hosts will be pure slaves. One
 of the F5 vips, Index-vip will have members HOST1 and HOST2, but HOST2
 will be downed and not taking traffic from that vip.  The second vip,
 Search-vip will have 3 members: HOST2, HOST3, and HOST4.  The
 Index-vip is intended to be used to post and commit index changes.  The
 Search-vip is intended to be customer facing.

 Here is some ASCII art.  The line with the X's thru it denotes a
 downed member of a vip, one that isn't taking any traffic.  The M:
 denotes the value in the solrconfig.xml that the host uses as the master.


  Index-vip Search-vip
 / \ /   |   \
/   X   /|\
   / \ / | \
  /   X   /  |  \
 / \ /   |   \
/   X   /|\
   / \ / | \
 HOST1  HOST2  HOST3  HOST4
    REPEATER     REPEATER     SLAVE     SLAVE
   M:Index-vip  M:Index-vip  M:Index-vip  M:Index-vip


 I've been working through a couple failure scenarios.  Recovering from a
 failure of HOST2, HOST3, or HOST4 is pretty straightforward.  Loosing
 HOST1 is my major concern.  My plan for recovering from a failure of HOST1
 is as follows: Enable HOST2 as a member of the Index-vip, while disabling
 member HOST1.  HOST2 effectively becomes the Master.  HOST2, 3, and 4
 continue fielding customer requests and pulling indexes from Index-vip.
 Since HOST2 is now in charge of crunching 

Re: Solr and Permissions

2011-04-12 Thread Liam O'Boyle
ManifoldCF sounds like it might be the right solution, so long as it's
not secretly building a filter query in the back end; otherwise it
will hit the same limits.

In the meantime, I have made a minor improvement to my filter query;
it now scans the permitted IDs and attempts to build a filter query
using ranges (e.g. instead of 1 OR 2 OR 3 it will filter using [1 TO
3]), which will hopefully keep me going for now.

Liam
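
For what it's worth, a small sketch of that range-collapsing step, assuming the permitted IDs are distinct and live in a numeric (trie/sortable) field, here called id, so that [1 TO 3] really means a numeric range:

  import java.util.Arrays;

  // Sketch only: collapse sorted, distinct numeric IDs into range clauses,
  // e.g. id:[1 TO 3] OR id:7 OR id:[9 TO 10] instead of one clause per ID.
  public class PermissionFilterBuilder {

      public static String buildFq(long[] permittedIds) {
          long[] ids = permittedIds.clone();
          Arrays.sort(ids);
          StringBuilder fq = new StringBuilder();
          int i = 0;
          while (i < ids.length) {
              int j = i;
              while (j + 1 < ids.length && ids[j + 1] == ids[j] + 1) {
                  j++;                               // extend the contiguous run
              }
              if (fq.length() > 0) {
                  fq.append(" OR ");
              }
              if (j > i) {
                  fq.append("id:[").append(ids[i]).append(" TO ").append(ids[j]).append(']');
              } else {
                  fq.append("id:").append(ids[i]);
              }
              i = j + 1;
          }
          return fq.toString();
      }

      public static void main(String[] args) {
          // prints: id:[1 TO 3] OR id:7 OR id:[9 TO 10]
          System.out.println(buildFq(new long[] {1, 2, 3, 7, 9, 10}));
      }
  }

The number of clauses now depends on how fragmented the permitted set is rather than on its size, which is usually a big win, though a badly fragmented set can still blow past maxBooleanClauses.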

On 12 March 2011 01:46, go canal goca...@yahoo.com wrote:
 Thank you Jan, I will take a look at the MainfoldCF.
 So it seems that the solution is basically to implement something outside of
 Solr for permission control.
 thanks,
 canal




 
 From: Jan Høydahl jan@cominvent.com
 To: solr-user@lucene.apache.org
 Sent: Fri, March 11, 2011 4:17:22 PM
 Subject: Re: Solr and Permissions

 Hi,

 Talk to the ManifoldCF guys - they have successfully implemented support for
 document level security for many repositories including CMC/ECMs and may have
 some hints for you to write your own Authority connector against your system,
 which will fetch the ACL for the document and index it with the document 
 itself.
 This eliminates long query-time filters.

 Re-indexing content for which ACLs have changed is a very common way of doing
 this, and you should not worry too much about performance implications before
 there is a real issue. In real world, you don't change folder permissions very
 often, and that will be a cost you'll have to live with. If you worry that 
 this
 lag between repository state and index state may cause people to see content
 they are not entitled to, it is possible to do late binding filtering of the
 result set as well, but I would avoid that if possible.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 On 11. mars 2011, at 06.48, go canal wrote:

 To be fair, I think there is a slight difference between a Content Management
 and a Search Engine.

 Access control at per document level, per type level, supporting dynamic role
 changes, etc.are more like  content management use cases; where search 
 solution

 like Solr focuses on different set of use cases;

 But in real world, any content management systems need full text search; so 
 the

 question is to how to support search with permission control.

 JackRabbit integrated with Lucene/Tika, this could be one solution but I do 
 not

 know its performance and scalability;

 CouchDB also integrates with Lucene/Tika, another option?

 I have yet to see a Search Engine that provides some sort of Content 
 Management

 features like we are discussing here (Solr, Elastic Search ?)


 Then the last option is probably to build an application that works with a
 document repository with all necessary content management features and Solr
 which provides search capability;  and handling the permissions outside Solr?
 thanks,
 canal




 
 From: Liam O'Boyle liam.obo...@intelligencebank.com
 To: solr-user@lucene.apache.org
 Cc: go canal goca...@yahoo.com
 Sent: Fri, March 11, 2011 2:28:19 PM
 Subject: Re: Solr and Permissions

 As Canal points out,  grouping into types is not always possible.

 In our case, permissions are not on a per-type level, but either on a per
 folder (of which there can be hundreds) or per item in some cases (of
 which there can be... any number at all).

  Reindexing is also too slow to really be an option; some of the items use
 Tika to extract content, which means that we need to reextract the content
 (variable length of time; average is about half a second, but on some
 documents it will sit there until the connection times out) .  Querying it,
 modifying then resubmitting without rerunning content extraction is still
 faster, but involves sending even more data over the network; either way is
 relatively slow.

 Liam

 On 11 March 2011 16:24, go canal goca...@yahoo.com wrote:

 I have similar requirements.

 Content type is one solution; but there are also other use cases where this
 not
 enough.

 Another requirement is, when the access permission is changed, we need to
 update
 the field - my understanding is we can not unless re-index the whole
 document
 again. Am I correct?
 thanks,
 canal




 
 From: Sujit Pal sujit@comcast.net
 To: solr-user@lucene.apache.org
 Sent: Fri, March 11, 2011 10:39:27 AM
 Subject: Re: Solr and Permissions

 How about assigning content types to documents in the index, and map
 users to a set of content types they are allowed to access? That way you
 will pass in fewer parameters in the fq.

 -sujit

 On Fri, 2011-03-11 at 11:53 +1100, Liam O'Boyle wrote:
 Morning,

 We use solr to index a range of content to which, within our application,
 access is restricted by a system of user groups and permissions.  In
 order
 to ensure that search results don't reveal information about items which
 the
 user doesn't have access to, we need to 

Re: Vetting Our Architecture: 2 Repeaters and Slaves.

2011-04-12 Thread Otis Gospodnetic
Hi Parker,

 Lovely ASCII art. :)

Yes, I think you can simplify this by introducing shared storage (e.g., a SAN) 
that hosts the index to which your active/primary master writes.  When your 
primary master dies, you start your stand-by master, which is configured to point 
to the same index.  If there are any left-over index locks from the primary 
master, they can be removed (there is a property for that in solrconfig.xml) 
when Solr starts.  Your Index VIP can then be pointed to the new master.  
Slaves talk to the master via the Index VIP, so they hardly notice this.  And 
since the index is on the SAN, your slaves could actually point to that same 
index and avoid the whole replication process, thus removing one more moving 
piece, plus eliminating OS cache-unfriendly disk IO caused by index replication 
as a bonus feature.

Repeaters are handy for DR (replication to the second DC) or when you have so 
many slaves that their (very frequent) replication requests and actual index 
replication are too much for a single master, but it doesn't sound like you 
need 
them here unless you really want to have your index or even mirror the whole 
cluster setup in a second DC.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
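
A minimal sketch of the two solrconfig.xml pieces Otis refers to, with placeholder paths for the SAN mount. Only the active master should ever write to this directory, and slaves reading the same files directly would still need their searchers reopened to pick up changes:

  <!-- point the core's index at the shared SAN volume -->
  <dataDir>/mnt/san/solr/core0/data</dataDir>

  <mainIndex>
    <!-- clear a stale write lock left behind by a dead primary master -->
    <unlockOnStartup>true</unlockOnStartup>
    <lockType>simple</lockType>
  </mainIndex>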



- Original Message 
 From: Parker Johnson parker_john...@gap.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Tue, April 12, 2011 6:33:08 PM
 Subject: Vetting Our Architecture: 2 Repeaters and Slaves.
 
 
 
 I am hoping to get some feedback on the architecture I've been  planning
 for a medium to high volume site.  This is my first time  working
 with Solr, so I want to be sure what I'm planning isn't totally  weird,
 unsupported, etc.
 
 We've got a a pair of F5 loadbalancers and 4  hosts.  2 of those hosts will
 be repeaters (master+slave), and 2 of  those hosts will be pure slaves. One
 of the F5 vips, Index-vip will have  members HOST1 and HOST2, but HOST2
 will be downed and not taking traffic  from that vip.  The second vip,
 Search-vip will have 3 members: HOST2,  HOST3, and HOST4.  The
 Index-vip is intended to be used to post and  commit index changes.  The
 Search-vip is intended to be customer  facing.
 
 Here is some ASCII art.  The line with the X's thru it  denotes a
 downed member of a vip, one that isn't taking any traffic.   The M:
 denotes the value in the solrconfig.xml that the host uses as the  master.
 
 
Index-vip Search-vip
   / \  /   |   \
  /   X   /| \
/ \  / | \
/   X/  |  \
   / \ /|   \
  /   X   / |\
/  \ / |  \
  HOST1   HOST2  HOST3   HOST4
    REPEATER     REPEATER     SLAVE     SLAVE
   M:Index-vip  M:Index-vip  M:Index-vip  M:Index-vip
 
 
 I've been working through a couple failure  scenarios.  Recovering from a
 failure of HOST2, HOST3, or HOST4 is  pretty straightforward.  Loosing
 HOST1 is my major concern.  My  plan for recovering from a failure of HOST1
 is as follows: Enable HOST2 as a  member of the Index-vip, while disabling
 member HOST1.  HOST2  effectively becomes the Master.  HOST2, 3, and 4
 continue fielding  customer requests and pulling indexes from Index-vip.
 Since HOST2 is now in  charge of crunching indexes and fielding customer
 requests, I assume load  will increase on that box.
 
 When we recover HOST1, we will simply make  sure it has replicated against
 Index-vip and then re-enable HOST1 as a  member of the Index-vip and
 disable HOST2.
 
 Hopefully this makes  sense.  If all goes correctly, I've managed to keep
 all services up and  running without loosing any index data.
 
 So, I have a few  questions:
 
 1. Has anyone else tried this dual repeater approach?
 2. Am  I going to have any semaphore/blocking issues if a repeater is
 pulling index  data from itself?
 3. Is there a better way to do  this?
 
 
 Thanks,
 Parker