Query across multiple shards - key fields have different names

2013-05-03 Thread Benjamin Ryan
Hi,
  Sorry for the basic question - I can't get to the wiki to find the answer.
  Version: Solr 3.3.0
  I have two separate indexes (currently in two cores, but they can be moved
to shards).
  One core holds metadata about educational resources; the other holds usage
statistics.
  They have a common value, named id in one core and search.resourceid in
the other core.
  How can I construct a shard query (once I have moved one of the cores to a
different node) so that I can effectively get the statistics for each
educational resource, grouped by resource?
  This is an offline reporting job that needs to list the usage events for
educational resources over a time period (the usage events have a date/time
field).

Regards,
   Ben

--
Dr Ben Ryan
Jorum Technical Manager

5.12 Roscoe Building
The University of Manchester
Oxford Road
Manchester
M13 9PL
Tel: 0160 275 6039
E-mail: benjamin.r...@manchester.ac.uk
--



Re: commit in solr4 takes a longer time

2013-05-03 Thread vicky desai
Hi sandeep,

I made the changes you mentioned and tested again for the same set of docs,
but unfortunately the commit time increased.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060622.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: commit in solr4 takes a longer time

2013-05-03 Thread vicky desai
Hi Gopal,

I added the openSearcher parameter as you mentioned, but on checking the logs
I found that openSearcher was still true on commit. It is only when I removed
the autoSoftCommit parameter that the openSearcher setting took effect and
provided faster updates as well. However, I require soft commits in my
application.

Any suggestions?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060623.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: socket write error

2013-05-03 Thread Dmitry Kan
Digging in further, found this in HttpCommComponent class:

[code]
  static {
    MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
    mgr.getParams().setDefaultMaxConnectionsPerHost(20);
    mgr.getParams().setMaxTotalConnections(1);
    mgr.getParams().setConnectionTimeout(SearchHandler.connectionTimeout);
    mgr.getParams().setSoTimeout(SearchHandler.soTimeout);
    // mgr.getParams().setStaleCheckingEnabled(false);
    client = new HttpClient(mgr);
  }
[/code]

Could the value set by setDefaultMaxConnectionsPerHost(20) be too small for
80+ shards returning results to the router?
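
If that limit turns out to be the bottleneck, here is a minimal sketch of
raising it, assuming commons-httpclient 3.x (the numbers are illustrative,
not tested recommendations):

[code]
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

public class TunedClient {
  public static void main(String[] args) {
    MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
    // With 80+ shards spread over a handful of hosts, 20 connections per
    // host can starve the router; raise both limits together.
    mgr.getParams().setDefaultMaxConnectionsPerHost(100);
    mgr.getParams().setMaxTotalConnections(1000);
    HttpClient client = new HttpClient(mgr);
  }
}
[/code]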

Dmitry



On Fri, May 3, 2013 at 6:50 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi, thanks.

 Solr 3.4.
 There are POST requests everywhere: between client and router, and between
 router and shards.

 Do you do faceting across all shards? Approximately how many documents do you have?
 On 2 May 2013 22:02, Patanachai Tangchaisin 
 patanachai.tangchai...@wizecommerce.com wrote:

 Hi,

 First, which version of Solr are you using?

 I also have 60+ shards on Solr 4.2.1 and it doesn't seem to be a problem
 for me.

 - Make sure you use POST to send a query to Solr.
 - 'connection reset by peer' on the client can indicate that there is
 something wrong with the server, e.g. the server closed a connection, etc.

 --
 Patanachai

 On 05/02/2013 05:05 AM, Dmitry Kan wrote:

 After some searching around, I see this:

 http://search-lucene.com/m/ErEZUl7P5f2/%2522socket+write+error%2522&subj=Long+list+of+shards+breaks+solrj+query

 Seems like this has happened in the past with a large number of shards.

 To make it clear: the distributed search works with 20 shards.


 On Thu, May 2, 2013 at 1:57 PM, Dmitry Kan solrexp...@gmail.com wrote:

  Hi guys!

 We have solr router and shards. I see this in jetty log on the router:

 May 02, 2013 1:30:22 PM org.apache.commons.httpclient.HttpMethodDirector
 executeWithRetry
 INFO: I/O exception (java.net.SocketException) caught when processing
 request: Connection reset by peer: socket write error

 and then:

 May 02, 2013 1:30:22 PM org.apache.commons.httpclient.HttpMethodDirector
 executeWithRetry
 INFO: Retrying request

 followed by exception about Internal Server Error

 any ideas why this happens?

 We run 80+ shards distributed across several servers. Router runs on its
 own node.

 Is there anything in particular I should be looking into wrt ubuntu
 socket
 settings? Is this a known issue for solr's distributed search from the
 past?

 Thanks,
 Dmitry







Re: commit in solr4 takes a longer time

2013-05-03 Thread Sandeep Mestry
That's not ideal.
Can you post solrconfig.xml?
On 3 May 2013 07:41, vicky desai vicky.de...@germinait.com wrote:

 Hi sandeep,

 I made the changes you mentioned and tested again for the same set of docs,
 but unfortunately the commit time increased.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060622.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: commit in solr4 takes a longer time

2013-05-03 Thread vicky desai
My solrconfig.xml is as follows

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>LUCENE_40</luceneMatchVersion>
  <indexConfig>
    <maxFieldLength>2147483647</maxFieldLength>
    <lockType>simple</lockType>
    <unlockOnStartup>true</unlockOnStartup>
  </indexConfig>
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoSoftCommit>
      <maxDocs>500</maxDocs>
      <maxTime>1000</maxTime>
    </autoSoftCommit>
    <autoCommit>
      <maxDocs>5</maxDocs>
      <maxTime>30</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>

  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="false"
        multipartUploadLimitInKB="204800" />
  </requestDispatcher>

  <requestHandler name="standard" class="solr.StandardRequestHandler"
      default="true" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin/"
      class="org.apache.solr.handler.admin.AdminHandlers" />
  <requestHandler name="/replication" class="solr.ReplicationHandler" />
  <directoryFactory name="DirectoryFactory"
      class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}" />
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
  <admin>
    <defaultQuery>*:*</defaultQuery>
  </admin>
</config>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060628.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What Happens to Consistency if I kill a Leader and Startup it again?

2013-05-03 Thread Furkan KAMACI
Shawn, thanks for the detailed answer; it explains everything. I think
there is no problem. I will use 4.3 when it is available, and if I see a
situation like that I will report it.

2013/5/3 Shawn Heisey s...@elyograg.org

 On 5/2/2013 2:19 PM, Furkan KAMACI wrote:
  I see that at my admin page:
 
  Replication (Slave)   Version         Gen   Size
  Master:               1367307652512   82    778.04 MB
  Slave:                1367307658862   82    781.05 MB
 
  and I started to figure about it so that's why I asked this question.

 As we've been trying to tell you, the sizes can (and will) be different
 between replicas on SolrCloud.  Also, if you're not running a recent
 release candidate of 4.3, then the version numbers on the replication
 screen are misleading.  See SOLR-4661 for more details.

 Your example of version numbers like 100, 90, and 95 wouldn't actually
 happen, because the version number is based on the current time in
 milliseconds since 1970-01-01 00:00:00 UTC.  If you index after killing
 the leader, the new leader's version number will be higher than the
 offline replica.

 If you can find actual proof of a problem with index updates related to
 killing the leader, then we can take the bug report and work on fixing
 it.  Here's how you would go about finding proof.  It would be easiest
 to have one shard, but if you want to make sure it's OK with multiple
 shards, you would have to kill all the leaders.

 * Start with a functional collection with two replicas.
 * Index a document with a recognizable ID like A.
 * Make sure you can find document A.
 * Kill the leader replica, let's say it was replica1.
 * Make sure replica2 becomes leader.
 * Make sure you can find document A.
 * Index document B.
 * Start replica1, wait for it to turn green.
 * Make sure you can still find document B.
 * Kill the leader again, this time it's replica2.
 * Make sure you can still find document B.

 To my knowledge, nobody has reported a real problem with proof.  I would
 imagine that more than one person has done testing like this to make
 sure that SolrCloud is reliable.

 Thanks,
 Shawn




Re: Does Near Real Time get not supported at SolrCloud?

2013-05-03 Thread Furkan KAMACI
Do soft commits get distributed to the nodes of SolrCloud?

2013/5/3 Otis Gospodnetic otis.gospodne...@gmail.com

 NRT works with SolrCloud.

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/

 On May 2, 2013 5:34 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 
  Is Near Real Time search not supported in SolrCloud?
 
  I mean, when a soft commit occurs at a leader, I think it doesn't
  distribute to the replicas (because it is not on storage - do in-RAM
  indexes get distributed to replicas too?), so what happens when a search
  query comes in?



Re: Rearranging Search Results of a Search?

2013-05-03 Thread Furkan KAMACI
I think this looks like what I'm searching for:
https://issues.apache.org/jira/browse/SOLR-4465

What about Lucene's post filter - can it help for my purpose?

2013/5/3 Otis Gospodnetic otis.gospodne...@gmail.com

 Hi,

 You should use search more often :)

 http://search-lucene.com/?q=scriptable+collector&sort=newestOnTop&fc_project=Solr&fc_type=issue

 Coincidentally, what you see there happens to be a good example of a
 Solr component that does something behind the scenes to deliver those
 search results even though my original query was bad.  Kind of
 similar to what you are after.

 Otis
 --
 Solr & ElasticSearch Support
 http://sematext.com/





 On Thu, May 2, 2013 at 4:47 PM, Furkan KAMACI furkankam...@gmail.com
 wrote:
  I know that I can use boosting at query time for a field or a search term,
  in solrconfig.xml, and the query elevator, so I can arrange the results of
  a search. However, after I get the top documents, how can I change the
  order of the results? Is Lucene's post filter meant for that?



Re: Delete from Solr Cloud 4.0 index..

2013-05-03 Thread Annette Newton
Thanks Shawn.

I have played around with Soft Commits before and didn't seem to have any
improvement, but with the current load testing I am doing I will give it
another go.

I have researched docValues and came across the fact that it would increase
the index size.  With the upgrade to 4.2.1 the index size has reduced by
approx 33% which is pleasing and I don't really want to lose that saving.

We do use the facet.method=enum approach - which works really well, but I will
verify that we are using it in every instance; we have numerous
developers working on the product and maybe one or two have slipped
through.
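
For what it's worth, a minimal SolrJ sketch of forcing facet.method=enum on a
per-request basis, which could help audit the places that slipped through
(the field name is hypothetical):

[code]
import org.apache.solr.client.solrj.SolrQuery;

public class EnumFacetQuery {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);
    q.addFacetField("category");      // hypothetical facet field
    q.set("facet.method", "enum");    // override the default of fc
    System.out.println(q);            // prints the encoded query parameters
  }
}
[/code]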

Right from the first I upped the zkClientTimeout to 30 as I wanted to give
extra time for any network blips that we experience on AWS.  We only seem
to drop communication on a full garbage collection though.

I am coming to the conclusion that we need to have more shards to cope with
the writes, so I will play around with adding more shards and see how I go.


I appreciate you having a look over our setup and the advice.

Thanks again.

Netty.


On 2 May 2013 23:17, Shawn Heisey s...@elyograg.org wrote:

 On 5/2/2013 4:24 AM, Annette Newton wrote:
  Hi Shawn,
 
  Thanks so much for your response.  We basically are very write intensive
  and write throughput is pretty essential to our product.  Reads are
  sporadic and actually functioning really well.
 
  We write on average (at the moment) 8-12 batches of 35 documents per
  minute.  But we really will be looking to write more in the future, so we
  need to work out scaling of Solr and how to cope with more volume.
 
  Schema (I have changed the names) :
 
  http://pastebin.com/x1ry7ieW
 
  Config:
 
  http://pastebin.com/pqjTCa7L

 This is very clean.  There's probably more you could remove/comment, but
 generally speaking I couldn't find any glaring issues.  In particular,
 you have disabled autowarming, which is a major contributor to commit
 speed problems.

 The first thing I think I'd try is increasing zkClientTimeout to 30 or
 60 seconds.  You can use the startup commandline or solr.xml, I would
 probably use the latter.  Here's a solr.xml fragment that uses a system
 property or a 15 second default:

 <?xml version="1.0" encoding="UTF-8" ?>
 <solr persistent="true" sharedLib="lib">
   <cores adminPath="/admin/cores"
          zkClientTimeout="${zkClientTimeout:15000}" hostPort="${jetty.port:}"
          hostContext="solr">

 General thoughts, these changes might not help this particular issue:
 You've got autoCommit with openSearcher=true.  This is a hard commit.
 If it were me, I would set that up with openSearcher=false and either do
 explicit soft commits from my application or set up autoSoftCommit with
 a shorter timeframe than autoCommit.

 This might simply be a scaling issue, where you'll need to spread the
 load wider than four shards.  I know that there are financial
 considerations with that, and they might not be small, so let's leave
 that alone for now.

 The memory problems might be a symptom/cause of the scaling issue I just
 mentioned.  You said you're using facets, which can be a real memory hog
 even with only a few of them.  Have you tried facet.method=enum to see
 how it performs?  You'd need to switch to it exclusively, never go with
 the default of fc.  You could put that in the defaults or invariants
 section of your request handler(s).

 Another way to reduce memory usage for facets is to use disk-based
 docValues on version 4.2 or later for the facet fields, but this will
 increase your index size, and your index is already quite large.
 Depending on your index contents, the increase may be small or large.

 Something to just mention: It looks like your solrconfig.xml has
 hard-coded absolute paths for dataDir and updateLog.  This is fine if
 you'll only ever have one core/collection on each server, but it'll be a
 disaster if you have multiples.  I could be wrong about how these get
 interpreted in SolrCloud -- they might actually be relative despite
 starting with a slash.

 Thanks,
 Shawn




-- 

Annette Newton

Database Administrator

ServiceTick Ltd



T:+44(0)1603 618326



Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ

www.servicetick.com

www.sessioncam.com



Re: Delete from Solr Cloud 4.0 index..

2013-05-03 Thread Annette Newton
One question Shawn - did you ever get any costings around Zing? Did you
trial it?

Thanks.


On 3 May 2013 10:03, Annette Newton annette.new...@servicetick.com wrote:

 Thanks Shawn.

 I have played around with Soft Commits before and didn't seem to have any
 improvement, but with the current load testing I am doing I will give it
 another go.

 I have researched docValues and came across the fact that it would
 increase the index size.  With the upgrade to 4.2.1 the index size has
 reduced by approx 33% which is pleasing and I don't really want to lose
 that saving.

  We do use the facet.method=enum approach - which works really well, but I
  will verify that we are using it in every instance; we have numerous
  developers working on the product and maybe one or two have slipped
  through.

 Right from the first I upped the zkClientTimeout to 30 as I wanted to give
 extra time for any network blips that we experience on AWS.  We only seem
 to drop communication on a full garbage collection though.

 I am coming to the conclusion that we need to have more shards to cope
 with the writes, so I will play around with adding more shards and see how
 I go.

 I appreciate you having a look over our setup and the advice.

 Thanks again.

 Netty.


 On 2 May 2013 23:17, Shawn Heisey s...@elyograg.org wrote:

 On 5/2/2013 4:24 AM, Annette Newton wrote:
  Hi Shawn,
 
  Thanks so much for your response.  We basically are very write intensive
  and write throughput is pretty essential to our product.  Reads are
  sporadic and actually functioning really well.
 
  We write on average (at the moment) 8-12 batches of 35 documents per
  minute.  But we really will be looking to write more in the future, so we
  need to work out scaling of Solr and how to cope with more volume.
 
  Schema (I have changed the names) :
 
  http://pastebin.com/x1ry7ieW
 
  Config:
 
  http://pastebin.com/pqjTCa7L

 This is very clean.  There's probably more you could remove/comment, but
 generally speaking I couldn't find any glaring issues.  In particular,
 you have disabled autowarming, which is a major contributor to commit
 speed problems.

 The first thing I think I'd try is increasing zkClientTimeout to 30 or
 60 seconds.  You can use the startup commandline or solr.xml, I would
 probably use the latter.  Here's a solr.xml fragment that uses a system
 property or a 15 second default:

  <?xml version="1.0" encoding="UTF-8" ?>
  <solr persistent="true" sharedLib="lib">
    <cores adminPath="/admin/cores"
           zkClientTimeout="${zkClientTimeout:15000}" hostPort="${jetty.port:}"
           hostContext="solr">

 General thoughts, these changes might not help this particular issue:
 You've got autoCommit with openSearcher=true.  This is a hard commit.
 If it were me, I would set that up with openSearcher=false and either do
 explicit soft commits from my application or set up autoSoftCommit with
 a shorter timeframe than autoCommit.

 This might simply be a scaling issue, where you'll need to spread the
 load wider than four shards.  I know that there are financial
 considerations with that, and they might not be small, so let's leave
 that alone for now.

 The memory problems might be a symptom/cause of the scaling issue I just
 mentioned.  You said you're using facets, which can be a real memory hog
 even with only a few of them.  Have you tried facet.method=enum to see
 how it performs?  You'd need to switch to it exclusively, never go with
 the default of fc.  You could put that in the defaults or invariants
 section of your request handler(s).

 Another way to reduce memory usage for facets is to use disk-based
 docValues on version 4.2 or later for the facet fields, but this will
 increase your index size, and your index is already quite large.
 Depending on your index contents, the increase may be small or large.

 Something to just mention: It looks like your solrconfig.xml has
 hard-coded absolute paths for dataDir and updateLog.  This is fine if
 you'll only ever have one core/collection on each server, but it'll be a
 disaster if you have multiples.  I could be wrong about how these get
 interpreted in SolrCloud -- they might actually be relative despite
 starting with a slash.

 Thanks,
 Shawn




 --

 Annette Newton

 Database Administrator

 ServiceTick Ltd



 T:+44(0)1603 618326



 Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ

 www.servicetick.com

 www.sessioncam.com




-- 

Annette Newton

Database Administrator

ServiceTick Ltd



T:+44(0)1603 618326



Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ

www.servicetick.com

www.sessioncam.com


Performance considerations when using distributed indexing + loadbalancing with Solr cloud

2013-05-03 Thread Edd Grant
Hi all,

I have been playing with Solr Cloud recently and am enjoying the
distributed indexing capability.

At the moment my SolrCloud consists of 2 leaders and 2 replicas which are
fronted by an HAProxy instance. I want to maximise performance for indexing
and it occurred to me that the model I use for loadbalancing my indexing
requests may impact performance. i.e. am I likely to see better indexing
performance if I stick certain groups of requests to certain nodes vs
simply using a round robin approach?

I'll be doing some empirical testing to try and figure this out, but was
wondering if there's any general guidance here? Or if anyone has any
experience of particularly good/bad configurations?

Many thanks,

Edd

-- 
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543


Good Desktop Search?

2013-05-03 Thread Savia Beson
Hi everybody,
just a simple question:
is there any Solr/Lucene-based desktop search project around that someone might
recommend?
I am looking for something for personal use that is kind of mature, or at least
stable, runs on Java, and does not require admin rights to install. Nothing too
fancy.

Thanks/S.





Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

2013-05-03 Thread Furkan KAMACI
Do you use CloudSolrServer when you push documents into SolrCloud to be
indexed?

2013/5/3 Edd Grant e...@eddgrant.com

 Hi all,

 I have been playing with Solr Cloud recently and am enjoying the
 distributed indexing capability.

 At the moment my SolrCloud consists of 2 leaders and 2 replicas which are
 fronted by an HAProxy instance. I want to maximise performance for indexing
 and it occurred to me that the model I use for loadbalancing my indexing
 requests may impact performance. i.e. am I likely to see better indexing
 performance if I stick certain groups of requests to certain nodes vs
 simply using a round robin approach?

 I'll be doing some empirical testing to try and figure this out, but was
 wondering if there's any general guidance here? Or if anyone has any
 experience of particularly good/bad configurations?

 Many thanks,

 Edd

 --
 Web: http://www.eddgrant.com
 Email: e...@eddgrant.com
 Mobile: +44 (0) 7861 394 543



Re: Good Desktop Search?

2013-05-03 Thread Paul Libbrecht
Savia,

maybe not very mature yet, but someone on java-us...@lucene.apache.org 
announced such a tool the other day.
I'm copying it below.
I do not know of many otherwise.

paul

 Hi everybody, 
 just a simple question 
 is there any solr/lucene based desktop search project around someone might 
 recommend? 
 I am looking for something for personal use that is kind of mature, at least 
 stable,  runs on java and does not require admin rights to install. Nothing 
 too fancy.



Begin forwarded message:

 From: Mirko Sertic mirko.ser...@web.de
 Date: 29 avril 2013 21:20:19 HAEC
 To: java-u...@lucene.apache.org
 Subject: Lucene Desktop Search Engine with JavaFX/Tika/Filesystem 
 Crawler/HTML5
 Reply-To: java-u...@lucene.apache.org
 
 Hi@all
 
 Lucene rocks, and based on some JavaFX/HTML5 hybrids I built a small Java
 search engine for your desktop!
 
 The prototype and the result can be seen here:
 
 http://www.mirkosertic.de/doku.php/javastuff/fxdesktopsearch
 
 I am using a multithreaded pipes-and-filters architecture, with Tika as the
 content extraction framework and of course Lucene as the fulltext engine. It
 really rocks: I can search thousands of documents with syntax highlighting
 within a few milliseconds. It also supports MoreLikeThis queries showing
 document similarities.
 
 Thanks @all working on Lucene!
 
 I am planning future releases of the desktop search engine with faceted
 search based on Tika-extracted document metadata. Also, NLP with named entity
 extraction might be a use case, so everyone who is willing to contribute is
 very welcome. The source code is OSS and hosted on Google Code here:
 
 http://code.google.com/p/freedesktopsearch/
 
 Regards
 Mirko
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 


Re: Good Desktop Search?

2013-05-03 Thread Savia Beson
Thanks Paul,  I missed that one. 


On May 3, 2013, at 2:27 PM, Paul Libbrecht p...@hoplahup.net wrote:

 Savia,
 
 maybe not very mature yet, but someone on java-us...@lucene.apache.org 
 announced such a tool the other day.
 I'm copying it below.
 I do not know of many otherwise.
 
 paul
 
 Hi everybody, 
 just a simple question 
 is there any solr/lucene based desktop search project around someone might 
 recommend? 
 I am looking for something for personal use that is kind of mature, at least 
 stable,  runs on java and does not require admin rights to install. Nothing 
 too fancy.
 
 
 
 Begin forwarded message:
 
 From: Mirko Sertic mirko.ser...@web.de
 Date: 29 avril 2013 21:20:19 HAEC
 To: java-u...@lucene.apache.org
 Subject: Lucene Desktop Search Engine with JavaFX/Tika/Filesystem 
 Crawler/HTML5
 Reply-To: java-u...@lucene.apache.org
 
 Hi@all
 
  Lucene rocks, and based on some JavaFX/HTML5 hybrids I built a small Java
  search engine for your desktop!
 
 The prototype and the result can be seen here:
 
 http://www.mirkosertic.de/doku.php/javastuff/fxdesktopsearch
 
  I am using a multithreaded pipes-and-filters architecture, with Tika as the
  content extraction framework and of course Lucene as the fulltext engine. It
  really rocks: I can search thousands of documents with syntax highlighting
  within a few milliseconds. It also supports MoreLikeThis queries showing
  document similarities.
 
 Thanks @all working on Lucene!
 
  I am planning future releases of the desktop search engine with faceted
  search based on Tika-extracted document metadata. Also, NLP with named entity
  extraction might be a use case, so everyone who is willing to contribute is
  very welcome. The source code is OSS and hosted on Google Code here:
 
 http://code.google.com/p/freedesktopsearch/
 
 Regards
 Mirko
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 



Solr 4 reload failed core

2013-05-03 Thread Peter Kirk
Hi

I have a multi-core installation, with 2 cores. Sometimes, when Solr starts up, 
one of the cores fails (due to an extension to Solr I have, which is waiting on 
an external service which has yet to initialise).

In previous versions of Solr, I could subsequently issue a RELOAD to this core, 
even though it was in a fail state, and it would reload and start up.
Now it seems with Solr 4, I cannot issue a RELOAD to a core which has failed.

Is this the case?

How can I get Solr to start a core which failed on initial start up?

Thanks,
Peter






Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

2013-05-03 Thread Edd Grant
Hi,

No, we're actually POSTing them over plain old HTTP. Our feeder process
simply points at the HAProxy box and posts merrily away.
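
For reference, a minimal sketch of that feeder pattern, assuming SolrJ's
HttpSolrServer pointed at the load balancer (the URL and document are
illustrative):

[code]
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LbFeeder {
  public static void main(String[] args) throws Exception {
    // All updates go through the HAProxy VIP; Solr forwards each document
    // to the correct shard leader internally.
    HttpSolrServer server = new HttpSolrServer("http://haproxy-vip:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    server.add(doc);
  }
}
[/code]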

Cheers,

Edd


On 3 May 2013 13:17, Furkan KAMACI furkankam...@gmail.com wrote:

 Do you use CloudSolrServer when you push documents into SolrCloud to be
 indexed?

 2013/5/3 Edd Grant e...@eddgrant.com

  Hi all,
 
  I have been playing with Solr Cloud recently and am enjoying the
  distributed indexing capability.
 
  At the moment my SolrCloud consists of 2 leaders and 2 replicas which are
  fronted by an HAProxy instance. I want to maximise performance for
 indexing
  and it occurred to me that the model I use for loadbalancing my indexing
  requests may impact performance. i.e. am I likely to see better indexing
  performance if I stick certain groups of requests to certain nodes vs
  simply using a round robin approach?
 
  I'll be doing some empirical testing to try and figure this out, but was
  wondering if there's any general guidance here? Or if anyone has any
  experience of particularly good/bad configurations?
 
  Many thanks,
 
  Edd
 
  --
  Web: http://www.eddgrant.com
  Email: e...@eddgrant.com
  Mobile: +44 (0) 7861 394 543
 




-- 
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543


Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

2013-05-03 Thread Furkan KAMACI
If you index them with CloudSolrServer, your client will learn from ZooKeeper
where the data should go and send it directly to the relevant shard leader.
However, if you use some other random process, the data will go to any of the
nodes and will then be routed to the right place within the cluster. This
extra routing hop within the cluster may cause unnecessary network traffic
and add latency to indexing as well.
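
A minimal sketch of the CloudSolrServer approach, assuming SolrJ 4.x (the
ZooKeeper ensemble string and collection name are illustrative):

[code]
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudFeeder {
  public static void main(String[] args) throws Exception {
    // The client watches cluster state in ZooKeeper, so documents are
    // sent straight to the right shard leader with no extra routing hop.
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    server.add(doc);
  }
}
[/code]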

2013/5/3 Edd Grant e...@eddgrant.com

 Hi,

 No, we're actually POSTing them over plain old HTTP. Our feeder process
 simply points at the HAProxy box and posts merrily away.

 Cheers,

 Edd


 On 3 May 2013 13:17, Furkan KAMACI furkankam...@gmail.com wrote:

  Do you use CloudSolrServer when you push documents into SolrCloud to be
  indexed?
 
  2013/5/3 Edd Grant e...@eddgrant.com
 
   Hi all,
  
   I have been playing with Solr Cloud recently and am enjoying the
   distributed indexing capability.
  
   At the moment my SolrCloud consists of 2 leaders and 2 replicas which
 are
   fronted by an HAProxy instance. I want to maximise performance for
  indexing
   and it occurred to me that the model I use for loadbalancing my
 indexing
   requests may impact performance. i.e. am I likely to see better
 indexing
   performance if I stick certain groups of requests to certain nodes vs
   simply using a round robin approach?
  
   I'll be doing some empirical testing to try and figure this out, but was
   wondering if there's any general guidance here? Or if anyone has any
   experience of particularly good/bad configurations?
  
   Many thanks,
  
   Edd
  
   --
   Web: http://www.eddgrant.com
   Email: e...@eddgrant.com
   Mobile: +44 (0) 7861 394 543
  
 



 --
 Web: http://www.eddgrant.com
 Email: e...@eddgrant.com
 Mobile: +44 (0) 7861 394 543



Re: commit in solr4 takes a longer time

2013-05-03 Thread vicky desai
Hi All,

Setting the openSearcher flag to false worked and gave me a visible
improvement in commit time. One thing to note is that when using the
SolrJ client we have to call server.commit(false, false), which I was doing
incorrectly and hence was not able to see the improvement earlier.
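
For reference, a minimal sketch of that two-argument SolrJ commit (the URL is
illustrative; the server could be any SolrServer implementation):

[code]
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class NonBlockingCommit {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    // waitFlush=false, waitSearcher=false: return without blocking until
    // the flush completes or a new searcher is registered.
    server.commit(false, false);
  }
}
[/code]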

Thanks everyone



--
View this message in context: 
http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060688.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Delete from Solr Cloud 4.0 index..

2013-05-03 Thread Shawn Heisey
On 5/3/2013 3:22 AM, Annette Newton wrote:
 One question Shawn - did you ever get any costings around Zing? Did you
 trial it?

I never did do a trial.  I asked them for a cost and they didn't have an
immediate answer; they wanted to do a phone call and get a lot of information
about my setup.  The price apparently has a lot of variance based on the
specific environment, so I didn't pursue it, figuring that the cost
would be higher than my superiors are willing to pay.

The only information I could find about the cost of Zing was a very
recent Register article that had this to say:

Azul is similarly cagey about what a supported version of the Zing JVM
costs, and only says that Zing costs around what a supported version of
an Oracle, IBM, or Red Hat JVM will run enterprises and that it has an
annual subscription model for Zing pricing. You can't easily get pricing
for Oracle, IBM, or Red Hat JVMs, of course, so the comparison is
accurate but perfectly useless.

http://www.theregister.co.uk/2013/04/08/azul_systems_zing_lmax_exchange/

Thanks,
Shawn



Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

2013-05-03 Thread Edd Grant
Thanks, that's exactly what I was worried about. If I take your suggested
approach of using CloudSolrServer and the feeder learns which shard leader
to target, then if the shard leader goes down midway through indexing,
I've lost my ability to index. Whereas if I take the route of making all
updates via the HAProxy instance then I've got HA but at the cost of
performance.

This has me wondering if it might be feasible to address each shard with a
VIP? Then if the leader of the shard goes down and a replica is elected as
the leader, it could also take the VIP, so in essence we'd always be sending
messages to the leader. Has anyone tried anything like this?

Cheers,

Edd


On 3 May 2013 15:22, Furkan KAMACI furkankam...@gmail.com wrote:

 If you index them with CloudSolrServer, your client will learn from ZooKeeper
 where the data should go and send it directly to the relevant shard leader.
 However, if you use some other random process, the data will go to any of the
 nodes and will then be routed to the right place within the cluster. This
 extra routing hop within the cluster may cause unnecessary network traffic
 and add latency to indexing as well.

 2013/5/3 Edd Grant e...@eddgrant.com

  Hi,
 
  No, we're actually POSTing them over plain old HTTP. Our feeder process
  simply points at the HAProxy box and posts merrily away.
 
  Cheers,
 
  Edd
 
 
  On 3 May 2013 13:17, Furkan KAMACI furkankam...@gmail.com wrote:
 
   Do you use CloudSolrServer when you push documents into SolrCloud to be
   indexed?
  
   2013/5/3 Edd Grant e...@eddgrant.com
  
Hi all,
   
I have been playing with Solr Cloud recently and am enjoying the
distributed indexing capability.
   
At the moment my SolrCloud consists of 2 leaders and 2 replicas which
  are
fronted by an HAProxy instance. I want to maximise performance for
   indexing
and it occurred to me that the model I use for loadbalancing my
  indexing
requests may impact performance. i.e. am I likely to see better
  indexing
performance if I stick certain groups of requests to certain nodes vs
simply using a round robin approach?
   
 I'll be doing some empirical testing to try and figure this out, but was
 wondering if there's any general guidance here? Or if anyone has any
 experience of particularly good/bad configurations?
   
Many thanks,
   
Edd
   
--
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543
   
  
 
 
 
  --
  Web: http://www.eddgrant.com
  Email: e...@eddgrant.com
  Mobile: +44 (0) 7861 394 543
 




-- 
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543


Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

2013-05-03 Thread Shawn Heisey
On 5/3/2013 8:35 AM, Edd Grant wrote:
 Thanks, that's exactly what I was worried about. If I take your suggested
 approach of using CloudSolrServer and the feeder learns which shard leader
 to target, then if the shard leader goes down midway through indexing,
 I've lost my ability to index. Whereas if I take the route of making all
 updates via the HAProxy instance then I've got HA but at the cost of
 performance.
 
 This has me wondering if it might be feasible to address each shard with a
 VIP? Then if the leader of the shard goes down and a replica is elected as
 the leader, it could also take the VIP, so in essence we'd always be sending
 messages to the leader. Has anyone tried anything like this?

CloudSolrServer is part of the SolrJ (Java) API.  It incorporates a
zookeeper client.  To initialize it, you don't tell it about your Solr
servers, you give it the same zookeeper host information that you give
to Solr when starting in cloud mode.  It always knows the current state
of the cluster, so if you have a failure, it adjusts so that your
queries and updates don't fail.  That also means that it will know when
servers are added to or removed from the cloud.

http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html
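
To make that concrete, a minimal initialization sketch, assuming SolrJ 4.x
(the zkHost string is illustrative and should match what the Solr nodes were
started with):

[code]
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class CloudClient {
  public static void main(String[] args) throws Exception {
    // Same ZooKeeper ensemble string you pass to Solr via -DzkHost
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    server.connect(); // optional: fail fast if ZooKeeper is unreachable
    System.out.println(server.query(new SolrQuery("*:*")).getResults().getNumFound());
  }
}
[/code]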

Thanks,
Shawn



Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

2013-05-03 Thread Edd Grant
Aah I see - very useful. Thanks!


On 3 May 2013 15:49, Shawn Heisey s...@elyograg.org wrote:

 On 5/3/2013 8:35 AM, Edd Grant wrote:
  Thanks, that's exactly what I was worried about. If I take your suggested
  approach of using CloudSolrServer and the feeder learns which shard leader
  to target, then if the shard leader goes down midway through indexing,
  I've lost my ability to index. Whereas if I take the route of making all
  updates via the HAProxy instance then I've got HA but at the cost of
  performance.
 
  This has me wondering if it might be feasible to address each shard with a
  VIP? Then if the leader of the shard goes down and a replica is elected as
  the leader, it could also take the VIP, so in essence we'd always be sending
  messages to the leader. Has anyone tried anything like this?

 CloudSolrServer is part of the SolrJ (Java) API.  It incorporates a
 zookeeper client.  To initialize it, you don't tell it about your Solr
 servers, you give it the same zookeeper host information that you give
 to Solr when starting in cloud mode.  It always knows the current state
 of the cluster, so if you have a failure, it adjusts so that your
 queries and updates don't fail.  That also means that it will know when
 servers are added to or removed from the cloud.


 http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html

 Thanks,
 Shawn




-- 
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543


Re: commit in solr4 takes a longer time

2013-05-03 Thread vicky desai
Hi,

After using the following config

<updateHandler class="solr.DirectUpdateHandler2">
  <autoSoftCommit>
    <maxDocs>500</maxDocs>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>5000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

When a commit operation is fired I am getting the following log:
INFO: start
commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

Even though openSearcher is false, waitSearcher is true. Can that be set
to false too? Will that give a performance improvement, and what is the
config for that?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060706.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Does Near Real Time get not supported at SolrCloud?

2013-05-03 Thread Timothy Potter
Yes, absolutely - NRT was a big driver for the leader-to-replica
distribution approach in SolrCloud.

On Fri, May 3, 2013 at 1:14 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 Do soft commits get distributed to the nodes of SolrCloud?

 2013/5/3 Otis Gospodnetic otis.gospodne...@gmail.com

 NRT works with SolrCloud.

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/

 On May 2, 2013 5:34 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 
   Is Near Real Time search not supported in SolrCloud?
  
   I mean, when a soft commit occurs at a leader, I think it doesn't
   distribute to the replicas (because it is not on storage - do in-RAM
   indexes get distributed to replicas too?), so what happens when a search
   query comes in?



Re: Solr metrics in Codahale metrics and Graphite?

2013-05-03 Thread Furkan KAMACI
Has anybody tested Ganglia with JMXTrans in a production environment for
SolrCloud?

2013/4/26 Dmitry Kan solrexp...@gmail.com

 Alan, Shawn,

 If backporting to 3.x is hard, no worries - we don't necessarily require the
 patch, as we are heading to 4.x eventually. It is just much easier within
 our organization to test on the existing Solr 3.4, as there are a few
 internal dependencies and custom code on top of Solr. Also, Solr upgrades on
 production systems are usually pushed forward by a month or so after starting
 the upgrade on development systems (it requires lots of testing and
 verification).

 Nevertheless, it is a good effort to make #solr #graphite-friendly, so keep
 it up! :)

 Dmitry




 On Thu, Apr 25, 2013 at 9:29 PM, Shawn Heisey s...@elyograg.org wrote:

  On 4/25/2013 6:30 AM, Dmitry Kan wrote:
   We are very much interested in 3.4.
  
   On Thu, Apr 25, 2013 at 12:55 PM, Alan Woodward a...@flax.co.uk
 wrote:
    This is on top of trunk at the moment, but would be backported to 4.4
    if there was interest.
 
  This will be bad news, I'm sorry:
 
  All remaining work on 3.x versions happens in the 3.6 branch. This
  branch is in maintenance mode.  It will only get fixes for serious bugs
  with no workaround.  Improvements and new features won't be considered
  at all.
 
  You're welcome to try backporting patches from newer issues.  Due to the
  major differences in the 3x and 4x codebases, the best case scenario is
  that you'll be facing a very manual task.  Some changes can't be
  backported because they rely on other features only found in 4.x code.
 
  Thanks,
  Shawn
 
 



Re: commit in solr4 takes a longer time

2013-05-03 Thread Gopal Patwa
Since you have defined auto commit options for both hard and soft commits,
you don't have to explicitly call commit from the SolrJ client. And
openSearcher=false for hard commits will make hard commits faster, since it
only makes sure that recent changes are flushed to disk (for durability)
and does not open any searcher.

Can you post your log from when a soft commit and a hard commit happen?

You can read about waitFlush=false and waitSearcher=false, which default
to true; see below from the Javadoc:

JavaDoc:
waitFlush: block until index changes are flushed to disk
waitSearcher: block until a new searcher is opened and registered as the
main query searcher, making the changes visible


On Fri, May 3, 2013 at 7:19 AM, vicky desai vicky.de...@germinait.comwrote:

 Hi All,

 Setting the openSearcher flag to false worked and gave me a visible
 improvement in commit time. One thing to note is that when using the
 SolrJ client we have to call server.commit(false, false), which I was doing
 incorrectly and hence was not able to see the improvement earlier.

 Thanks everyone



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060688.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: commit in solr4 takes a longer time

2013-05-03 Thread vicky desai
Hi,

When an auto commit operation is fired I am getting the following log:
INFO: start
commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

Setting openSearcher to false definitely gave me a lot of performance
improvement, but I was wondering if waitSearcher can also be set to false,
and whether that would give me a performance gain too.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060715.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: any plans to remove int32 limitation on the number of the documents in the index?

2013-05-03 Thread Erick Erickson
My off the cuff thought is that there are significant costs trying to
do this that would be paid by 99.999% of setups out there. Also,
usually you'll run into other issues (RAM etc) long before you come
anywhere close to 2^31 docs.

Lucene/Solr often allocates int[maxDoc] arrays for various operations. When
maxDoc approaches 2^31, memory goes through the roof. Now
consider allocating longs instead...
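
A back-of-envelope illustration of that cost (illustrative numbers, not from
the thread):

[code]
public class MaxDocMath {
  public static void main(String[] args) {
    long maxDoc = 1L << 31;            // ~2.15 billion documents
    System.out.println(maxDoc * 4);    // int[maxDoc]  ~ 8.6 GB for one array
    System.out.println(maxDoc * 8);    // long[maxDoc] ~ 17.2 GB for one array
  }
}
[/code]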

Which is a long way of saying that I don't really think anyone's going
to be working on this any time soon, especially when SolrCloud removes
a LOT of the pain/complexity (from a user perspective anyway) of
going to a sharded setup...

FWIW,
Erick

On Thu, May 2, 2013 at 1:17 PM, Valery Giner valgi...@research.att.com wrote:
 Otis,

 The documents themselves are relatively small - tens of fields, only a few
 of which could be up to a hundred bytes.
 Linux servers with relatively large RAM (256).
 Minutes on the searches are fine for our purposes; adding a few tens of
 millions of records in tens of minutes is also fine.
 We had to do some simple tricks to keep indexing up to speed, but nothing
 too fancy.
 Moving to sharding adds a layer of complexity which we don't really need
 because of the above... and adding complexity may result in lower
 reliability :)

 Thanks,
 Val


 On 05/02/2013 03:41 PM, Otis Gospodnetic wrote:

 Val,

 Haven't seen this mentioned in a while...

 I'm curious...what sort of index, queries, hardware, and latency
 requirements do you have?

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On May 1, 2013 4:36 PM, Valery Giner valgi...@research.att.com wrote:

 Dear Solr Developers,

 I've been unable to find an answer to the question in the subject line of
 this e-mail, except for a vague one.

 We need to be able to index over 2bln+ documents.   We were doing well
 without sharding until the number of docs hit the limit ( 2bln+).   The
 performance was satisfactory for the queries, updates and indexing of new
 documents.

 That is, except for the need to get around the int32 limit, we don't
 really
 have a need for setting up distributed Solr.

 I wonder whether someone on the Solr team could tell us when, and in what
 version of Solr, we could expect the limit to be removed.

 I hope this question may be of interest to some one else :)

 --
 Thanks,
 Val





Re: transientCacheSize doesn't seem to have any effect, except on startup

2013-05-03 Thread Erick Erickson
The cores aren't loaded (or at least shouldn't be) for getting the status.
The _names_ of the cores should be returned, but those are (supposed) to be
retrieved from a list rather than from loaded cores. So are you sure that's
not what you are seeing? How are you determining whether the cores are
actually loaded or not?
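
One way to check from SolrJ rather than curl - a sketch assuming SolrJ 4.x
and an illustrative base URL; whether an entry in this status output implies
the core is actually loaded is exactly the open question here:

[code]
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;

public class CoreStatus {
  public static void main(String[] args) throws Exception {
    // Base URL without a core name, since this is a CoreAdmin call
    HttpSolrServer admin = new HttpSolrServer("http://localhost:8983/solr");
    CoreAdminResponse status = CoreAdminRequest.getStatus(null, admin); // null = all cores
    System.out.println(status.getCoreStatus());
  }
}
[/code]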

That said, it's perfectly possible that the status command is doing something
we didn't anticipate, but I took a quick look at the code (got to rush to a
plane) and CoreAdminHandler _appears_ to just return whatever info it can
about an unloaded core for status. I _think_ you'll get more info if the
core has ever been loaded, though, even if it's since been removed from
the transient cache. Ditto for the create action.

So let's figure out whether you're really seeing loaded cores or not, and then
raise a JIRA if so...

Thanks for reporting!
Erick

On Thu, May 2, 2013 at 1:27 PM, didier deshommes dfdes...@gmail.com wrote:
 Hi,
 I've been very interested in the transient core feature of Solr for managing
 a large number of cores. I'm especially interested in this use case, which
 the wiki lists at http://wiki.apache.org/solr/LotsOfCores (looks to be down
 now):

loadOnStartup=false transient=true: This is really the use-case. There are
 a large number of cores in your system that are short-duration use. You
 want Solr to load them as necessary, but unload them when the cache gets
 full on an LRU basis.

 I'm creating 10 transient cores via the core admin, like so:

 $ curl "http://localhost:8983/solr/admin/cores?wt=json&action=CREATE&name=new_core2&instanceDir=collection1/&dataDir=new_core2&transient=true&loadOnStartup=false"
 

 and have transientCacheSize=2 in my solr.xml file, which I take to mean I
 should have at most 2 transient cores loaded at any time. The problem is
 that these cores are still loaded when I ask Solr to list cores:

 $ curl "http://localhost:8983/solr/admin/cores?wt=json&action=status"

 From the explanation in the wiki, it looks like Solr should manage loading
 and unloading transient cores for me without my having to worry about them,
 but this is not what's happening.

 The situation is different when I restart Solr; it does the right thing
 by loading at most the number of cores set by transientCacheSize. When I add
 more cores, the old behavior happens again, where all created transient
 cores are loaded in Solr.

 I'm using the development branch lucene_solr_4_3 to run my example. I can
 open a JIRA if need be.


Re: socket write error

2013-05-03 Thread Dmitry Kan
After some more debugging I have found out that one of the requests had a
size of 4.4MB. The default maxPostSize in tomcat6 is 2MB (
http://tomcat.apache.org/tomcat-6.0-doc/config/ajp.html).

Changing that to 10MB has greatly improved the situation on the Solr side.

Dmitry


On Fri, May 3, 2013 at 9:55 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Digging in further, found this in HttpCommComponent class:

  [code]
    static {
      MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
      mgr.getParams().setDefaultMaxConnectionsPerHost(20);
      mgr.getParams().setMaxTotalConnections(1);
      mgr.getParams().setConnectionTimeout(SearchHandler.connectionTimeout);
      mgr.getParams().setSoTimeout(SearchHandler.soTimeout);
      // mgr.getParams().setStaleCheckingEnabled(false);
      client = new HttpClient(mgr);
    }
  [/code]

  Could the value set by setDefaultMaxConnectionsPerHost(20) be too small for
  80+ shards returning results to the router?

 Dmitry



 On Fri, May 3, 2013 at 6:50 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi, thanks.

 Solr 3.4.
 There is POST request everywhere, between client and router, router and
 shards.

  Do you do faceting across all shards? Approximately how many documents do you have?
 On 2 May 2013 22:02, Patanachai Tangchaisin 
 patanachai.tangchai...@wizecommerce.com wrote:

 Hi,

 First, which version of Solr are you using?

  I also have 60+ shards on Solr 4.2.1 and it doesn't seem to be a problem
  for me.

 - Make sure you use POST to send a query to Solr.
  - 'connection reset by peer' on the client can indicate that there is
  something wrong with the server, e.g. the server closed a connection, etc.

 --
 Patanachai

 On 05/02/2013 05:05 AM, Dmitry Kan wrote:

 After some searching around, I see this:

  http://search-lucene.com/m/ErEZUl7P5f2/%2522socket+write+error%2522&subj=Long+list+of+shards+breaks+solrj+query

  Seems like this has happened in the past with a large number of shards.

 To make it clear: the distributed search works with 20 shards.


 On Thu, May 2, 2013 at 1:57 PM, Dmitry Kan solrexp...@gmail.com
 wrote:

  Hi guys!

 We have solr router and shards. I see this in jetty log on the router:

  May 02, 2013 1:30:22 PM org.apache.commons.httpclient.HttpMethodDirector
  executeWithRetry
 INFO: I/O exception (java.net.SocketException) caught when processing
 request: Connection reset by peer: socket write error

 and then:

  May 02, 2013 1:30:22 PM org.apache.commons.httpclient.HttpMethodDirector
  executeWithRetry
 INFO: Retrying request

 followed by exception about Internal Server Error

 any ideas why this happens?

 We run 80+ shards distributed across several servers. Router runs on
 its
 own node.

 Is there anything in particular I should be looking into wrt ubuntu
 socket
 settings? Is this a known issue for solr's distributed search from the
 past?

 Thanks,
 Dmitry








Re: commit in solr4 takes a longer time

2013-05-03 Thread Shawn Heisey

On 5/3/2013 9:28 AM, vicky desai wrote:

Hi,

When an auto commit operation is fired I am getting the following log:
INFO: start
commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

Setting openSearcher to false definitely gave me a lot of performance
improvement, but I was wondering if waitSearcher can also be set to false,
and whether that would give me a performance gain too.


The openSearcher parameter changes what actually happens when you do a 
hard commit, so using it can change your performance.


The wait parameters are for client software that does commits.  The 
idea is that if you don't want your client to wait for the commit to 
finish, you use these options so that the commit API call will return 
quickly and the server will finish the commit in the background.  It 
doesn't change what the commit does, it just allows the client to start 
doing other things.


With auto commits, the client and the server are both Solr, and 
everything is multi-threaded.  The wait parameters have no meaning, 
because there's no user software that has to wait.  There would be no 
performance gain from turning them off.


Side note: The waitFlush parameter was completely removed in Solr 4.0.

Thanks,
Shawn



Re: The HttpSolrServer add(Collection&lt;SolrInputDocument&gt; docs) method is not atomic.

2013-05-03 Thread Erick Erickson
bq:  Is there a way to commit multiple documents/beans in a
transaction/together in a way that it succeeds completely or fails
completely?

Not that I know of. I've seen various divide-and-conquer strategies
to identify _which_ document failed, but the general process
is usually to re-index the docs in smaller chunks until you
isolate the offending one, trusting that re-indexing documents will
be OK since it overwrites the earlier copy.
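
A sketch of that divide-and-conquer strategy in SolrJ (my illustration:
any failure splits the batch, and re-adding already-indexed documents is
harmless because they overwrite by uniqueKey):

[code]
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIsolator {
  // Recursively halve a failing batch until the offending document is found.
  public static void addWithIsolation(SolrServer server, List<SolrInputDocument> docs) {
    if (docs.isEmpty()) return;
    try {
      server.add(docs);
    } catch (Exception e) {
      if (docs.size() == 1) {
        System.err.println("Bad document: " + docs.get(0).getFieldValue("id"));
        return;
      }
      // A failed add may still have indexed part of the batch (as noted in
      // this thread); retrying both halves is safe since re-adds overwrite.
      int mid = docs.size() / 2;
      addWithIsolation(server, docs.subList(0, mid));
      addWithIsolation(server, docs.subList(mid, docs.size()));
    }
  }
}
[/code]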

Best
Erick

On Thu, May 2, 2013 at 7:53 PM, mark12345 marks1900-pos...@yahoo.com.au wrote:
 One thing I noticed is that while the HttpSolrServer add(SolrInputDocument
 doc) method is atomic (either a bean is added or an exception is thrown),
 the HttpSolrServer add(Collection&lt;SolrInputDocument&gt; docs) method is not
 atomic.

 Question:  Is there a way to commit multiple documents/beans in a
 transaction/together in a way that it succeeds completely or fails
 completely?


 Quick outline of what I did to highlight that a call to the HttpSolrServer
 add(Collection&lt;SolrInputDocument&gt; docs) method is not atomic:
 1.  Created 5 documents, comprising 4 valid documents (documents 1, 2, 4, 5)
 and 1 document with an issue, document 3.
 2.  Called HttpSolrServer add(Collection&lt;SolrInputDocument&gt; docs), which
 threw a SolrException.
 3.  Called HttpSolrServer commit().
 4.  Discovered that 2 out of 5 documents (documents 1 and 2) were still
 committed.




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SolrJ-Solr-Two-Phase-Commit-tp4060399p4060590.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query across multiple shards - key fields have different names

2013-05-03 Thread Erick Erickson
I don't think you can. The problem is that the pseudo-join capability
can work cross-core, which means with two separate cores, but last I
knew distributed joins aren't supported, which is what you're asking
for.

Really think about flattening your data if at all possible.

Best
Erick

On Thu, May 2, 2013 at 11:03 PM, Benjamin Ryan
benjamin.r...@manchester.ac.uk wrote:
 Hi,
   Sorry for the basic question - I can't get to the wiki to find the answer.
   Version: Solr 3.3.0
   I have two separate indexes (currently in two cores, but they can be moved
 to shards).
   One core holds metadata about educational resources; the other holds usage
 statistics.
   They have a common value, named id in one core and search.resourceid in
 the other core.
   How can I construct a shard query (once I have moved one of the cores to a
 different node) so that I can effectively get the statistics for each
 educational resource, grouped by resource?
   This is an offline reporting job that needs to list the usage events for
 educational resources over a time period (the usage events have a date/time
 field).

 Regards,
Ben

 --
 Dr Ben Ryan
 Jorum Technical Manager

 5.12 Roscoe Building
 The University of Manchester
 Oxford Road
 Manchester
 M13 9PL
 Tel: 0160 275 6039
 E-mail: 
 benjamin.r...@manchester.ac.ukhttps://outlook.manchester.ac.uk/owa/redir.aspx?C=b28b5bdd1a91425abf8e32748c93f487URL=mailto%3abenjamin.ryan%40manchester.ac.uk
 --



Re: Duplicated Documents Across shards

2013-05-03 Thread Erick Erickson
What version of Solr? The custom routing stuff is quite new so
I'm guessing 4x?

But this shouldn't be happening. The actual index data for the
shards should be in separate directories, they just happen to
be on the same physical machine.

Try querying each one with distrib=false to see the counts
from single shards, that may shed some light on this. It vaguely
sounds like you have indexed the same document to both shards
somehow...
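
For example, against each core in turn (a sketch; host and core names are
illustrative):

    $ curl "http://localhost:8983/solr/shard2/select?q=*:*&rows=0&distrib=false"

If the per-shard numFound values add up to more than the distributed count,
the same documents are present in more than one shard.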

Best
Erick

On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz
mitxin...@gmail.com wrote:
 Hi,
   We currently have a solrCloud implementation running 5 shards on 3
 physical machines, so the first machine has shard number 1, the
 second machine shards 2 & 4, and the third shards 3 & 5. We noticed that
 while querying, numFound decreased when we increased the start param.
   After some investigation we found that the documents in shards 2 to 5
 were being counted twice. Querying shard 2 will give you back the
 results for shards 2 & 4, and the same thing for shards 3 & 5. Our guess is
 that the physical index for shards 2 & 4 is shared, so the shards don't
 know which part of it is for each one.
   The uniqueKey is correctly defined, and we have tried using a shard prefix
 (shard1!docID).

   Is there any way to solve this problem when a single physical machine
 hosts several shards?
   Is it a real problem, or does it just affect facets & numResults?

 Thanks
Iker

 --
 /** @author imartinez*/
 Person me = *new* Developer();
 me.setName(*Iker Mtz de Apellaniz Anzuola*);
 me.setTwit(@mitxino77 https://twitter.com/mitxino77);
 me.setLocations({St Cugat, Barcelona, Kanpezu, Euskadi, *, World]});
 me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
 me.setWebs({*urbasaabentura.com, ikertxef.com*});
 *return* me;


Re: Solr 4 reload failed core

2013-05-03 Thread Erick Erickson
It seems odd, but consider create rather than reload. Create
will load up an existing core, think of it as create in memory
rather than create on disk for the case where there's already
an index.
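
A sketch of that CoreAdmin call (name and instanceDir are illustrative;
CREATE against an instanceDir that already holds an index just loads it):

    $ curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=core1&instanceDir=core1"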

Best
Erick

On Fri, May 3, 2013 at 6:27 AM, Peter Kirk p...@alpha-solutions.dk wrote:
 Hi

 I have a multi-core installation, with 2 cores. Sometimes, when Solr starts 
 up, one of the cores fails (due to an extension to Solr I have, which is 
 waiting on an external service which has yet to initialise).

 In previous versions of Solr, I could subsequently issue a RELOAD to this 
 core, even though it was in a fail state, and it would reload and start up.
 Now it seems with Solr 4, I cannot issue a RELOAD to a core which has failed.

 Is this the case?

 How can I get Solr to start a core which failed on initial start up?

 Thanks,
 Peter






Re: Delete from Solr Cloud 4.0 index..

2013-05-03 Thread Erick Erickson
Anette:

Be a little careful with the index size savings, they really don't
mean much for _searching_. The stored field compression
significantly reduces the size on disk, but only for the stored
data, which is only accessed when returning the top N docs. In
terms of how many docs you can fit on your hardware, it's pretty
irrelevant.

The *.fdt and *.fdx files in your index directory contain the stored
data, so when looking at the effects of various options (including
compression), you can pretty much ignore these files.
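
For a rough sense of how much of an index is stored data (a sketch; the
path is illustrative):

    $ du -ch /var/solr/data/index/*.fdt /var/solr/data/index/*.fdx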

FWIW,
Erick

On Fri, May 3, 2013 at 2:03 AM, Annette Newton
annette.new...@servicetick.com wrote:
 Thanks Shawn.

 I have played around with soft commits before and didn't see any
 improvement, but with the current load testing I am doing I will give it
 another go.

 I have researched docValues and came across the fact that it would increase
 the index size.  With the upgrade to 4.2.1 the index size has reduced by
 approx 33% which is pleasing and I don't really want to lose that saving.

 We do use the facet.enum method - which works really well - but I will
 verify that we are using it in every instance; we have numerous
 developers working on the product and maybe one or two have slipped
 through.

 Right from the first I upped the zkClientTimeout to 30 as I wanted to give
 extra time for any network blips that we experience on AWS.  We only seem
 to drop communication on a full garbage collection though.

 I am coming to the conclusion that we need to have more shards to cope with
 the writes, so I will play around with adding more shards and see how I go.


 I appreciate you having a look over our setup and the advice.

 Thanks again.

 Netty.


 On 2 May 2013 23:17, Shawn Heisey s...@elyograg.org wrote:

 On 5/2/2013 4:24 AM, Annette Newton wrote:
  Hi Shawn,
 
  Thanks so much for your response.  We basically are very write intensive
  and write throughput is pretty essential to our product.  Reads are
  sporadic and actually is functioning really well.
 
  We write on average (at the moment) 8-12 batches of 35 documents per
  minute.  But we really will be looking to write more in the future, so
 need
  to work out scaling of solr and how to cope with more volume.
 
  Schema (I have changed the names) :
 
  http://pastebin.com/x1ry7ieW
 
  Config:
 
  http://pastebin.com/pqjTCa7L

 This is very clean.  There's probably more you could remove/comment, but
 generally speaking I couldn't find any glaring issues.  In particular,
 you have disabled autowarming, which is a major contributor to commit
 speed problems.

 The first thing I think I'd try is increasing zkClientTimeout to 30 or
 60 seconds.  You can use the startup commandline or solr.xml, I would
 probably use the latter.  Here's a solr.xml fragment that uses a system
 property or a 15 second default:

 <?xml version="1.0" encoding="UTF-8" ?>
 <solr persistent="true" sharedLib="lib">
   <cores adminPath="/admin/cores"
          zkClientTimeout="${zkClientTimeout:15000}" hostPort="${jetty.port:}"
          hostContext="solr">

 General thoughts, these changes might not help this particular issue:
 You've got autoCommit with openSearcher=true.  This is a hard commit.
 If it were me, I would set that up with openSearcher=false and either do
 explicit soft commits from my application or set up autoSoftCommit with
 a shorter timeframe than autoCommit.
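
 As a sketch, that combination in solrconfig.xml would look like this (the
 max times are illustrative):

   <autoCommit>
     <maxTime>60000</maxTime>
     <openSearcher>false</openSearcher>
   </autoCommit>
   <autoSoftCommit>
     <maxTime>5000</maxTime>
   </autoSoftCommit>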

 This might simply be a scaling issue, where you'll need to spread the
 load wider than four shards.  I know that there are financial
 considerations with that, and they might not be small, so let's leave
 that alone for now.

 The memory problems might be a symptom/cause of the scaling issue I just
 mentioned.  You said you're using facets, which can be a real memory hog
 even with only a few of them.  Have you tried facet.method=enum to see
 how it performs?  You'd need to switch to it exclusively, never go with
 the default of fc.  You could put that in the defaults or invariants
 section of your request handler(s).
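
 For example (a sketch; the handler name is illustrative):

   <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="invariants">
       <str name="facet.method">enum</str>
     </lst>
   </requestHandler>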

 Another way to reduce memory usage for facets is to use disk-based
 docValues on version 4.2 or later for the facet fields, but this will
 increase your index size, and your index is already quite large.
 Depending on your index contents, the increase may be small or large.

 Something to just mention: It looks like your solrconfig.xml has
 hard-coded absolute paths for dataDir and updateLog.  This is fine if
 you'll only ever have one core/collection on each server, but it'll be a
 disaster if you have multiples.  I could be wrong about how these get
 interpreted in SolrCloud -- they might actually be relative despite
 starting with a slash.

 Thanks,
 Shawn




 --

 Annette Newton

 Database Administrator

 ServiceTick Ltd



 T:+44(0)1603 618326



 Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ

 www.servicetick.com

 *www.sessioncam.com*


Re: Configure Shingle Filter to ignore ngrams made of tokens with same start and end

2013-05-03 Thread Jack Krupansky
In short, no. I don't think you want to use the shingle filter on a token 
stream that has multiple tokens at the same position, otherwise, you will 
get confused suggestions, as you've encountered.
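
For reference, the kind of chain described below - a word delimiter that
keeps the original token plus a shingle filter - looks roughly like this
(a sketch; the tokenizer and parameter values are illustrative):

    <fieldType name="suggest_shingle" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
                generateWordParts="1" catenateWords="1"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"/>
      </analyzer>
    </fieldType>

Because womens is emitted at the same position as women's, the shingle
filter ends up combining the two, which is the confusion described above.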


-- Jack Krupansky

-Original Message- 
From: Rounak Jain

Sent: Friday, May 03, 2013 7:34 AM
To: solr-user@lucene.apache.org
Subject: Configure Shingle Filter to ignore ngrams made of tokens with same 
start and end


Hello,

I was using the Shingle Filter with the Suggester to implement an autosuggest
dropdown. The field I'm using with the shingle filter has a WordDelimiterFilter with
preserveOriginal=1 to tokenize women's as women's and womens.

Because of this, when the shingle filter generates word ngrams, apart from
the expected tokens, there's also a women's womens token. I wanted to
know if there's any way to configure ShingleFilter so that it ignores
tokens with the same start and end values.

Thanks,
Rounak 



SV: Solr 4 reload failed core

2013-05-03 Thread Peter Kirk
Thanks - I had just found the CREATE command, and I think that's the easiest 
path for us to take. It will essentially function the way our reload 
workaround works now.



Fra: Erick Erickson [erickerick...@gmail.com]
Sendt: 3. maj 2013 19:22
Til: solr-user@lucene.apache.org
Emne: Re: Solr 4 reload failed core

It seems odd, but consider create rather than reload. Create
will load up an existing core, think of it as create in memory
rather than create on disk for the case where there's already
an index.

Best
Erick

On Fri, May 3, 2013 at 6:27 AM, Peter Kirk p...@alpha-solutions.dk wrote:
 Hi

 I have a multi-core installation, with 2 cores. Sometimes, when Solr starts 
 up, one of the cores fails (due to an extension to Solr I have, which is 
 waiting on an external service which has yet to initialise).

 In previous versions of Solr, I could subsequently issue a RELOAD to this 
 core, even though it was in a fail state, and it would reload and start up.
 Now it seems with Solr 4, I cannot issue a RELOAD to a core which has failed.

 Is this the case?

 How can I get Solr to start a core which failed on initial start up?

 Thanks,
 Peter






Re: Configure Shingle Filter to ignore ngrams made of tokens with same start and end

2013-05-03 Thread Walter Underwood
The shingle filter should respect positions. If it doesn't, that is worth 
filing a bug so we know about it.

wunder

On May 3, 2013, at 10:50 AM, Jack Krupansky wrote:

 In short, no. I don't think you want to use the shingle filter on a token 
 stream that has multiple tokens at the same position, otherwise, you will get 
 confused suggestions, as you've encountered.
 
 -- Jack Krupansky
 
 -Original Message- From: Rounak Jain
 Sent: Friday, May 03, 2013 7:34 AM
 To: solr-user@lucene.apache.org
 Subject: Configure Shingle Filter to ignore ngrams made of tokens with same 
 start and end
 
 Hello,
 
 I was using the Shingle Filter with the Suggester to implement an autosuggest
 dropdown. The field I'm using with the shingle filter has a WordDelimiterFilter with
 preserveOriginal=1 to tokenize women's as women's and womens.
 
 Because of this, when the shingle filter generates word ngrams, apart from
 the expected tokens, there's also a women's womens token. I wanted to
 know if there's any way to configure ShingleFilter so that it ignores
 tokens with the same start and end values.
 
 Thanks,
 Rounak 






custom tokenizer error

2013-05-03 Thread Sarita Nair
I am using a custom Tokenizer, as part of an analysis chain, for a Solr (4.2.1) 
field. On trying to index, Solr throws a NullPointerException. 
The unit tests for the custom tokenizer work fine. Any ideas as to what 
I am missing/doing incorrectly will be appreciated.

Here is the relevant schema.xml excerpt:

    <fieldType name="negated" class="solr.TextField" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="some.other.solr.analysis.EmbeddedPunctuationTokenizer$Factory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
      </analyzer>
    </fieldType>

Here are the relevant pieces of the Tokenizer:

    /**
     * Intercepts each token produced by {@link StandardTokenizer#incrementToken()}
     * and checks for the presence of a colon or period. If found, splits the token
     * on the punctuation mark and adjusts the term and offset attributes of the
     * underlying {@link TokenStream} to create additional tokens.
     */
    public class EmbeddedPunctuationTokenizer extends Tokenizer {

        private static final Pattern PUNCTUATION_SYMBOLS = Pattern.compile("[:.]");

        private StandardTokenizer baseTokenizer;

        private CharTermAttribute termAttr;

        private OffsetAttribute offsetAttr;

        private /*@Nullable*/ String tokenAfterPunctuation = null;

        private int currentOffset = 0;

        public EmbeddedPunctuationTokenizer(final Reader reader) {
            super(reader);
            baseTokenizer = new StandardTokenizer(Version.MINIMUM_LUCENE_VERSION, reader);
            // Two TokenStreams are in play here: the one underlying the current
            // instance and the one underlying the StandardTokenizer. The attribute
            // instances must be associated with both.
            termAttr = baseTokenizer.addAttribute(CharTermAttribute.class);
            offsetAttr = baseTokenizer.addAttribute(OffsetAttribute.class);
            this.addAttributeImpl((CharTermAttributeImpl) termAttr);
            this.addAttributeImpl((OffsetAttributeImpl) offsetAttr);
        }

        @Override
        public void end() throws IOException {
            baseTokenizer.end();
            super.end();
        }

        @Override
        public void close() throws IOException {
            baseTokenizer.close();
            super.close();
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            baseTokenizer.reset();
            currentOffset = 0;
            tokenAfterPunctuation = null;
        }

        @Override
        public final boolean incrementToken() throws IOException {
            clearAttributes();
            if (tokenAfterPunctuation != null) {
                // Do not advance the underlying TokenStream if the previous call
                // found an embedded punctuation mark and set aside the substring
                // that follows it. Set the attributes instead from the substring,
                // bearing in mind that the substring could contain more embedded
                // punctuation marks.
                adjustAttributes(tokenAfterPunctuation);
            } else if (baseTokenizer.incrementToken()) {
                // No remaining substring from a token with embedded punctuation:
                // save the starting offset reported by the base tokenizer as the
                // current offset, then proceed with the analysis of the token it
                // returned.
                currentOffset = offsetAttr.startOffset();
                adjustAttributes(termAttr.toString());
            } else {
                // No more tokens in the underlying token stream: return false.
                return false;
            }
            return true;
        }

        private void adjustAttributes(final String token) {
            Matcher m = PUNCTUATION_SYMBOLS.matcher(token);
            if (m.find()) {
                int index = m.start();
                offsetAttr.setOffset(currentOffset, currentOffset + index);
                termAttr.copyBuffer(token.toCharArray(), 0, index);
                tokenAfterPunctuation = token.substring(index + 1);
                // Given that the incoming token had an embedded punctuation mark,
                // the starting offset for the substring following the punctuation
                // mark will be 1 beyond the end of the current token, which is the
                // substring preceding the embedded punctuation mark.
                currentOffset = offsetAttr.endOffset() + 1;
            } else if (tokenAfterPunctuation != null) {
                // Last remaining substring following a previously detected embedded
                // punctuation mark: adjust attributes based on its values.
                int length = tokenAfterPunctuation.length();
                termAttr.copyBuffer(tokenAfterPunctuation.toCharArray(), 0, length);
                offsetAttr.setOffset(currentOffset, currentOffset + length);
                tokenAfterPunctuation = null;
            }
            // Implied else: neither is true, so the attributes from the base
            // tokenizer need no adjustments.
        }
    }

Solr throws the following error, in the 'else if' block of #incrementToken

    2013-04-29 14:19:48,920 [http-thread-pool-8080(3)] ERROR org.apache.solr.core.SolrCore - java.lang.NullPointerException
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
        at some.other.solr.analysis.EmbeddedPunctuationTokenizer.incrementToken(EmbeddedPunctuationTokenizer.java:83)
        at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)

Re: transientCacheSize doesn't seem to have any effect, except on startup

2013-05-03 Thread didier deshommes
On Fri, May 3, 2013 at 11:18 AM, Erick Erickson erickerick...@gmail.comwrote:

 The cores aren't loaded (or at least shouldn't be) for getting the status.
 The _names_ of the cores should be returned, but those are (supposed) to be
 retrieved from a list rather than loaded cores. So are you sure that's not
 what
 you are seeing? How are you determining whether the cores are actually
 loaded
 or not?


I'm looking at the output of :

$ curl "http://localhost:8983/solr/admin/cores?wt=json&action=status"

cores that are loaded have a startTime and upTime value. Cores that are
unloaded don't appear in the output at all. For example, I created 3
transient cores with transientCacheSize=2. When I asked for a list of
all cores, all 3 cores were returned. I explicitly unloaded 1 core and got
back 2 cores when I asked for the list again.

It would be nice if cores had an isTransient and an isCurrentlyLoaded
value so that one could see exactly which cores are loaded.
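
For context, the solr.xml declaration behind this kind of test looks
roughly like this (a sketch; attribute values are illustrative):

    <cores adminPath="/admin/cores" transientCacheSize="2">
      <core name="new_core2" instanceDir="collection1/"
            transient="true" loadOnStartup="false"/>
    </cores>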




 That said, it's perfectly possible that the status command is doing
 something we
 didn't anticipate, but I took a quick look at the code (got to rush to a
 plane)
 and CoreAdminHandler _appears_ to be just returning whatever info it can
 about an unloaded core for status. I _think_ you'll get more info if the
 core has ever been loaded though, even though if it's been removed from
 the transient cache. Ditto for the create action.

 So let's figure out whether you're really seeing loaded cores or not, and
 then
 raise a JIRA if so...

 Thanks for reporting!
 Erick

 On Thu, May 2, 2013 at 1:27 PM, didier deshommes dfdes...@gmail.com
 wrote:
  Hi,
  I've been very interested in the transient core feature of solr to
 manage a
  large number of cores. I'm especially interested in this use case, that
 the
  wiki lists at http://wiki.apache.org/solr/LotsOfCores (looks to be down
  now):
 
 loadOnStartup=false transient=true: This is really the use-case. There
 are
  a large number of cores in your system that are short-duration use. You
  want Solr to load them as necessary, but unload them when the cache gets
  full on an LRU basis.
 
  I'm creating 10 transient cores via the core admin like so:
 
  $ curl "http://localhost:8983/solr/admin/cores?wt=json&action=CREATE&name=new_core2&instanceDir=collection1/&dataDir=new_core2&transient=true&loadOnStartup=false"
 
  and have transientCacheSize=2 in my solr.xml file, which I take to mean I
  should have at most 2 transient cores loaded at any time. The problem is
  that these cores are still loaded when I ask solr to list cores:
 
  $ curl "http://localhost:8983/solr/admin/cores?wt=json&action=status"
 
  From the explanation in the wiki, it looks like solr would manage loading
  and unloading transient cores for me without having to worry about them,
  but this is not what's happening.
 
  The situation is different when I restart solr; it does the right thing
  by loading the maximum cores set by transientCacheSize. When I add more
  cores, the old behavior happens again, where all created transient cores
  are loaded in solr.
 
  I'm using the development branch lucene_solr_4_3 to run my example. I can
  open a jira if need be.



disaster recovery scenarios for solr cloud and zookeeper

2013-05-03 Thread Dennis Haller
Hi,

Solr 4.x is architected with a dependency on ZooKeeper, and ZooKeeper is
expected to have very high (perfect?) availability. With 3 or 5 ZooKeeper
nodes, it is possible to manage ZooKeeper maintenance and online
availability to be close to 100%. But what is the worst case for Solr if,
for some unanticipated reason, all ZooKeeper nodes go offline?

Could someone comment on a couple of possible scenarios for which all ZK
nodes are offline. What would happen to Solr and what would be needed to
recover in each case?
1) brief interruption, say 2 minutes,
2) longer downtime, say 60 min

Thanks
Dennis


Re: Duplicated Documents Across shards

2013-05-03 Thread Iker Mtnz. Apellaniz
We are currently using version 4.2.
We have made tests with a single document, and it gives us a count of 2
documents. But if we force it to shard onto the first machine, the one with a
single shard, the count gives us 1 document.
I've tried using the distrib=false parameter; it gives us no duplicate
documents, but the same document appears to be in two different shards.

Finally, about the separate directories: we have only one directory for the
data in each physical machine and collection, and I don't see any subfolder
for the different shards.

Is it possible that we have something wrong with the dataDir configuration
to use multiple shards on one machine?

<dataDir>${solr.data.dir:}</dataDir>
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
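
If the cores on one machine really are sharing ${solr.data.dir:}, giving
each core its own dataDir in solr.xml should separate the indexes - a
sketch (core names and paths are hypothetical):

    <cores adminPath="/admin/cores">
      <core name="shard2" instanceDir="shard2/" dataDir="/var/solr/shard2/data"/>
      <core name="shard4" instanceDir="shard4/" dataDir="/var/solr/shard4/data"/>
    </cores>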



2013/5/3 Erick Erickson erickerick...@gmail.com

 What version of Solr? The custom routing stuff is quite new so
 I'm guessing 4x?

 But this shouldn't be happening. The actual index data for the
 shards should be in separate directories, they just happen to
 be on the same physical machine.

 Try querying each one with distrib=false to see the counts
 from single shards, that may shed some light on this. It vaguely
 sounds like you have indexed the same document to both shards
 somehow...

 Best
 Erick

 On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz
 mitxin...@gmail.com wrote:
  Hi,
  We currently have a solrCloud implementation running 5 shards on 3
  physical machines, so the first machine has shard number 1, the
  second machine shards 2 & 4, and the third shards 3 & 5. We noticed that
  while querying, numFound decreased when we increased the start param.
  After some investigation we found that the documents in shards 2 to 5
  were being counted twice. Querying shard 2 will give you back the
  results for shards 2 & 4, and the same thing for shards 3 & 5. Our guess is
  that the physical index for shards 2 & 4 is shared, so the shards don't
  know which part of it is for each one.
  The uniqueKey is correctly defined, and we have tried using a shard prefix
  (shard1!docID).

  Is there any way to solve this problem when a single physical machine
  hosts several shards?
  Is it a real problem, or does it just affect facets & numResults?
 
  Thanks
 Iker
 
  --
  /** @author imartinez*/
  Person me = *new* Developer();
  me.setName(*Iker Mtz de Apellaniz Anzuola*);
  me.setTwit(@mitxino77 https://twitter.com/mitxino77);
  me.setLocations({St Cugat, Barcelona, Kanpezu, Euskadi, *,
 World]});
  me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
  me.setWebs({*urbasaabentura.com, ikertxef.com*});
  *return* me;




-- 
/** @author imartinez*/
Person me = *new* Developer();
me.setName(*Iker Mtz de Apellaniz Anzuola*);
me.setTwit(@mitxino77 https://twitter.com/mitxino77);
me.setLocations({St Cugat, Barcelona, Kanpezu, Euskadi, *, World]});
me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
*return* me;


Re: Configure Shingle Filter to ignore ngrams made of tokens with same start and end

2013-05-03 Thread Steve Rowe
An issue exists for this problem: 
https://issues.apache.org/jira/browse/LUCENE-3475

On May 3, 2013, at 11:00 AM, Walter Underwood wun...@wunderwood.org wrote:

 The shingle filter should respect positions. If it doesn't, that is worth 
 filing a bug so we know about it.
 
 wunder
 
 On May 3, 2013, at 10:50 AM, Jack Krupansky wrote:
 
 In short, no. I don't think you want to use the shingle filter on a token 
 stream that has multiple tokens at the same position, otherwise, you will 
 get confused suggestions, as you've encountered.
 
 -- Jack Krupansky
 
 -Original Message- From: Rounak Jain
 Sent: Friday, May 03, 2013 7:34 AM
 To: solr-user@lucene.apache.org
 Subject: Configure Shingle Filter to ignore ngrams made of tokens with same 
 start and end
 
 Hello,
 
 I was using the Shingle Filter with the Suggester to implement an autosuggest
 dropdown. The field I'm using with the shingle filter has a WordDelimiterFilter with
 preserveOriginal=1 to tokenize women's as women's and womens.
 
 Because of this, when the shingle filter generates word ngrams, apart from
 the expected tokens, there's also a women's womens token. I wanted to
 know if there's any way to configure ShingleFilter so that it ignores
 tokens with the same start and end values.
 
 Thanks,
 Rounak 
 
 
 
 



Re: disaster recovery scenarios for solr cloud and zookeeper

2013-05-03 Thread Otis Gospodnetic
I *think* at this point SolrCloud without ZooKeeper is like a...
body without a head?

Otis
--
Solr  ElasticSearch Support
http://sematext.com/





On Fri, May 3, 2013 at 3:21 PM, Dennis Haller dhal...@talenttech.com wrote:
 Hi,

 Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is
 expected to have a very high (perfect?) availability. With 3 or 5 zookeeper
 nodes, it is possible to manage zookeeper maintenance and online
 availability to be close to 100%. But what is the worst case for Solr if
 for some unanticipated reason all Zookeeper nodes go offline?

 Could someone comment on a couple of possible scenarios for which all ZK
 nodes are offline. What would happen to Solr and what would be needed to
 recover in each case?
 1) brief interruption, say 2 minutes,
 2) longer downtime, say 60 min

 Thanks
 Dennis


Re: disaster recovery scenarios for solr cloud and zookeeper

2013-05-03 Thread Walter Underwood
Ideally, the Solr nodes should be able to continue as long as no node fails. 
Failure of a leader would be bad; failure of non-leader replicas might cause 
some timeouts, but could be survivable.

Of course, nodes could not be added.

wunder

On May 3, 2013, at 5:05 PM, Otis Gospodnetic wrote:

 I *think* at this point SolrCloud without ZooKeeper is like a...
 body without a head?
 
 Otis
 --
 Solr  ElasticSearch Support
 http://sematext.com/
 
 
 
 
 
 On Fri, May 3, 2013 at 3:21 PM, Dennis Haller dhal...@talenttech.com wrote:
 Hi,
 
 Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is
 expected to have a very high (perfect?) availability. With 3 or 5 zookeeper
 nodes, it is possible to manage zookeeper maintenance and online
 availability to be close to 100%. But what is the worst case for Solr if
 for some unanticipated reason all Zookeeper nodes go offline?
 
 Could someone comment on a couple of possible scenarios for which all ZK
 nodes are offline. What would happen to Solr and what would be needed to
 recover in each case?
 1) brief interruption, say 2 minutes,
 2) longer downtime, say 60 min
 
 Thanks
 Dennis

--
Walter Underwood
wun...@wunderwood.org





Re: disaster recovery scenarios for solr cloud and zookeeper

2013-05-03 Thread Shawn Heisey
On 5/3/2013 6:07 PM, Walter Underwood wrote:
 Ideally, the Solr nodes should be able to continue as long as no node fails. 
 Failure of a leader would be bad, failure of non-leader replicas might cause 
 some timeouts, but could be survivable.
 
 Of course, nodes could not be added.

I have read a few things that say things go read-only when the zookeeper
ensemble loses quorum.  I'm not sure whether that means that Solr goes
read-only or zookeeper goes read-only.  I would be interested in knowing
exactly what happens when zookeeper loses quorum as well as what happens
if all three (or more) zookeeper nodes in the ensemble go away entirely.

I have a SolrCloud I can experiment with, but I need to find a
maintenance window for testing, so I can't check right now.

Thanks,
Shawn



Re: disaster recovery scenarios for solr cloud and zookeeper

2013-05-03 Thread Anshum Gupta
In case all your ZK nodes go down, querying would continue to work fine (as 
long as no nodes fail) but you'd not be able to add docs.

Sent from my iPhone

On 03-May-2013, at 17:52, Shawn Heisey s...@elyograg.org wrote:

 On 5/3/2013 6:07 PM, Walter Underwood wrote:
 Ideally, the Solr nodes should be able to continue as long as no node fails. 
 Failure of a leader would be bad, failure of non-leader replicas might cause 
 some timeouts, but could be survivable.
 
 Of course, nodes could not be added.
 
 I have read a few things that say things go read only when the zookeeper
 ensemble loses quorum.  I'm not sure whether that means that Solr goes
 read only or zookeeper goes read only.  I would be interested in knowing
 exactly what happens when zookeeper loses quorum as well as what happens
 if all three (or more) zookeeper nodes in the ensemble go away entirely.
 
 I have a SolrCloud I can experiment with, but I need to find a
 maintenance window for testing, so I can't check right now.
 
 Thanks,
 Shawn
 


Re: disaster recovery scenarios for solr cloud and zookeeper

2013-05-03 Thread Gopal Patwa
Agreed with Anshum, and Netflix has a very nice supervisor system for
ZooKeeper: if the nodes go down it will restart them automatically.

http://techblog.netflix.com/2012/04/introducing-exhibitor-supervisor-system.html
https://github.com/Netflix/exhibitor




On Fri, May 3, 2013 at 6:53 PM, Anshum Gupta ans...@anshumgupta.net wrote:

 In case all your ZK nodes go down, querying would continue to work
 fine (as long as no nodes fail) but you'd not be able to add docs.

 Sent from my iPhone

 On 03-May-2013, at 17:52, Shawn Heisey s...@elyograg.org wrote:

  On 5/3/2013 6:07 PM, Walter Underwood wrote:
  Ideally, the Solr nodes should be able to continue as long as no node
 fails. Failure of a leader would be bad, failure of non-leader replicas
 might cause some timeouts, but could be survivable.
 
  Of course, nodes could not be added.
 
  I have read a few things that say things go read only when the zookeeper
  ensemble loses quorum.  I'm not sure whether that means that Solr goes
  read only or zookeeper goes read only.  I would be interested in knowing
  exactly what happens when zookeeper loses quorum as well as what happens
  if all three (or more) zookeeper nodes in the ensemble go away entirely.
 
  I have a SolrCloud I can experiment with, but I need to find a
  maintenance window for testing, so I can't check right now.
 
  Thanks,
  Shawn