replicate indexing to second site

2016-02-09 Thread tedsolr
I have a Solr Cloud cluster (v5.2.1) using a Zookeeper ensemble in my primary
data center. I am now trying to plan for disaster recovery with an available
warm site. I have read (many times) the disaster recovery section in the
Apache ref guide. I suppose I don't fully understand it.

What I'd like to know is the best way to sync up the existing data, and the
best way to keep that data in sync. Assume that the warm site is an exact
copy (not at the network level) of the production cluster - so the same
servers with the same config. All servers are virtual. The use case is the
active cluster goes down and cannot be repaired, so the warm site would
become the active site. This is a manual process that takes many hours to
accomplish (I just need to fit Solr into this existing process, I can't
change the process :).

I expect that rsync can be used initially to copy the collection data
folders and the zookeeper data and transaction log folders. So after
verifying Solr/ZK is functional after the install, shut it down and perform
the copy. This may sound slow but my production index size is < 100GB. Is
this approach reasonable?

So now to keep the warm site in sync, I could use rsync on a scheduled basis
but I assume there's a better way. The ref guide says to send all indexing
requests to the second cluster at the same time they are sent to the active
cluster. I use SolrJ for all requests. So would this entail using a second
CloudSolrClient instance that only knows about the second cluster? Seems
reasonable but I don't want to lengthen the response time for the users. Is
this just a software problem to work out (separate thread)? Or is there a
SolrJ solution (async calls)?

Thanks!!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/replicate-indexing-to-second-site-tp4256240.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr architecture

2016-02-09 Thread Upayavira
Bear in mind that Lucene is optimised for high-read, low-write workloads.
That is, it puts in a lot of effort at write time to make reading
efficient. It sounds like you are going to be doing far more writing
than reading, and I wonder whether you are necessarily choosing the
right tool for the job.

How would you later use this data, and what advantage is there to
storing it in Solr?

Upayavira

On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:
> Hi,
> Thanks for all your suggestions. I took some time to get the details to
> be
> more accurate. Please find what I have gathered:-
> 
> My data being indexed is something like this.
> I am basically capturing all data related to a user session.
> Inside a session I have categorized my actions like actionA, actionB
> etc..,
> per page.
> So each time an action pertaining to say actionA or actionB etc.. (in
> each
> page) happens, it is updated in Solr under that session (sessionId).
> 
> So in short there is only one doc pertaining to a single session
> (identified by sessionid) in my Solr index and that is retrieved and
> updated
> whenever a new action under that session occurs.
> We expect up to 4Million sessions per day.
> 
> On an average *one session's* *doc has a size* of *3MB to 20MB*.
> So if it is *4Million sessions per day*, each session writing around *500
> times to Solr*, it is* 2Billion writes or (indexing) per day to Solr*.
> As it is one doc per session, it is *4Million docs per day*.
> This is around *80K docs indexed per second* during *peak* hours and
> around *15K
> docs indexed per second* into Solr during* non-peak* hours.
> Number of queries per second is around *320 queries per second*.
> 
> 
> 1. Average size of a doc
>  3MB to 20MB
> 2. Query types:-
>  Until that session is in progress, whatever data is there for that
> session so far is queried and the new action's details captured and
> appended to existing data already captured related to that session
> and indexed back into Solr. So, longer the session the more data
> retrieved
> for each subsequent query to get current data captured for that session.
>  Also querying can be done on timestamp etc... which is captured
>  along
> with each action.
> 3. Are docs grouped somehow?
>  All data related to a session are retrieved from Solr, updated and
> indexed back to Solr based on sessionId. No other grouping.
> 4. Are they time sensitive (NRT or offline process does this)
>  As mentioned above this is in NRT. Each time a new user action in
>  that
> session happens, we need to query existing session info already captured
> related to that session and append this new data to this existing
> info retrieved and index it back to Solr.
> 5. Will they update or it is rebuild every time, etc.
>  Each time a new user action occurs, the full data pertaining to that
> session so far captured is retrieved from Solr, the extra latest data
> pertaining to this new action is appended  and indexed  back to Solr.
> 6. And the other thing you haven't told us is whether you plan on
> _adding_
> 2B docs a day or whether that number is the total corpus size and you are
> re-indexing the 2B docs/day. IOW, if you are  adding 2B docs/day, 30 days
> later do you have 2B docs or 60B docs in your
>corpus?
>We are expecting around 4 million sessions per day (per session 500
> writes to Solr), which turns out to be 2B indexing done per day. So after
> 30 days it would be 4Million*30 docs in the index.
> 7. Is there any aging of docs
>  No we always query against the whole corpus present.
> 8. Is any doc deleted?
>  No all data remains in the index
> 
> Any suggestion is very welcome!
> 
> Thanks!
> Mark.
> 
> 
> On Mon, Feb 8, 2016 at 3:30 PM, Jack Krupansky 
> wrote:
> 
> > Oops... at 100 qps for a single node you would need 120 nodes to get to 12K
> > qps and 800 nodes to get 80K qps, but that is just an extremely rough
> > ballpark estimate, not some precise and firm number. And that's if all the
> > queries can be evenly distributed throughout the cluster and don't require
> > fanout to other shards, which effectively turns each incoming query into n
> > queries where n is the number of shards.
> >
> > -- Jack Krupansky
> >
> > On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky 
> > wrote:
> >
> > > So is there any aging or TTL (in database terminology) of older docs?
> > >
> > > And do all of your queries need to query all of the older documents all
> > of
> > > the time or is there a clear hierarchy of querying for aged documents,
> > like
> > > past 24-hours vs. past week vs. past year vs. older than a year? Sure,
> > you
> > > can always use a function query to boost by the inverse of document age,
> > > but Solr would be more efficient with filter queries or separate indexes
> > > for different time scales.
> > >
> > > Are documents ever updated or are they write-once?
> > >
> > > 

Re: Solr 4.10 with Jetty 8.1.10 & Tomcat 7

2016-02-09 Thread Susheel Kumar
Shahzad - As Shawn mentioned, you can get a lot of input from the folks who
are using joins in SolrCloud if you start a new thread, and I would suggest
taking a look at Solr Streaming Expressions and the Parallel SQL Interface,
which cover joining use cases as well.

Thanks,
Susheel

On Tue, Feb 9, 2016 at 9:17 AM, Shawn Heisey  wrote:

> On 2/8/2016 10:10 PM, Shahzad Masud wrote:
> > Due to distributed search feature, I might not be able to run
> > SolrCloud. I would appreciate, if you please share that way of setting
> > solr home for a specific context in Jetty-Solr. Its good to seek more
> > information for comparison purposes. Do you think having multiple JVMs
> > would increase or decrease performance. My document base is around 20
> > million rows (in 24 shards), with document size ranging from 100KB -
> > 400 MB. SM
>
> For most people, the *entire point* of running SolrCloud is to do
> distributed search, so to hear that you can't run SolrCloud because of
> distributed search is very confusing to me.
>
> I admit to ignorance when it comes to the join feature in Solr ... but
> it is my understanding that all you need to make joins work properly is
> to have both of the indexes that you are joining running in the same JVM
> and the same Solr instance.  If you arrange your SolrCloud replicas so a
> copy of every index is loaded on every server, I think that would
> satisfy this requirement.  I may be wrong, but I believe there are
> SolrCloud users that use the join feature.
>
> When you create a config file for a Solr context, whether it's Jetty,
> Tomcat, or some other container, you can set the solr/home JNDI variable
> in the context fragment to set the solr home for that context.  I found
> a specific example for Tomcat.  I know Jetty can do the same, but I do
> not know how to actually create the context fragment.
>
>
> https://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat
>
> I need to reiterate one point again.  You should only run one Solr
> container per server, with exactly one Solr context installed in that
> server.  This is recommended whether you're running SolrCloud or not,
> and whether you're using distributed search or not.  One Solr context
> can handle a LOT of indexes.
>
> Running multiple Solr instances per server is only recommended in one
> case:  Extremely large indexes where you would need a very large heap.
> Running two JVMs with smaller heaps *might* be more efficient ... but in
> that case, it is usually better to split those indexes between two
> separate servers, each one running only one instance of Solr.
>
> Thanks,
> Shawn
>
>


Re: online scoring explanation

2016-02-09 Thread John Blythe
that's it!

and doug is the one from back in the day :)

thanks guys

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, Feb 8, 2016 at 3:08 PM, Toke Eskildsen 
wrote:

> John Blythe  wrote:
> > last year i had gotten a site recommended to me on this forum. it helped
> > you break down the results/score you were getting from your queries.
>
> http://splainer.io/ perhaps?
>
> - Toke Eskildsen
>


Re: online scoring explanation

2016-02-09 Thread Vincenzo D'Amore
Hi,

I made a Chrome extension:

https://chrome.google.com/webstore/detail/solr-query-debugger/gmpkeiamnmccifccnbfljffkcnacmmdl


Hope this helps,
Vincenzo


On Tue, Feb 9, 2016 at 11:39 AM, John Blythe  wrote:

> that's it!
>
> and doug is the one from back in the day :)
>
> thanks guys
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Mon, Feb 8, 2016 at 3:08 PM, Toke Eskildsen 
> wrote:
>
> > John Blythe  wrote:
> > > last year i had gotten a site recommended to me on this forum. it
> helped
> > > you break down the results/score you were getting from your queries.
> >
> > http://splainer.io/ perhaps?
> >
> > - Toke Eskildsen
> >
>



-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251


Re: Exporting Score value from export handler

2016-02-09 Thread Akiel Ahmed
Hi Joel,

I saw your response this morning, and have created an issue, SOLR-8664, 
and linked it to SOLR-8125. As context, I included my original question 
and your answer, as a comment.

Cheers

Akiel



From:   Joel Bernstein 
To: solr-user@lucene.apache.org
Date:   29/01/2016 13:46
Subject: Re: Exporting Score value from export handler



Exporting scores would be a great feature to have. I don't believe it will
add too much complexity to export and sort by score. The main consideration
has been memory consumption for very large export sets. The export feature
powers SQL queries that are unlimited in Solr 6. So adding scores to export
would support queries like:

select id, title, score from tableX where a = '(a query)'

Where currently you can only do this:

select id, title, score from tableX where a = '(a query)' limit 1000

Can you create a jira for this and link it to SOLR-8125.




Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jan 29, 2016 at 8:26 AM, Akiel Ahmed  wrote:

> Hi,
>
> I would like to issue a query and get the ID and Score for each matching
> document. There may be lots of results so I wanted to use the export
> handler, but unfortunately the current version of Solr doesn't seem to
> export the Score - I read the comments on
> https://issues.apache.org/jira/browse/SOLR-5244 (Exporting Full Sorted
> Result Sets) but am not sure what happened with the idea of exporting the
> Score. Does anybody know of an existing or future version where this can
> be done?
>
> I compared exporting 100,000 IDs via the export handler with getting
> 100,000 ID,Score pairs using the cursor mark - exporting 100,000 IDs was
> an order of magnitude faster on my laptop. Does anybody know of a faster
> way to retrieve the ID,Score pairs for a query on a SolrCloud deployment
> and/or have an idea on the possible performance characteristics of
> exporting ID, Score (without ranking) if it was to be implemented?
>
> Cheers
>
> Akiel
>




RE: Custom JSON facet functions

2016-02-09 Thread Markus Jelsma
Nice! Are the aggregations also going to be pluggable? Reading the ticket, I
would assume it is going to be pluggable.

Thanks,
Markus 
 
-Original message-
> From:Yonik Seeley 
> Sent: Tuesday 9th February 2016 15:25
> To: solr-user@lucene.apache.org
> Subject: Re: Custom JSON facet functions
> 
> On Tue, Feb 9, 2016 at 7:10 AM, Markus Jelsma
>  wrote:
> > Hi - i must have missing something but is it possible to declare custom 
> > JSON facet functions in solrconfig.xml? Just like we would do with request 
> > handlers or  search components?
> 
> Yes, but it will probably change:
> https://issues.apache.org/jira/browse/SOLR-7447
> 
> So currently, you would register a facet function just like a custom
> function (value source),
> but just put "_agg" at the end of the name and implement AggValueSource.
> So for example, the "sum" facet function is registered as "sum_agg"
> and implements AggValueSource (the class is SumAgg)
> 
> So if you utilize this, just realize that the mechanism and interfaces
> are experimental and subject to change (and probably will change at
> some point for this in particular).
> 
> -Yonik
> 


Re: Custom JSON facet functions

2016-02-09 Thread Yonik Seeley
On Tue, Feb 9, 2016 at 10:02 AM, Markus Jelsma
 wrote:
> Nice! Are the aggregations also going to be pluggable? Reading the ticket, i 
> would assume it is going to be pluggable.

Yep.

-Yonik


> Thanks,
> Markus
>
> -Original message-
>> From:Yonik Seeley 
>> Sent: Tuesday 9th February 2016 15:25
>> To: solr-user@lucene.apache.org
>> Subject: Re: Custom JSON facet functions
>>
>> On Tue, Feb 9, 2016 at 7:10 AM, Markus Jelsma
>>  wrote:
>> > Hi - i must have missing something but is it possible to declare custom 
>> > JSON facet functions in solrconfig.xml? Just like we would do with request 
>> > handlers or  search components?
>>
>> Yes, but it will probably change:
>> https://issues.apache.org/jira/browse/SOLR-7447
>>
>> So currently, you would register a facet function just like a custom
>> function (value source),
>> but just put "_agg" at the end of the name and implement AggValueSource.
>> So for example, the "sum" facet function is registered as "sum_agg"
>> and implements AggValueSource (the class is SumAgg)
>>
>> So if you utilize this, just realize that the mechanism and interfaces
>> are experimental and subject to change (and probably will change at
>> some point for this in particular).
>>
>> -Yonik
>>


Re: Solr 4.10 with Jetty 8.1.10 & Tomcat 7

2016-02-09 Thread Shawn Heisey
On 2/8/2016 10:10 PM, Shahzad Masud wrote:
> Due to distributed search feature, I might not be able to run
> SolrCloud. I would appreciate, if you please share that way of setting
> solr home for a specific context in Jetty-Solr. Its good to seek more
> information for comparison purposes. Do you think having multiple JVMs
> would increase or decrease performance. My document base is around 20
> million rows (in 24 shards), with document size ranging from 100KB -
> 400 MB. SM

For most people, the *entire point* of running SolrCloud is to do
distributed search, so to hear that you can't run SolrCloud because of
distributed search is very confusing to me.

I admit to ignorance when it comes to the join feature in Solr ... but
it is my understanding that all you need to make joins work properly is
to have both of the indexes that you are joining running in the same JVM
and the same Solr instance.  If you arrange your SolrCloud replicas so a
copy of every index is loaded on every server, I think that would
satisfy this requirement.  I may be wrong, but I believe there are
SolrCloud users that use the join feature.

When you create a config file for a Solr context, whether it's Jetty,
Tomcat, or some other container, you can set the solr/home JNDI variable
in the context fragment to set the solr home for that context.  I found
a specific example for Tomcat.  I know Jetty can do the same, but I do
not know how to actually create the context fragment.

https://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat
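
For reference, a sketch of the Tomcat version of that fragment; the paths and
war location below are placeholders, not taken from the wiki page:

<?xml version="1.0" encoding="utf-8"?>
<!-- e.g. $CATALINA_HOME/conf/Catalina/localhost/solr.xml -->
<Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
  <!-- solr/home JNDI variable pointing this context at its own solr home -->
  <Environment name="solr/home" type="java.lang.String"
               value="/opt/solr/home" override="true"/>
</Context>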

I need to reiterate one point again.  You should only run one Solr
container per server, with exactly one Solr context installed in that
server.  This is recommended whether you're running SolrCloud or not,
and whether you're using distributed search or not.  One Solr context
can handle a LOT of indexes.

Running multiple Solr instances per server is only recommended in one
case:  Extremely large indexes where you would need a very large heap. 
Running two JVMs with smaller heaps *might* be more efficient ... but in
that case, it is usually better to split those indexes between two
separate servers, each one running only one instance of Solr.

Thanks,
Shawn



Re: Custom JSON facet functions

2016-02-09 Thread Yonik Seeley
On Tue, Feb 9, 2016 at 7:10 AM, Markus Jelsma
 wrote:
> Hi - i must have missing something but is it possible to declare custom JSON 
> facet functions in solrconfig.xml? Just like we would do with request 
> handlers or  search components?

Yes, but it will probably change:
https://issues.apache.org/jira/browse/SOLR-7447

So currently, you would register a facet function just like a custom
function (value source),
but just put "_agg" at the end of the name and implement AggValueSource.
So for example, the "sum" facet function is registered as "sum_agg"
and implements AggValueSource (the class is SumAgg)

So if you utilize this, just realize that the mechanism and interfaces
are experimental and subject to change (and probably will change at
some point for this in particular).
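
A very rough sketch of that wiring, purely illustrative: the parser name, the
package, and the choice to delegate to the built-in SumAgg (rather than
writing a new AggValueSource) are all assumptions, and as noted above the
interfaces may change.

// Registered in solrconfig.xml like any custom value source, with an "_agg" suffix:
//   <valueSourceParser name="mystat_agg" class="com.example.MyStatAggParser"/>
import org.apache.lucene.queries.function.ValueSource;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;
import org.apache.solr.search.facet.SumAgg;

public class MyStatAggParser extends ValueSourceParser {
    @Override
    public ValueSource parse(FunctionQParser fp) throws SyntaxError {
        ValueSource vs = fp.parseValueSource();  // the field or function being aggregated
        return new SumAgg(vs);                   // must return an AggValueSource subclass
    }
}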

-Yonik


Re: /solr/admin/ping causing exceptions in log?

2016-02-09 Thread Daniel Pool
Nathan,

Did you ever get to the bottom of this issue? I'm encountering exactly the same 
problem with haproxy 1.6.2; health checks throwing occasional errors and the 
connection being closed by haproxy.

Daniel Pool




Custom JSON facet functions

2016-02-09 Thread Markus Jelsma
Hi - I must have missed something, but is it possible to declare custom JSON
facet functions in solrconfig.xml? Just like we would do with request handlers
or search components?

Thanks,
Markus


Re: online scoring explanation

2016-02-09 Thread John Blythe
amazing, thanks!

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Tue, Feb 9, 2016 at 6:04 AM, Vincenzo D'Amore  wrote:

> Hi,
>
> I did a chrome extension:
>
>
> https://chrome.google.com/webstore/detail/solr-query-debugger/gmpkeiamnmccifccnbfljffkcnacmmdl
>
>
> Hope this helps,
> Vincenzo
>
>
> On Tue, Feb 9, 2016 at 11:39 AM, John Blythe  wrote:
>
> > that's it!
> >
> > and doug is the one from back in the day :)
> >
> > thanks guys
> >
> > --
> > *John Blythe*
> > Product Manager & Lead Developer
> >
> > 251.605.3071 | j...@curvolabs.com
> > www.curvolabs.com
> >
> > 58 Adams Ave
> > Evansville, IN 47713
> >
> > On Mon, Feb 8, 2016 at 3:08 PM, Toke Eskildsen 
> > wrote:
> >
> > > John Blythe  wrote:
> > > > last year i had gotten a site recommended to me on this forum. it
> > helped
> > > > you break down the results/score you were getting from your queries.
> > >
> > > http://splainer.io/ perhaps?
> > >
> > > - Toke Eskildsen
> > >
> >
>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>


Re: Solr 4.10 with Jetty 8.1.10 & Tomcat 7

2016-02-09 Thread Susheel Kumar
Shahzad - I am curious which features of distributed search stop you from
running SolrCloud. Using DS, you would be able to search across cores or
collections.
https://cwiki.apache.org/confluence/display/solr/Advanced+Distributed+Request+Options

Thanks,
Susheel

On Tue, Feb 9, 2016 at 12:10 AM, Shahzad Masud <
shahzad.ma...@northbaysolutions.net> wrote:

> Thank you Shawn for your response. I would be running some performance
> tests lately on this structure (one JVM with multiple cores), and would
> share feedback on this thread.
>
> >There IS a way to specify the solr home for a specific context, but keep
> >in mind that I definitely DO NOT recommend doing this.  There is
> >resource and administrative overhead to running multiple copies of Solr
> >in one JVM.  Simply run one context and let it handle multiple shards,
> >whether you choose SolrCloud or not.
> Due to distributed search feature, I might not be able to run SolrCloud. I
> would appreciate, if you please share that way of setting solr home for a
> specific context in Jetty-Solr. Its good to seek more information for
> comparison purposes. Do you think having multiple JVMs would increase or
> decrease performance. My document base is around 20 million rows (in 24
> shards), with document size ranging from 100KB - 400 MB.
>
> SM
>
> On Mon, Feb 8, 2016 at 8:09 PM, Shawn Heisey  wrote:
>
> > On 2/8/2016 1:14 AM, Shahzad Masud wrote:
> > > Thank you Shawn for your reply. Here is my structure of cores and
> shards
> > >
> > > Shard 1 = localhost:8983/solr_2014 [3 Core  - Employee, Service
> Tickets,
> > > Departments]
> > > Shard 2 = localhost:8983/solr_2015 [3 Core  - Employee, Service
> Tickets,
> > > Departments]
> > > Shard 3 = localhost:8983/solr_2016 [3 Core  - Employee, Service
> Tickets,
> > > Departments]
> > >
> > > While searching, I use distributed search feature to search data from
> all
> > > three shards in respective cores e.g. If I want to search from Employee
> > > data for all three years, I search from Employee core of three
> contexts.
> > > This is legacy design, do you think this is okay, or this require
> > immediate
> > > restructure / design? I am going to try this,
> > >
> > > Context = localhost:8982/solr (9 cores - Employee-2014, Employee-2015,
> > > Employee-2016, ServiceTickets-2014, ServiceTickets-2015,
> > > ServiceTickets-2016, Department-2014, Department-2015, Department-2016]
> > > distributed search would be from all three cores of same data category
> > > (i.e. For Employee search, it would be from Employee-2014,
> Employee-2015,
> > > Employee-2016).
> >
> > With SolrCloud, you can have multiple collections for each of these
> > types and alias them together.  Or you can simply have one collection
> > for employee, one for servicetickets, and one for department, with
> > SolrCloud automatically handling splitting those documents into the
> > number of shardsthat you specify when you create the collection.  You
> > can also do manual sharding and split each collection on a time basis
> > like you have been doing, but then you lose some of the automation that
> > SolrCloud provides, so I do not recommend handling it that way.
> >
> > > Regarding one Solr context per jetty; I cannot run two solr contexts
> > > pointing to different data in Jetty, as while starting jetty I have to
> > > provide -Dsolr.solr.home variable - which ends up pointing to one data
> > > folder (2014 data) only.
> >
> > You do not need multiple contexts to have multiple indexes.
> >
> > My dev Solr server has exactly one Solr JVM, with exactly one context --
> > /solr.  That instance of Solr has 45 indexes (cores) on it.  These 45
> > cores are various shards for three larger indexes.  I am not running
> > SolrCloud, but I certainly could.
> >
> > You can see 25 of the 45 cores in my Solr instance in this screenshot of
> > the admin UI for this server:
> >
> > https://www.dropbox.com/s/v87mxvkdejvd92h/solr-with-45-cores.png?dl=0
> >
> > There IS a way to specify the solr home for a specific context, but keep
> > in mind that I definitely DO NOT recommend doing this.  There is
> > resource and administrative overhead to running multiple copies of Solr
> > in one JVM.  Simply run one context and let it handle multiple shards,
> > whether you choose SolrCloud or not.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Solr 4.10 with Jetty 8.1.10 & Tomcat 7

2016-02-09 Thread Shahzad Masud
Susheel, thank you for asking. I am using joins of two cores (employee,
department, servicetickets), which isn't supported by SolrCloud - last time I
checked. Not sure if this (advanced distributed request option) was present
in 4.10. Do you think I am missing something here?

Shahzad

On Tue, Feb 9, 2016 at 6:39 PM, Susheel Kumar  wrote:

> Shahzad - I am curious what features of distributed search stops you to run
> SolrCloud. Using DS, you would be able to search across cores or
> collections.
>
> https://cwiki.apache.org/confluence/display/solr/Advanced+Distributed+Request+Options
>
> Thanks,
> Susheel
>
> On Tue, Feb 9, 2016 at 12:10 AM, Shahzad Masud <
> shahzad.ma...@northbaysolutions.net> wrote:
>
> > Thank you Shawn for your response. I would be running some performance
> > tests lately on this structure (one JVM with multiple cores), and would
> > share feedback on this thread.
> >
> > >There IS a way to specify the solr home for a specific context, but keep
> > >in mind that I definitely DO NOT recommend doing this.  There is
> > >resource and administrative overhead to running multiple copies of Solr
> > >in one JVM.  Simply run one context and let it handle multiple shards,
> > >whether you choose SolrCloud or not.
> > Due to distributed search feature, I might not be able to run SolrCloud.
> I
> > would appreciate, if you please share that way of setting solr home for a
> > specific context in Jetty-Solr. Its good to seek more information for
> > comparison purposes. Do you think having multiple JVMs would increase or
> > decrease performance. My document base is around 20 million rows (in 24
> > shards), with document size ranging from 100KB - 400 MB.
> >
> > SM
> >
> > On Mon, Feb 8, 2016 at 8:09 PM, Shawn Heisey 
> wrote:
> >
> > > On 2/8/2016 1:14 AM, Shahzad Masud wrote:
> > > > Thank you Shawn for your reply. Here is my structure of cores and
> > shards
> > > >
> > > > Shard 1 = localhost:8983/solr_2014 [3 Core  - Employee, Service
> > Tickets,
> > > > Departments]
> > > > Shard 2 = localhost:8983/solr_2015 [3 Core  - Employee, Service
> > Tickets,
> > > > Departments]
> > > > Shard 3 = localhost:8983/solr_2016 [3 Core  - Employee, Service
> > Tickets,
> > > > Departments]
> > > >
> > > > While searching, I use distributed search feature to search data from
> > all
> > > > three shards in respective cores e.g. If I want to search from
> Employee
> > > > data for all three years, I search from Employee core of three
> > contexts.
> > > > This is legacy design, do you think this is okay, or this require
> > > immediate
> > > > restructure / design? I am going to try this,
> > > >
> > > > Context = localhost:8982/solr (9 cores - Employee-2014,
> Employee-2015,
> > > > Employee-2016, ServiceTickets-2014, ServiceTickets-2015,
> > > > ServiceTickets-2016, Department-2014, Department-2015,
> Department-2016]
> > > > distributed search would be from all three cores of same data
> category
> > > > (i.e. For Employee search, it would be from Employee-2014,
> > Employee-2015,
> > > > Employee-2016).
> > >
> > > With SolrCloud, you can have multiple collections for each of these
> > > types and alias them together.  Or you can simply have one collection
> > > for employee, one for servicetickets, and one for department, with
> > > SolrCloud automatically handling splitting those documents into the
> > > number of shardsthat you specify when you create the collection.  You
> > > can also do manual sharding and split each collection on a time basis
> > > like you have been doing, but then you lose some of the automation that
> > > SolrCloud provides, so I do not recommend handling it that way.
> > >
> > > > Regarding one Solr context per jetty; I cannot run two solr contexts
> > > > pointing to different data in Jetty, as while starting jetty I have
> to
> > > > provide -Dsolr.solr.home variable - which ends up pointing to one
> data
> > > > folder (2014 data) only.
> > >
> > > You do not need multiple contexts to have multiple indexes.
> > >
> > > My dev Solr server has exactly one Solr JVM, with exactly one context
> --
> > > /solr.  That instance of Solr has 45 indexes (cores) on it.  These 45
> > > cores are various shards for three larger indexes.  I am not running
> > > SolrCloud, but I certainly could.
> > >
> > > You can see 25 of the 45 cores in my Solr instance in this screenshot
> of
> > > the admin UI for this server:
> > >
> > > https://www.dropbox.com/s/v87mxvkdejvd92h/solr-with-45-cores.png?dl=0
> > >
> > > There IS a way to specify the solr home for a specific context, but
> keep
> > > in mind that I definitely DO NOT recommend doing this.  There is
> > > resource and administrative overhead to running multiple copies of Solr
> > > in one JVM.  Simply run one context and let it handle multiple shards,
> > > whether you choose SolrCloud or not.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>


Re: /solr/admin/ping causing exceptions in log?

2016-02-09 Thread Shawn Heisey
On 2/9/2016 7:01 AM, Daniel Pool wrote:
> Did you ever get to the bottom of this issue? I'm encountering exactly the 
> same problem with haproxy 1.6.2; health checks throwing occasional errors and 
> the connection being closed by haproxy.

Your message did not include any quotes from the original thread, or
mention the specific problem you are seeing.  The thread that you
replied to is nearly a year and a half old, so I had already archived it
to a "2014" folder.

If you are seeing "EofException" in your logs like the original poster
was, then this is happening because Solr is taking longer to respond to
the query than whichever timeout in haproxy is active for that request,
so haproxy closed the TCP connection, resulting in that error in the
Solr log.  The underlying issue is likely a performance problem.

If you are seeing a different problem than EofException, please give us
full details.

Here is all the generic info I have on performance problems with Solr:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Re: Solr architecture

2016-02-09 Thread Mark Robinson
Hi,
Thanks for all your suggestions. I took some time to get the details to be
more accurate. Please find what I have gathered:-

My data being indexed is something like this.
I am basically capturing all data related to a user session.
Inside a session I have categorized my actions like actionA, actionB etc..,
per page.
So each time an action pertaining to say actionA or actionB etc.. (in each
page) happens, it is updated in Solr under that session (sessionId).

So in short there is only one doc pertaining to a single session
(identified by sessionid) in my Solr index and that is retrieved and
updated
whenever a new action under that session occurs.
We expect up to 4Million sessions per day.

On an average *one session's* *doc has a size* of *3MB to 20MB*.
So if it is *4Million sessions per day*, each session writing around *500
times to Solr*, that is *2Billion writes (indexing operations) per day to Solr*.
As it is one doc per session, it is *4Million docs per day*.
This is around *80K docs indexed per second* during *peak* hours and
around *15K
docs indexed per second* into Solr during* non-peak* hours.
Number of queries per second is around *320 queries per second*.


1. Average size of a doc
 3MB to 20MB
2. Query types:-
 Until that session is in progress, whatever data is there for that
session so far is queried and the new action's details captured and
appended to existing data already captured related to that session
and indexed back into Solr. So, the longer the session, the more data retrieved
for each subsequent query to get current data captured for that session.
 Also querying can be done on timestamp etc... which is captured along
with each action.
3. Are docs grouped somehow?
 All data related to a session are retrieved from Solr, updated and
indexed back to Solr based on sessionId. No other grouping.
4. Are they time sensitive (NRT or offline process does this)
 As mentioned above this is in NRT. Each time a new user action in that
session happens, we need to query existing session info already captured
related to that session and append this new data to this existing
info retrieved and index it back to Solr.
5. Will they update or it is rebuild every time, etc.
 Each time a new user action occurs, the full data pertaining to that
session so far captured is retrieved from Solr, the extra latest data
pertaining to this new action is appended  and indexed  back to Solr.
6. And the other thing you haven't told us is whether you plan on _adding_
2B docs a day or whether that number is the total corpus size and you are
re-indexing the 2B docs/day. IOW, if you are  adding 2B docs/day, 30 days
later do you have 2B docs or 60B docs in your
   corpus?
   We are expecting around 4 million sessions per day (per session 500
writes to Solr), which turns out to be 2B indexing done per day. So after
30 days it would be 4Million*30 docs in the index.
7. Is there any aging of docs
 No we always query against the whole corpus present.
8. Is any doc deleted?
 No all data remains in the index

Any suggestion is very welcome!

Thanks!
Mark.


On Mon, Feb 8, 2016 at 3:30 PM, Jack Krupansky 
wrote:

> Oops... at 100 qps for a single node you would need 120 nodes to get to 12K
> qps and 800 nodes to get 80K qps, but that is just an extremely rough
> ballpark estimate, not some precise and firm number. And that's if all the
> queries can be evenly distributed throughout the cluster and don't require
> fanout to other shards, which effectively turns each incoming query into n
> queries where n is the number of shards.
>
> -- Jack Krupansky
>
> On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky 
> wrote:
>
> > So is there any aging or TTL (in database terminology) of older docs?
> >
> > And do all of your queries need to query all of the older documents all
> of
> > the time or is there a clear hierarchy of querying for aged documents,
> like
> > past 24-hours vs. past week vs. past year vs. older than a year? Sure,
> you
> > can always use a function query to boost by the inverse of document age,
> > but Solr would be more efficient with filter queries or separate indexes
> > for different time scales.
> >
> > Are documents ever updated or are they write-once?
> >
> > Are documents explicitly deleted?
> >
> > Technically you probably could meet those specs, but... how many
> > organizations have the resources and the energy to do so?
> >
> > As a back of the envelope calculation, if Solr gave you 100 queries per
> > second per node, that would mean you would need 1,200 nodes. It would
> also
> > depend on whether those queries are very narrow so that a single node can
> > execute them or if they require fanout to other shards and then
> aggregation
> > of results from those other shards.
> >
> > -- Jack Krupansky
> >
> > On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson  >
> > wrote:
> >
> >> Short 

Re: SolrCloud behavior when a ZooKeeper node goes down

2016-02-09 Thread Shawn Heisey
On 2/8/2016 1:09 PM, Kelly, Frank wrote:
> We are running a small SolrCloud instance on AWS
>
> Solr : Version 5.3.1
> ZooKeeper: Version 3.4.6
>
> 3 x ZooKeeper nodes (with higher limits and timeouts due to being on AWS)
> 3 x Solr Nodes (8 GB of memory each – 2 collections with 3 shards for
> each collection)
>
> Let’s call the ZooKeeper nodes A, B and C.
> One of our ZooKeeper nodes (B) failed a health check and was replaced
> due to autoscaling, but during this time of failover
> our SolrCloud cluster became unavailable. All new connections to Solr
> were unable to connect complaining about connectivity issues
> and preexisting connections also had errors
>

> I thought because we had configured SolrCloud to point at all three ZK
> nodes that the failure of one ZK node would be OK (since we still had
> a quorum).
>  Did I misunderstand something about SolrCloud and its relationship
> with ZK?

That's supposed to be how Zookeeper and SolrCloud work, if everything is
configured properly and has full network connectivity.

What is your zkHost string for Solr?  Is the zkHost value the same on
all three SolrCloud nodes?  It should be identical on all of them, and
every server should be able to directly reach every other server on all
relevant ports.

> The weird thing now is that when the new ZooKeeper node (D) started up
> – after a few minutes we could connect to SolrCloud again even though
> we were still only pointing to A,B and C (not D).
> Any thoughts on why this also happened?

This sounds odd.

The exceptions that you outlined are from *client* code
(CloudSolrClient), not the Solr servers.  CloudSolrClient instances
should normally be constructed using the same zkHost string that your
Solr servers use, listing all of the zookeeper servers.  Is this how
they are set up?
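
For illustration, something like this (host names and the /solr chroot below
are placeholders):

// All three ZooKeeper nodes in one zkHost string, same as the Solr servers use,
// so the loss of any single ZK node still leaves the client with a quorum.
CloudSolrClient client =
    new CloudSolrClient("zkA:2181,zkB:2181,zkC:2181/solr");
client.setDefaultCollection("collection1");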

I am unsure how all this might be affected by the internal/external
addressing that AWS uses.

Thanks,
Shawn



Knowing which doc failed to get added in solr during bulk addition in Solr 5.2

2016-02-09 Thread Debraj Manna
Hi,



I have Document Centric Versioning Constraints added in my Solr config:-


<processor class="solr.DocBasedVersionConstraintsProcessorFactory">
  <bool name="ignoreOldUpdates">false</bool>
  <str name="versionField">doc_version</str>
</processor>

I am adding multiple documents in solr in a single call using SolrJ 5.2.
The code fragment looks something like below :-


try {
    UpdateResponse resp = solrClient.add(docs.getDocCollection(), 500);
    if (resp.getStatus() != 0) {
        throw new Exception(new StringBuilder(
                "Failed to add docs in solr ").append(resp.toString())
                .toString());
    }
} catch (Exception e) {
    logError("Adding docs to solr failed", e);
}


If one of the documents violates the versioning constraints, Solr returns an
exception with an error message like "user version is not high
enough: 1454587156", and the other documents get added perfectly. Is
there a way I can know which document is violating the constraints, either
in the Solr logs or from the update response returned by Solr?

Thanks


Re: Knowing which doc failed to get added in solr during bulk addition in Solr 5.2

2016-02-09 Thread Erick Erickson
This has been a long-standing issue; Hoss is doing some current work on it, see:
https://issues.apache.org/jira/browse/SOLR-445

But the short form is "no, not yet".
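
In the meantime, one common workaround is to trade some throughput for
per-document error reporting - a sketch reusing the solrClient/docs/logError
names from your code fragment (so those names are assumptions about that code):

for (SolrInputDocument doc : docs.getDocCollection()) {
    try {
        solrClient.add(doc, 500);  // same commitWithin as the batched call
    } catch (Exception e) {
        // Only the doc that violates the version constraint fails here, so it
        // can be logged by id ("id" is assumed to be the uniqueKey field).
        logError("Rejected doc " + doc.getFieldValue("id"), e);
    }
}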

Best,
Erick

On Tue, Feb 9, 2016 at 8:19 AM, Debraj Manna  wrote:
> Hi,
>
>
>
> I have a Document Centric Versioning Constraints added in solr schema:-
>
> 
>   false
>   doc_version
> 
>
> I am adding multiple documents in solr in a single call using SolrJ 5.2.
> The code fragment looks something like below :-
>
>
> try {
> UpdateResponse resp = solrClient.add(docs.getDocCollection(),
> 500);
> if (resp.getStatus() != 0) {
> throw new Exception(new StringBuilder(
> "Failed to add docs in solr ").append(resp.toString())
> .toString());
> }
> } catch (Exception e) {
> logError("Adding docs to solr failed", e);
> }
>
>
> If one of the document is violating the versioning constraints then Solr is
> returning an exception with error message like "user version is not high
> enough: 1454587156" & the other documents are getting added perfectly. Is
> there a way I can know which document is violating the constraints either
> in Solr logs or from the Update response returned by Solr?
>
> Thanks


Re: How is Tika used with Solr

2016-02-09 Thread Erick Erickson
Here's a writeup that should help

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
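
A stripped-down sketch along those lines - Tika runs in your crawler's JVM and
only the extracted text goes to Solr. The URL, field names, and single-file
main() are placeholders:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaToSolr {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/files");
        AutoDetectParser parser = new AutoDetectParser();

        File f = new File(args[0]);
        BodyContentHandler text = new BodyContentHandler(-1);  // -1 = no write limit
        Metadata meta = new Metadata();
        try (InputStream in = new FileInputStream(f)) {
            parser.parse(in, text, meta);   // extraction happens here, not inside Solr
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", f.getAbsolutePath());
        doc.addField("title", meta.get("title"));
        doc.addField("text", text.toString());
        solr.add(doc, 1000);                // commitWithin 1s; tune to your update rate
        solr.close();
    }
}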

On Tue, Feb 9, 2016 at 2:49 PM, Alexandre Rafalovitch
 wrote:
> Solr uses Tika directly. And not in the most efficient way. It is
> there mostly for convenience rather than performance.
>
> So, for performance, Solr recommendation is also to run Tika
> separately and only send Solr the processed documents.
>
> Regards,
> Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 10 February 2016 at 09:46, Steven White  wrote:
>> Hi folks,
>>
>> I'm writing a file-system-crawler that will index files.  The file system
>> is going to be very busy an I anticipate on average 10 new updates per
>> min.  My application checks for new or updated files once every 1 min.  I
>> use Tika to extract the raw-text off those files and send them over to Solr
>> for indexing.  My application will be running 24x7xN-days.  It will not
>> recycle unless if the OS is restarted.
>>
>> Over at Tika mailing list, I was told the following:
>>
>> "As a side note, if you are handling a bunch of files from the wild in a
>> production environment, I encourage separating Tika into a separate jvm vs
>> tying it into any post processing – consider tika-batch and writing
>> separate text files for each file processed (not so efficient, but
>> exceedingly robust).  If this is demo code or you know your document set
>> well enough, you should be good to go with keeping Tika and your
>> postprocessing steps in the same jvm."
>>
>> My question is, how does Solr utilize Tika?  Does it run Tika in its own
>> JVM as an out-of-process application or does it link with Tika JARs
>> directly?  If it links in directly, are there known issues with Solr
>> integrated with Tika because of Tika issues?
>>
>> Thanks
>>
>> Steve


Re: replicate indexing to second site

2016-02-09 Thread Upayavira
There is a Cross Datacenter replication feature in the works - not sure
of its status.

In lieu of that, I'd simply have two copies of your indexing code -
index everything simultaneously into both clusters.

There is, of course, a risk that both get out of sync, so you might want
to find some ways to identify/manage that.

Upayavira

On Tue, Feb 9, 2016, at 08:43 PM, tedsolr wrote:
> I have a Solr Cloud cluster (v5.2.1) using a Zookeeper ensemble in my
> primary
> data center. I am now trying to plan for disaster recovery with an
> available
> warm site. I have read (many times) the disaster recovery section in the
> Apache ref guide. I suppose I don't fully understand it.
> 
> What I'd like to know is the best way to sync up the existing data, and
> the
> best way to keep that data in sync. Assume that the warm site is an exact
> copy (not at the network level) of the production cluster - so the same
> servers with the same config. All servers are virtual. The use case is
> the
> active cluster goes down and cannot be repaired, so the warm site would
> become the active site. This is a manual process that takes many hours to
> accomplish (I just need to fit Solr into this existing process, I can't
> change the process :).
> 
> I expect that rsync can be used initially to copy the collection data
> folders and the zookeeper data and transaction log folders. So after
> verifying Solr/ZK is functional after the install, shut it down and
> perform
> the copy. This may sound slow but my production index size is < 100GB. Is
> this approach reasonable?
> 
> So now to keep the warm site in sync, I could use rsync on a scheduled
> basis
> but I assume there's a better way. The ref guide says to send all
> indexing
> requests to the second cluster at the same time they are sent to the
> active
> cluster. I use SolrJ for all requests. So would this entail using a
> second
> CloudSolrClient instance that only knows about the second cluster? Seems
> reasonable but I don't want to lengthen the response time for the users.
> Is
> this just a software problem to work out (separate thread)? Or is there a
> SolrJ solution (asyc calls)?
> 
> Thanks!!
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/replicate-indexing-to-second-site-tp4256240.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: How is Tika used with Solr

2016-02-09 Thread Alexandre Rafalovitch
Solr uses Tika directly. And not in the most efficient way. It is
there mostly for convenience rather than performance.

So, for performance, Solr recommendation is also to run Tika
separately and only send Solr the processed documents.

Regards,
Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 10 February 2016 at 09:46, Steven White  wrote:
> Hi folks,
>
> I'm writing a file-system-crawler that will index files.  The file system
> is going to be very busy an I anticipate on average 10 new updates per
> min.  My application checks for new or updated files once every 1 min.  I
> use Tika to extract the raw-text off those files and send them over to Solr
> for indexing.  My application will be running 24x7xN-days.  It will not
> recycle unless if the OS is restarted.
>
> Over at Tika mailing list, I was told the following:
>
> "As a side note, if you are handling a bunch of files from the wild in a
> production environment, I encourage separating Tika into a separate jvm vs
> tying it into any post processing – consider tika-batch and writing
> separate text files for each file processed (not so efficient, but
> exceedingly robust).  If this is demo code or you know your document set
> well enough, you should be good to go with keeping Tika and your
> postprocessing steps in the same jvm."
>
> My question is, how does Solr utilize Tika?  Does it run Tika in its own
> JVM as an out-of-process application or does it link with Tika JARs
> directly?  If it links in directly, are there known issues with Solr
> integrated with Tika because of Tika issues?
>
> Thanks
>
> Steve


How is Tika used with Solr

2016-02-09 Thread Steven White
Hi folks,

I'm writing a file-system-crawler that will index files.  The file system
is going to be very busy and I anticipate on average 10 new updates per
min.  My application checks for new or updated files once every 1 min.  I
use Tika to extract the raw-text off those files and send them over to Solr
for indexing.  My application will be running 24x7xN-days.  It will not
recycle unless the OS is restarted.

Over at Tika mailing list, I was told the following:

"As a side note, if you are handling a bunch of files from the wild in a
production environment, I encourage separating Tika into a separate jvm vs
tying it into any post processing – consider tika-batch and writing
separate text files for each file processed (not so efficient, but
exceedingly robust).  If this is demo code or you know your document set
well enough, you should be good to go with keeping Tika and your
postprocessing steps in the same jvm."

My question is, how does Solr utilize Tika?  Does it run Tika in its own
JVM as an out-of-process application or does it link with Tika JARs
directly?  If it links in directly, are there known issues with Solr
integrated with Tika because of Tika issues?

Thanks

Steve


Re: Solr architecture

2016-02-09 Thread Daniel Collins
So as I understand your use case, it's effectively logging actions within a
user session - why do you have to do the update in NRT?  Why not just log
all the user session events (with some unique key, and ensuring the session
Id is in the document somewhere), then when you want to do the query, you
join on the session id, and that gives you all the data records for that
session. I don't really follow why it has to be 1 document (which you
continually update). If you really need that aggregation, couldn't that
happen offline?

I guess your one saving grace is that you query using the unique ID (in your
scenario), so you could use the real-time get handler, since you aren't
doing a complex query (strictly it's not a search, it's a raw key lookup).
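
For example, with SolrJ (assuming your session id is the uniqueKey and
"cloudClient" is an existing CloudSolrClient - both assumptions):

// getById() goes through the real-time get handler, so it returns the latest
// version of the doc - even uncommitted - without touching the search side.
SolrDocument session = cloudClient.getById("session-12345");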

But I would still question your use case: if you go the Solr route for that
kind of scale with querying and indexing that much, you're going to have to
throw a lot of hardware at it - as Jack says, probably in the order of
hundreds of machines...

On 9 February 2016 at 19:00, Upayavira  wrote:

> Bear in mind that Lucene is optimised towards high read lower write.
> That is, it puts in a lot of effort at write time to make reading
> efficient. It sounds like you are going to be doing far more writing
> than reading, and I wonder whether you are necessarily choosing the
> right tool for the job.
>
> How would you later use this data, and what advantage is there to
> storing it in Solr?
>
> Upayavira
>
> On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:
> > Hi,
> > Thanks for all your suggestions. I took some time to get the details to
> > be
> > more accurate. Please find what I have gathered:-
> >
> > My data being indexed is something like this.
> > I am basically capturing all data related to a user session.
> > Inside a session I have categorized my actions like actionA, actionB
> > etc..,
> > per page.
> > So each time an action pertaining to say actionA or actionB etc.. (in
> > each
> > page) happens, it is updated in Solr under that session (sessionId).
> >
> > So in short there is only one doc pertaining to a single session
> > (identified by sessionid) in my Solr index and that is retrieved and
> > updated
> > whenever a new action under that session occurs.
> > We expect up to 4Million sessions per day.
> >
> > On an average *one session's* *doc has a size* of *3MB to 20MB*.
> > So if it is *4Million sessions per day*, each session writing around *500
> > times to Solr*, it is* 2Billion writes or (indexing) per day to Solr*.
> > As it is one doc per session, it is *4Million docs per day*.
> > This is around *80K docs indexed per second* during *peak* hours and
> > around *15K
> > docs indexed per second* into Solr during* non-peak* hours.
> > Number of queries per second is around *320 queries per second*.
> >
> >
> > 1. Average size of a doc
> >  3MB to 20MB
> > 2. Query types:-
> >  Until that session is in progress, whatever data is there for that
> > session so far is queried and the new action's details captured and
> > appended to existing data already captured related to that session
> > and indexed back into Solr. So, longer the session the more data
> > retrieved
> > for each subsequent query to get current data captured for that session.
> >  Also querying can be done on timestamp etc... which is captured
> >  along
> > with each action.
> > 3. Are docs grouped somehow?
> >  All data related to a session are retrieved from Solr, updated and
> > indexed back to Solr based on sessionId. No other grouping.
> > 4. Are they time sensitive (NRT or offline process does this)
> >  As mentioned above this is in NRT. Each time a new user action in
> >  that
> > session happens, we need to query existing session info already captured
> > related to that session and append this new data to this existing
> > info retrieved and index it back to Solr.
> > 5. Will they update or it is rebuild every time, etc.
> >  Each time a new user action occurs, the full data pertaining to that
> > session so far captured is retrieved from Solr, the extra latest data
> > pertaining to this new action is appended  and indexed  back to Solr.
> > 6. And the other thing you haven't told us is whether you plan on
> > _adding_
> > 2B docs a day or whether that number is the total corpus size and you are
> > re-indexing the 2B docs/day. IOW, if you are  adding 2B docs/day, 30 days
> > later do you have 2B docs or 60B docs in your
> >corpus?
> >We are expecting around 4 million sessions per day (per session 500
> > writes to Solr), which turns out to be 2B indexing done per day. So after
> > 30 days it would be 4Million*30 docs in the index.
> > 7. Is there any aging of docs
> >  No we always query against the whole corpus present.
> > 8. Is any doc deleted?
> >  No all data remains in the index
> >
> > Any suggestion is very welcome!
> >
> > Thanks!
> > Mark.
> >
> >
> > On Mon, Feb 8, 2016 at 

Re: replicate indexing to second site

2016-02-09 Thread Walter Underwood
Making two indexing calls, one to each, works until one system is not 
available. Then they are out of sync.

You might want to put the updates into a persistent message queue, then have 
both systems indexed from that queue.
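
A rough sketch of that shape, with an in-memory BlockingQueue standing in for
the durable broker (in practice something like Kafka or RabbitMQ); the class
and names below are made up:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class QueueFedIndexer {
    private final BlockingQueue<SolrInputDocument> primaryQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<SolrInputDocument> warmQueue = new LinkedBlockingQueue<>();

    // The application enqueues each update once; no Solr call in the request path.
    public void publish(SolrInputDocument doc) {
        primaryQueue.add(doc);
        warmQueue.add(doc);
    }

    // One consumer thread per cluster, so a slow or down site never blocks the other.
    public Thread startConsumer(BlockingQueue<SolrInputDocument> queue, CloudSolrClient client) {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    client.add(queue.take());  // with a durable broker, ack only after success
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                } catch (Exception e) {
                    e.printStackTrace();       // real code would retry or dead-letter
                }
            }
        });
        t.start();
        return t;
    }
}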

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 9, 2016, at 1:49 PM, Upayavira  wrote:
> 
> There is a Cross Datacenter replication feature in the works - not sure
> of its status.
> 
> In lieu of that, I'd simply have two copies of your indexing code -
> index everything simultaneously into both clusters.
> 
> There is, of course risks that both get out of sync, so you might want
> to find some ways to identify/manage that.
> 
> Upayavira
> 
> On Tue, Feb 9, 2016, at 08:43 PM, tedsolr wrote:
>> I have a Solr Cloud cluster (v5.2.1) using a Zookeeper ensemble in my
>> primary
>> data center. I am now trying to plan for disaster recovery with an
>> available
>> warm site. I have read (many times) the disaster recovery section in the
>> Apache ref guide. I suppose I don't fully understand it.
>> 
>> What I'd like to know is the best way to sync up the existing data, and
>> the
>> best way to keep that data in sync. Assume that the warm site is an exact
>> copy (not at the network level) of the production cluster - so the same
>> servers with the same config. All servers are virtual. The use case is
>> the
>> active cluster goes down and cannot be repaired, so the warm site would
>> become the active site. This is a manual process that takes many hours to
>> accomplish (I just need to fit Solr into this existing process, I can't
>> change the process :).
>> 
>> I expect that rsync can be used initially to copy the collection data
>> folders and the zookeeper data and transaction log folders. So after
>> verifying Solr/ZK is functional after the install, shut it down and
>> perform
>> the copy. This may sound slow but my production index size is < 100GB. Is
>> this approach reasonable?
>> 
>> So now to keep the warm site in sync, I could use rsync on a scheduled
>> basis
>> but I assume there's a better way. The ref guide says to send all
>> indexing
>> requests to the second cluster at the same time they are sent to the
>> active
>> cluster. I use SolrJ for all requests. So would this entail using a
>> second
>> CloudSolrClient instance that only knows about the second cluster? Seems
>> reasonable but I don't want to lengthen the response time for the users.
>> Is
>> this just a software problem to work out (separate thread)? Or is there a
>> SolrJ solution (asyc calls)?
>> 
>> Thanks!!
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/replicate-indexing-to-second-site-tp4256240.html
>> Sent from the Solr - User mailing list archive at Nabble.com.



Re: replicate indexing to second site

2016-02-09 Thread Shawn Heisey
On 2/9/2016 1:43 PM, tedsolr wrote:
> I expect that rsync can be used initially to copy the collection data
> folders and the zookeeper data and transaction log folders. So after
> verifying Solr/ZK is functional after the install, shut it down and perform
> the copy. This may sound slow but my production index size is < 100GB. Is
> this approach reasonable?
>
> So now to keep the warm site in sync, I could use rsync on a scheduled basis
> but I assume there's a better way. The ref guide says to send all indexing
> requests to the second cluster at the same time they are sent to the active
> cluster. I use SolrJ for all requests. So would this entail using a second
> CloudSolrClient instance that only knows about the second cluster? Seems
> reasonable but I don't want to lengthen the response time for the users. Is
> this just a software problem to work out (separate thread)? Or is there a
> SolrJ solution (asyc calls)?

The way I would personally handle keeping both systems in sync at the
moment would be to modify my indexing system to update both systems in
parallel.  That likely would involve a second CloudSolrClient instance.
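
Not actual production code, but a rough sketch of what two CloudSolrClient
instances updated in parallel could look like, with the warm-site add pushed
onto another thread so user-facing latency stays tied to the primary cluster
only. The ZooKeeper hosts and collection name are assumptions:

import java.util.concurrent.CompletableFuture;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DualClusterIndexer {

    private final CloudSolrClient primary  = new CloudSolrClient("zk-primary:2181/solr");
    private final CloudSolrClient warmSite = new CloudSolrClient("zk-warm:2181/solr");

    public DualClusterIndexer() {
        primary.setDefaultCollection("mycollection");
        warmSite.setDefaultCollection("mycollection");
    }

    public void index(SolrInputDocument doc) throws Exception {
        // Warm-site update runs asynchronously; failures are recorded for a
        // later resync instead of blocking or failing the user-facing request.
        CompletableFuture.runAsync(() -> {
            try {
                warmSite.add(doc);
            } catch (Exception e) {
                // record the failed update so the warm site can be caught up later
            }
        });

        primary.add(doc);   // the caller only waits on the primary cluster
    }
}

The hard part, of course, is what happens when one side is unreachable, which
is why tracking pending updates per target (or feeding both sides from a
shared persistent queue) matters more than the client code itself.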

There's a new feature called "Cross Data Center Replication" but as far
as I know, it is only available in development versions, and has not
been made available in any released version of Solr.

http://yonik.com/solr-cross-data-center-replication/

This new feature may become available in 6.0 or a later 6.x release.  I
do not have any concrete information about the expected release date for
6.0.

Thanks,
Shawn



Re: replicate indexing to second site

2016-02-09 Thread Walter Underwood
Updating two systems in parallel gets into two-phase commit, instantly. So you 
need a persistent pool of updates that both clusters pull from.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 9, 2016, at 4:15 PM, Shawn Heisey  wrote:
> 
> On 2/9/2016 1:43 PM, tedsolr wrote:
>> I expect that rsync can be used initially to copy the collection data
>> folders and the zookeeper data and transaction log folders. So after
>> verifying Solr/ZK is functional after the install, shut it down and perform
>> the copy. This may sound slow but my production index size is < 100GB. Is
>> this approach reasonable?
>> 
>> So now to keep the warm site in sync, I could use rsync on a scheduled basis
>> but I assume there's a better way. The ref guide says to send all indexing
>> requests to the second cluster at the same time they are sent to the active
>> cluster. I use SolrJ for all requests. So would this entail using a second
>> CloudSolrClient instance that only knows about the second cluster? Seems
>> reasonable but I don't want to lengthen the response time for the users. Is
>> this just a software problem to work out (separate thread)? Or is there a
>> SolrJ solution (asyc calls)?
> 
> The way I would personally handle keeping both systems in sync at the
> moment would be to modify my indexing system to update both systems in
> parallel.  That likely would involve a second CloudSolrClient instance.
> 
> There's a new feature called "Cross Data Center Replication" but as far
> as I know, it is only available in development versions, and has not
> been made available in any released version of Solr.
> 
> http://yonik.com/solr-cross-data-center-replication/
> 
> This new feature may become available in 6.0 or a later 6.x release.  I
> do not have any concrete information about the expected release date for
> 6.0.
> 
> Thanks,
> Shawn
> 



Re: How is Tika used with Solr

2016-02-09 Thread Steven White
Thank you Erick and Alex.

My main question is with a long running process using Tika in the same JVM
as my application.  I'm running my file-system-crawler in its own JVM (not
Solr's).  On the Tika mailing list, it is suggested to run Tika's code in its
own JVM and invoke it from my file-system-crawler using
Runtime.getRuntime().exec().

I fully understand from Alex's suggestion and the link provided by Erick to use
Tika outside Solr.  But what about using Tika within the same JVM as my
file-system-crawler application, or should I be making a system call to
invoke another JAR that runs in its own JVM to extract the raw text?  Are
there known issues with Tika when used in a long running process?

Steve


On Tue, Feb 9, 2016 at 5:53 PM, Erick Erickson 
wrote:

> Here's a writeup that should help
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> On Tue, Feb 9, 2016 at 2:49 PM, Alexandre Rafalovitch
>  wrote:
> > Solr uses Tika directly. And not in the most efficient way. It is
> > there mostly for convenience rather than performance.
> >
> > So, for performance, Solr recommendation is also to run Tika
> > separately and only send Solr the processed documents.
> >
> > Regards,
> > Alex.
> > 
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 10 February 2016 at 09:46, Steven White  wrote:
> >> Hi folks,
> >>
> >> I'm writing a file-system-crawler that will index files.  The file
> system
> >> is going to be very busy and I anticipate on average 10 new updates per
> >> min.  My application checks for new or updated files once every 1 min.
> I
> >> use Tika to extract the raw-text off those files and send them over to
> Solr
> >> for indexing.  My application will be running 24x7xN-days.  It will not
> >> recycle unless if the OS is restarted.
> >>
> >> Over at Tika mailing list, I was told the following:
> >>
> >> "As a side note, if you are handling a bunch of files from the wild in a
> >> production environment, I encourage separating Tika into a separate jvm
> vs
> >> tying it into any post processing – consider tika-batch and writing
> >> separate text files for each file processed (not so efficient, but
> >> exceedingly robust).  If this is demo code or you know your document set
> >> well enough, you should be good to go with keeping Tika and your
> >> postprocessing steps in the same jvm."
> >>
> >> My question is, how does Solr utilize Tika?  Does it run Tika in its own
> >> JVM as an out-of-process application or does it link with Tika JARs
> >> directly?  If it links in directly, are there known issues with Solr
> >> integrated with Tika because of Tika issues?
> >>
> >> Thanks
> >>
> >> Steve
>


Re: replicate indexing to second site

2016-02-09 Thread Walter Underwood
I agree. If the system updates synchronously, then you are in two-phase commit 
land. If you have a persistent store that each index can track, then things are 
good.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 9, 2016, at 7:37 PM, Shawn Heisey  wrote:
> 
> On 2/9/2016 5:48 PM, Walter Underwood wrote:
>> Updating two systems in parallel gets into two-phase commit, instantly. So 
>> you need a persistent pool of updates that both clusters pull from.
> 
> My indexing system does exactly what I have suggested for tedsolr -- it
> updates multiple copies of my index in parallel.  My data source is MySQL.
> 
> For each copy, information about the last successful update is
> separately tracked, so if one of the index copies goes offline, the
> other stays current.  When the offline system comes back, it will be
> updated from the saved position, and will eventually have the same
> information as the system that did not go offline.
> 
> As far as two-phase commit goes, that would make it so that neither copy
> of the index would stay current if one of them went offline.  In most
> situations I can think of, that's not really very useful.
> 
> Thanks,
> Shawn
> 



Re: solr performance issue

2016-02-09 Thread Zheng Lin Edwin Yeo
1 million documents isn't considered big for Solr. How much RAM does your
machine have?

Regards,
Edwin

On 8 February 2016 at 23:45, Susheel Kumar  wrote:

> 1 million documents shouldn't have any issues at all.  Something else is
> wrong with your hw/system configuration.
>
> Thanks,
> Susheel
>
> On Mon, Feb 8, 2016 at 6:45 AM, sara hajili  wrote:
>
> > On Mon, Feb 8, 2016 at 3:04 AM, sara hajili 
> wrote:
> >
> > > sorry, i made a mistake: i have about 1000 K docs.
> > > i mean about 100 docs.
> > >
> > > On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic <
> > > emir.arnauto...@sematext.com> wrote:
> > >
> > >> Hi Sara,
> > >> Not sure if I am reading this right, but I read it as you have 1000
> doc
> > >> index and issues? Can you tell us bit more about your setup: number of
> > >> servers, hw, index size, number of shards, queries that you run, do
> you
> > >> index at the same time...
> > >>
> > >> It seems to me that you are running Solr on server with limited RAM
> and
> > >> probably small heap. Swapping for sure will slow things down and GC is
> > most
> > >> likely reason for high CPU.
> > >>
> > >> You can use http://sematext.com/spm to collect Solr and host metrics
> > and
> > >> see where the issue is.
> > >>
> > >> Thanks,
> > >> Emir
> > >>
> > >> --
> > >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > >> Solr & Elasticsearch Support * http://sematext.com/
> > >>
> > >>
> > >>
> > >> On 08.02.2016 10:27, sara hajili wrote:
> > >>
> > >>> hi all.
> > >>> i have a problem with my Solr performance and hardware usage (RAM, CPU, ...).
> > >>> i have a lot of documents indexed, about 1000 docs in Solr, and every doc
> > >>> has about 8 fields on average.
> > >>> each field has about 60 chars.
> > >>> i set my fields as stored="false" except for 1 field. // i read that this
> > >>> helps performance.
> > >>> i used copy fields and dynamic fields only where necessary. // i read that
> > >>> this helps performance.
> > >>> and now my question is: when i run a lot of queries against Solr, it uses
> > >>> more CPU and RAM, and once those are filled it uses a lot of swap space
> > >>> and then the hard disk, but doesn't create a system file! Solr fills the
> > >>> hard disk until i am forced to restart the server to release the disk.
> > >>> why does Solr behave this way, and how can i keep Solr from using so much
> > >>> CPU and disk space?
> > >>> is any config needed?!
> > >>>
> > >>>
> > >>
> > >
> >
>


Re: Solr architecture

2016-02-09 Thread Mark Robinson
Thanks for your replies and suggestions!

Why do I store all events related to a session under one doc?
Each session can have about 500 total entries (events) corresponding to it.
So when I try to retrieve a session's info it can come back with around 500
records. If it is compounded into one doc per session, I can retrieve more
sessions at a time.
e.g. under a sessionId, an array of eventA activities, eventB activities
(using JSON). When an eventA activity occurs again, we read all the data
for that session, append the new info to the eventA data and push the whole
session-related data back to Solr (reindexing it). Like this for many
sessions in parallel.
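Roughly, each cycle looks like the sketch below in SolrJ. This is only an
illustration of the read-append-reindex flow, not our actual code; the
collection and field names are made up, and real-time get (getById) is used so
the latest version of the session doc is seen even before a commit:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SessionUpdater {

    private final CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181/solr");

    public SessionUpdater() {
        solr.setDefaultCollection("sessions");
    }

    public void appendEvent(String sessionId, Object eventData) throws Exception {
        // Real-time get: fetch the current session doc by its unique key.
        SolrDocument current = solr.getById(sessionId);

        SolrInputDocument updated = new SolrInputDocument();
        updated.setField("sessionId", sessionId);
        if (current != null) {
            for (String field : current.getFieldNames()) {
                if (!"_version_".equals(field) && !"sessionId".equals(field)) {
                    updated.setField(field, current.getFieldValue(field));
                }
            }
        }
        updated.addField("eventA", eventData);    // append the new event's details

        solr.add(updated);                        // reindex the whole session doc
    }
}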


Why NRT?
Many sessions are being written in parallel (4 Million sessions, hence 4 Million
docs per day). A person can run this query at any time.

Is it just a lookup?
Yes. We just need to retrieve all info for a session and pass it on to
another system. We may even do some extra querying on data like
timestamps, page URL, etc. in the info added to a session.

We are thinking of keeping the data separate from the actual Solr instance and
specifying the location of the dataDir in solrconfig.xml.

If Solr is not a good option, could you please suggest something which will
satisfy this use case with minimal response time while querying.

Thanks!
Mark

On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins 
wrote:

> So as I understand your use case, its effectively logging actions within a
> user session, why do you have to do the update in NRT?  Why not just log
> all the user session events (with some unique key, and ensuring the session
> Id is in the document somewhere), then when you want to do the query, you
> join on the session id, and that gives you all the data records for that
> session. I don't really follow why it has to be 1 document (which you
> continually update). If you really need that aggregation, couldn't that
> happen offline?
>
> I guess your 1 saving grace is that you query using the unique ID (in your
> scenario) so you could use the real-time get handler, since you aren't
> doing a complex query (strictly its not a search, its a raw key lookup).
>
> But I would still question your use case, if you go the Solr route for that
> kind of scale with querying and indexing that much, you're going to have to
> throw a lot of hardware at it, as Jack says probably in the order of
> hundreds of machines...
>
> On 9 February 2016 at 19:00, Upayavira  wrote:
>
> > Bear in mind that Lucene is optimised towards high read lower write.
> > That is, it puts in a lot of effort at write time to make reading
> > efficient. It sounds like you are going to be doing far more writing
> > than reading, and I wonder whether you are necessarily choosing the
> > right tool for the job.
> >
> > How would you later use this data, and what advantage is there to
> > storing it in Solr?
> >
> > Upayavira
> >
> > On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:
> > > Hi,
> > > Thanks for all your suggestions. I took some time to get the details to
> > > be
> > > more accurate. Please find what I have gathered:-
> > >
> > > My data being indexed is something like this.
> > > I am basically capturing all data related to a user session.
> > > Inside a session I have categorized my actions like actionA, actionB
> > > etc..,
> > > per page.
> > > So each time an action pertaining to say actionA or actionB etc.. (in
> > > each
> > > page) happens, it is updated in Solr under that session (sessionId).
> > >
> > > So in short there is only one doc pertaining to a single session
> > > (identified by sessionid) in my Solr index and that is retrieved and
> > > updated
> > > whenever a new action under that session occurs.
> > > We expect upto 4Million session per day.
> > >
> > > On an average *one session's* *doc has a size* of *3MB to 20MB*.
> > > So if it is *4Million sessions per day*, each session writing around
> *500
> > > times to Solr*, it is* 2Billion writes or (indexing) per day to Solr*.
> > > As it is one doc per session, it is *4Million docs per day*.
> > > This is around *80K docs indexed per second* during *peak* hours and
> > > around *15K
> > > docs indexed per second* into Solr during* non-peak* hours.
> > > Number of queries per second is around *320 queries per second*.
> > >
> > >
> > > 1. Average size of a doc
> > >  3MB to 20MB
> > > 2. Query types:-
> > >  Until that session is in progress, whatever data is there for that
> > > session so far is queried and the new action's details captured and
> > > appended to existing data already capturedrelated to that
> session
> > > and indexed back into Solr. So, longer the session the more data
> > > retrieved
> > > for each subsequent query to get current data captured for that
> session.
> > >  Also querying can be done on timestamp etc... which is captured
> > >  along
> > > with each action.
> > > 3. Are docs grouped somehow?
> > >  All data related to 

RE: How is Tika used with Solr

2016-02-09 Thread Allison, Timothy B.
I have one answer here [0], but I'd be interested to hear what Solr 
users/devs/integrators have experienced on this topic.

[0] 
http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
 

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Tuesday, February 09, 2016 6:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How is Tika used with Solr

Thank you Erick and Alex.

My main question is with a long running process using Tika in the same JVM as 
my application.  I'm running my file-system-crawler in its own JVM (not 
Solr's).  On Tika mailing list, it is suggested to run Tika's code in it's own 
JVM and invoke it from my file-system-crawler using Runtime.getRuntime().exec().

I fully understand from Alex suggestion and link provided by Erick to use Tika 
outside Solr.  But what about using Tika within the same JVM as my 
file-system-crawler application or should I be making a system call to invoke 
another JAR, that runs in its own JVM to extract the raw text?  Are there known 
issues with Tika when used in a long running process?

Steve




Re: How is Tika used with Solr

2016-02-09 Thread Erick Erickson
My impulse would be to _not_ run Tika in its own JVM, just catch any
exceptions in my code and "do the right thing". I'm not sure I see any
real benefit in yet another JVM.
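
For instance, a minimal sketch of that in-process approach using Tika's facade
API (org.apache.tika.Tika). The crawler wiring and class name here are made up,
but the Tika and NIO calls are standard:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class CrawlerExtractor {

    private final Tika tika = new Tika();

    public void extractAll(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    String text = tika.parseToString(file.toFile());
                    // hand "text" off to the Solr indexing code here
                } catch (IOException | TikaException e) {
                    // "do the right thing": log the bad file and keep crawling
                }
            });
        }
    }
}

The caveat from the Tika list still applies: a pathological document can, in
rare cases, hang or take down the whole JVM, which is presumably the argument
for the separate-process route when robustness matters more than simplicity.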

FWIW,
Erick

On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B.  wrote:
> I have one answer here [0], but I'd be interested to hear what Solr 
> users/devs/integrators have experienced on this topic.
>
> [0] 
> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
>
> -Original Message-
> From: Steven White [mailto:swhite4...@gmail.com]
> Sent: Tuesday, February 09, 2016 6:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How is Tika used with Solr
>
> Thank you Erick and Alex.
>
> My main question is with a long running process using Tika in the same JVM as 
> my application.  I'm running my file-system-crawler in its own JVM (not 
> Solr's).  On Tika mailing list, it is suggested to run Tika's code in it's 
> own JVM and invoke it from my file-system-crawler using 
> Runtime.getRuntime().exec().
>
> I fully understand from Alex suggestion and link provided by Erick to use 
> Tika outside Solr.  But what about using Tika within the same JVM as my 
> file-system-crawler application or should I be making a system call to invoke 
> another JAR, that runs in its own JVM to extract the raw text?  Are there 
> known issues with Tika when used in a long running process?
>
> Steve
>
>


Re: How to use DocValues with TextField

2016-02-09 Thread Alok Bhandari
Hello Harry,

Sorry for the delayed reply; I took another approach, giving the user a
different usability option, as I did not have a solution for this. But your
option looks great, I will try it out.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-use-DocValues-with-TextField-tp4248647p4256316.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: replicate indexing to second site

2016-02-09 Thread Alexandre Rafalovitch
This issue might be similar to what Apple presented at the closing
keynote at Solr Revolution 2014. I believe they used a queue at each
site feeding into Solr. The presentation should be online.

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 10 February 2016 at 07:43, tedsolr  wrote:
> I have a Solr Cloud cluster (v5.2.1) using a Zookeeper ensemble in my primary
> data center. I am now trying to plan for disaster recovery with an available
> warm site. I have read (many times) the disaster recovery section in the
> Apache ref guide. I suppose I don't fully understand it.
>
> What I'd like to know is the best way to sync up the existing data, and the
> best way to keep that data in sync. Assume that the warm site is an exact
> copy (not at the network level) of the production cluster - so the same
> servers with the same config. All servers are virtual. The use case is the
> active cluster goes down and cannot be repaired, so the warm site would
> become the active site. This is a manual process that takes many hours to
> accomplish (I just need to fit Solr into this existing process, I can't
> change the process :).
>
> I expect that rsync can be used initially to copy the collection data
> folders and the zookeeper data and transaction log folders. So after
> verifying Solr/ZK is functional after the install, shut it down and perform
> the copy. This may sound slow but my production index size is < 100GB. Is
> this approach reasonable?
>
> So now to keep the warm site in sync, I could use rsync on a scheduled basis
> but I assume there's a better way. The ref guide says to send all indexing
> requests to the second cluster at the same time they are sent to the active
> cluster. I use SolrJ for all requests. So would this entail using a second
> CloudSolrClient instance that only knows about the second cluster? Seems
> reasonable but I don't want to lengthen the response time for the users. Is
> this just a software problem to work out (separate thread)? Or is there a
> SolrJ solution (asyc calls)?
>
> Thanks!!
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/replicate-indexing-to-second-site-tp4256240.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: replicate indexing to second site

2016-02-09 Thread Shawn Heisey
On 2/9/2016 5:48 PM, Walter Underwood wrote:
> Updating two systems in parallel gets into two-phase commit, instantly. So 
> you need a persistent pool of updates that both clusters pull from.

My indexing system does exactly what I have suggested for tedsolr -- it
updates multiple copies of my index in parallel.  My data source is MySQL.

For each copy, information about the last successful update is
separately tracked, so if one of the index copies goes offline, the
other stays current.  When the offline system comes back, it will be
updated from the saved position, and will eventually have the same
information as the system that did not go offline.
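
As an illustration only (not my actual code), the per-copy tracking can be as
simple as one row per target in the source database; the table and column
names below are made up, and the ON DUPLICATE KEY syntax assumes MySQL with
"target" as the primary key:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PositionTracker {

    private final Connection db;

    public PositionTracker(Connection db) {
        this.db = db;
    }

    // Where did this target (e.g. "primary" or "warm-site") last get to?
    public long lastIndexedId(String target) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT last_id FROM index_position WHERE target = ?")) {
            ps.setString(1, target);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }

    // Called only after a batch has been successfully indexed into that target.
    public void savePosition(String target, long lastId) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO index_position (target, last_id) VALUES (?, ?) "
              + "ON DUPLICATE KEY UPDATE last_id = VALUES(last_id)")) {
            ps.setString(1, target);
            ps.setLong(2, lastId);
            ps.executeUpdate();
        }
    }
}

The indexing loop then asks each target for its own position, fetches rows
newer than that from the source, and only advances that target's position
after a successful update, which is what lets one copy fall behind and catch
up without affecting the other.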

As far as two-phase commit goes, that would make it so that neither copy
of the index would stay current if one of them went offline.  In most
situations I can think of, that's not really very useful.

Thanks,
Shawn



Re: Multi-lingual search

2016-02-09 Thread Modassar Ather
And what does proximity search exactly mean?

A proximity search means searching for terms within a given distance of each
other. E.g. search for a document which has "java" within 3 words of "network":
field:"java network"~3
The above query will match any document where "java" and "network" occur
within 3 positions of each other.
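
If it helps, the same proximity query issued from SolrJ might look like the
following sketch; the Solr URL and the field name "content" are just
placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ProximityQueryExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
        SolrQuery query = new SolrQuery("content:\"java network\"~3");   // proximity of 3
        QueryResponse response = client.query(query);
        System.out.println("Matches: " + response.getResults().getNumFound());
        client.close();
    }
}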

Can i implement proximity search if i use
>seperate core per language
>field per language
>multilingual field that supports all languages.
A proximity search is run against a single field, so it works with any of
those options; it does not matter whether the field is in the same core or a
different one.

Searching for the word "walk" when "walking" is indexed should fetch and
display the record?
It will be handled by the stemming filter, right?
Stemming reduces a word to its root form. So yes, if the query term and the
indexed term reduce to the same root, the search will match.

Hope this helps.

Best,
Modassar


On Tue, Feb 9, 2016 at 12:58 PM, vidya  wrote:

> Hi
>   Can i implement proximity search if i use
> >seperate core per language
> >field per language
> >multilingual field that supports all languages.
>
> And what does proximity search exactly mean?
>
> searching for walk word when walking is indexed,should fetch and display
> the
> record?
> It will be included in stemming filter.right?
>
> Thanks in advance
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Multi-lingual-search-tp4254398p4256094.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: replicate indexing to second site

2016-02-09 Thread Arcadius Ahouansou
Hello Ted.

We have a similar requirement to deploy Solr across 2 DCs.
In our case, the DCs are connected via fibre optic.

We managed to deploy a single SolrCloud cluster across multiple DCs without
any major issue (see links below).

The whole set-up is described in the following articles:

-
http://menelic.com/2015/11/21/deploying-solrcloud-across-multiple-data-centers-dc/

-
http://menelic.com/2015/12/04/deploying-solrcloud-across-multiple-data-centers-dc-performance/

-
http://menelic.com/2015/12/05/allowing-solrj-cloudsolrclient-to-have-preferred-replica-for-query-operations/

- Here is the main issue we had to deal with:
http://menelic.com/2015/12/30/zookeeper-shutdown-leader-reason-not-sufficient-followers-synced-only-synced-with-sids/


I believe that if your DCs are well connected, you can have a single
SolrCloud cluster spanning across multiple DCs.

Arcadius.





On 10 February 2016 at 04:15, Walter Underwood 
wrote:

> I agree. If the system updates synchronously, then you are in two-phase
> commit land. If you have a persistent store that each index can track, then
> things are good.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Feb 9, 2016, at 7:37 PM, Shawn Heisey  wrote:
> >
> > On 2/9/2016 5:48 PM, Walter Underwood wrote:
> >> Updating two systems in parallel gets into two-phase commit, instantly.
> So you need a persistent pool of updates that both clusters pull from.
> >
> > My indexing system does exactly what I have suggested for tedsolr -- it
> > updates multiple copies of my index in parallel.  My data source is
> MySQL.
> >
> > For each copy, information about the last successful update is
> > separately tracked, so if one of the index copies goes offline, the
> > other stays current.  When the offline system comes back, it will be
> > updated from the saved position, and will eventually have the same
> > information as the system that did not go offline.
> >
> > As far as two-phase commit goes, that would make it so that neither copy
> > of the index would stay current if one of them went offline.  In most
> > situations I can think of, that's not really very useful.
> >
> > Thanks,
> > Shawn
> >
>
>


-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---


Re: CorruptIndexException during optimize.

2016-02-09 Thread Modassar Ather
Hi,

Kindly provide your inputs on the issue.

Thanks,
Modassar

On Mon, Feb 1, 2016 at 12:40 PM, Modassar Ather 
wrote:

> Hi,
>
> Got the following error during an optimize of the index on 2 nodes of a 12
> node cluster. Please let me know if the index can be recovered, how to do
> so, and what could be the reason?
> Total number of nodes: 12
> No replica.
> Solr version - 5.4.0
> Java version - 1.7.0_91 (Open JDK 64 bit)
> Ubuntu version : Ubuntu 14.04.3 LTS
>
> 2016-01-31 20:00:31.211 ERROR (qtp1698904557-9710) [c:core s:shard4
> r:core_node3 x:core] o.a.s.h.RequestHandlerBase java.io.IOException:
> Invalid vInt detected (too many bits)
> at org.apache.lucene.store.DataInput.readVInt(DataInput.java:141)
> at
> org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer.readNumericEntry(Lucene54DocValuesProducer.java:355)
> at
> org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer.readFields(Lucene54DocValuesProducer.java:243)
> at
> org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer.<init>(Lucene54DocValuesProducer.java:122)
> at
> org.apache.lucene.codecs.lucene54.Lucene54DocValuesFormat.fieldsProducer(Lucene54DocValuesFormat.java:113)
> at
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.<init>(PerFieldDocValuesFormat.java:268)
> at
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat.fieldsProducer(PerFieldDocValuesFormat.java:358)
> at
> org.apache.lucene.index.SegmentDocValues.newDocValuesProducer(SegmentDocValues.java:51)
> at
> org.apache.lucene.index.SegmentDocValues.getDocValuesProducer(SegmentDocValues.java:67)
> at
> org.apache.lucene.index.SegmentReader.initDocValuesProducer(SegmentReader.java:147)
> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:81)
> at
> org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
> at
> org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
> at
> org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
> at
> org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
> at
> org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3161)
> at
> org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3147)
> at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3124)
> at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3087)
> at
> org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1741)
> at
> org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1721)
> at
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:590)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:62)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1612)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1589)
> at
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:64)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:457)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>