Re: Overall large size in Solr across collections

2016-04-19 Thread Zheng Lin Edwin Yeo
Thanks for the information Shawn.

I believe it could be due to the types of files being indexed.
Currently, I'm indexing EML files which are in HTML format, and they
are richer in content (with inline images and full text), whereas
previously the EML files were in plain-text format, with the images as
attachments.

Could this be the cause of the slow indexing speed I'm facing now? It
is more than three times slower than what I had previously.

Regards,
Edwin


On 20 April 2016 at 01:41, Shawn Heisey  wrote:

> On 4/19/2016 9:28 AM, Zheng Lin Edwin Yeo wrote:
> > Currently, the searching performance is still doing fine, but it is the
> > indexing that is slowing down. Not sure if increasing the RAM or changing
> > to an SSD hard disk will help with the indexing speed?
>
> You need to figure out exactly what is slow -- is it actual indexing,
> merging segments, or is it commits?
>
> If it's commits, then the work will be to speed those up.  Reducing or
> eliminating autowarming is one thing you can do.  Putting the index on
> SSD might also help.
>
> Slow merging might be improved with SSD storage.
>
> If it's actual indexing speed (independent of merging and commits)
> that's slow, then there are a lot of potential reasons.  You'll need to
> nail down exactly where the bottleneck is.  I'm not even sure what
> questions to ask on the road to figuring this out.
>
> Thanks,
> Shawn
>
>


Re: Storing different collection on different hard disk

2016-04-19 Thread Zheng Lin Edwin Yeo
Thanks for your info.

I tried to set it, but Solr is not able to find the indexes, and I get the
following error:

   - *collection1:*
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
   java.io.IOException: The filename, directory name, or volume label syntax
   is incorrect


Is this the correct way to set it in the core.properties file?
dataDir="D:\collection1"

Also, do we need to set the dataDir in solrconfig.xml as well?

Regards,
Edwin


On 19 April 2016 at 19:36, Alexandre Rafalovitch  wrote:

> Have you tried setting dataDir parameter in the core.properties file?
> https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 19 April 2016 at 20:43, Zheng Lin Edwin Yeo 
> wrote:
> > Hi,
> >
> > I would like to find out if it is possible to store the index files of
> > different collections on different hard disks.
> > For example, I want to store the indexes of collection1 on Hard Disk 1,
> > and the indexes of collection2 on Hard Disk 2.
> >
> > I am using Solr 5.4.0
> >
> > Regards,
> > Edwin
>


Re: Facet heatmaps: cluster coordinates based on average position of docs

2016-04-19 Thread David Smiley
Hi Anton,

Perhaps you should request a more detailed / high-res heatmap, and then
work with that, perhaps using some clustering technique?  I confess I don't
work on the UI end of things these days.
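
For reference, a finer-grained heatmap can be requested along these lines
(the field name "location_srpt" and the grid level here are made-up values):

q=*:*&facet=true&facet.heatmap=location_srpt&facet.heatmap.gridLevel=6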

p.s. I'm on vacation this week; so I don't respond quickly

~ David

On Thu, Apr 7, 2016 at 3:43 PM Anton K.  wrote:

> I am working with the new Solr feature: facet heatmaps. It works great: I
> create clusters on my map with counts. When a user clicks on a cluster, I zoom
> in to that area and might show more clusters or documents (based on the
> current zoom level).
>
> But all my cluster icons (I use round ones, see screenshot below) are placed
> right in the center of the clusters' rectangles:
>
> https://dl.dropboxusercontent.com/u/1999619/images/map_grid3.png
>
> Some clusters can end up in the sea, and so on. It also feels unnatural in my
> case to have icons placed so orderly on the world map.
>
> I want to place each cluster's icon at the average coordinates of all my docs
> inside that cluster. Is there any way to achieve this? I am trying to use the
> stats component with facet heatmaps, but it isn't implemented yet.
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


how to restrict phrase to appear in same child document

2016-04-19 Thread Yangrui Guo
Hello,

I have a nested document type in my index. Here's the structure of my
document:

{
  id:
  {
    car:
    color:
  }
  {
    driver:
    color:
  }
}

However, when I use the query q={!parent
which="content_type:parent"}+(black AND driver)&fq={!parent
which="content_type:parent"}+(white AND mercedes), the results also
contain a white driver with a black Mercedes. I know I can put fields before
terms, but that is not always easy to do; users might just enter one string.
How can I modify my query to require that the terms between two parentheses
appear in the same child document, or boost those that meet the criteria?
Thanks


RE: Streaming with facets

2016-04-19 Thread Davis, Daniel (NIH/NLM) [C]
Thanks, Yonik, that makes great sense.

My understanding of "many parts of Solr can already stream" is that not all
sets of SearchHandler parameters are equal.  One set of SearchHandler
parameters may be best for classic <1 second web search; another set may be
best for returning just analytic computed facets over 10k rows, or even more.

I now understand StreamHandler and its relation to JDBC, and I am staying
away from it for now because it doesn't fit my application.

-Original Message-
From: Yonik Seeley [mailto:ysee...@gmail.com] 
Sent: Tuesday, April 19, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Streaming with facets

Part of the difficulty is that "stream" and "streaming" are rather overloaded 
terms.  Many parts of Solr can already stream, with varying degrees of how much 
state is aggregated / internally collected before "streaming" starts.

Faceting can be truly streamed *if* the sort order is by the bucket value 
ascending, since that is the order contained in the lucene index.  All of the 
rest of the bucket information can be computed on the fly as it is being sent 
out.  This is what the JSON Facet API does when method="stream".

We could extend the current facet streaming for other sorts... this would 
involve calculating & sorting the sort criteria first, and then streaming after 
that point (i.e. other metrics would be calculated on-the-fly as facet buckets 
are being streamed).

-Yonik


On Tue, Apr 19, 2016 at 4:48 PM, Davis, Daniel (NIH/NLM) [C] 
 wrote:
> So, can someone clarify how faceting works with streaming expressions?
>
> I can see how document search can return documents as it finds them, using 
> any particular ordering desired - just a parse tree of query operators with 
> priority queues (or something more complicated) within each query operator, 
> so you really get the best match as you go for as long as you continue.
>
> For facet values, without knowing Solr's internals, my intuition is that Solr 
> could stream unique facet values, but not counts of matching documents.
>
> Even when I put on my user hat, I don't see how the Streaming API can return
> both facet values and documents; it looks like it is either documents or
> facet values as results.
>
> Dan Davis, Systems/Applications Architect (Contractor), Office of 
> Computer and Communications Systems, National Library of Medicine, NIH
>


Re: Streaming with facets

2016-04-19 Thread Yonik Seeley
Part of the difficulty is that "stream" and "streaming" are rather
overloaded terms.  Many parts of Solr can already stream, with varying
degrees of how much state is aggregated / internally collected before
"streaming" starts.

Faceting can be truly streamed *if* the sort order is by the bucket
value ascending, since that is the order contained in the lucene
index.  All of the rest of the bucket information can be computed on
the fly as it is being sent out.  This is what the JSON Facet API does
when method="stream".
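
As a sketch, a JSON Facet request of roughly this shape (the field name "cat"
is just an example) streams buckets in index order:

json.facet={
  categories : {
    type : terms,
    field : cat,
    sort : "index asc",
    method : "stream"
  }
}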

We could extend the current facet streaming for other sorts... this
would involve calculating & sorting the sort criteria first, and then
streaming after that point (i.e. other metrics would be calculated
on-the-fly as facet buckets are being streamed).

-Yonik


On Tue, Apr 19, 2016 at 4:48 PM, Davis, Daniel (NIH/NLM) [C]
 wrote:
> So, can someone clarify how faceting works with streaming expressions?
>
> I can see how document search can return documents as it finds them, using 
> any particular ordering desired - just a parse tree of query operators with 
> priority queues (or something more complicated) within each query operator, 
> so you really get the best match as you go for as long as you continue.
>
> For facet values, without knowing Solr's internals, my intuition is that Solr 
> could stream unique facet values, but not counts of matching documents.
>
> Even when I put on my user hat, I don't see how the Streaming API can return
> both facet values and documents; it looks like it is either documents or
> facet values as results.
>
> Dan Davis, Systems/Applications Architect (Contractor),
> Office of Computer and Communications Systems,
> National Library of Medicine, NIH
>


Streaming with facets

2016-04-19 Thread Davis, Daniel (NIH/NLM) [C]
So, can someone clarify how faceting works with streaming expressions?

I can see how document search can return documents as it finds them, using any 
particular ordering desired - just a parse tree of query operators with 
priority queues (or something more complicated) within each query operator, so 
you really get the best match as you go for as long as you continue.

For facet values, without knowing Solr's internals, my intuition is that Solr 
could stream unique facet values, but not counts of matching documents.

Even when I put on my user hat, I don't see how the Streaming API can return
both facet values and documents; it looks like it is either documents or facet
values as results.

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH



Re: Return only parent on child query match (w/o block-join)

2016-04-19 Thread Susmit Shukla
Hi Shamik,

You could try Solr grouping using the group.query construct. You could discard
the child match from the result (i.e. any doc that has a parent_doc_id field)
and use a join to fetch the parent record:

q=*:*&group=true&group.query=title:title2&group.query={!join
from=parent_doc_id to=doc_id}parent_doc_id:*&group.limit=10

Thanks,
Susmit


On Tue, Apr 19, 2016 at 10:29 AM, Shamik Bandopadhyay 
wrote:

> Hi,
>
>I have a set of documents indexed which has a pseudo parent-child
> relationship. Each child document had a reference to the parent document.
> Due to document availability complexity (and the condition of updating both
> parent-child documents at the time of indexing), I'm not able to use
> explicit block-join.Instead of a nested structure, they are all flat.
> Here's an example:
>
> <doc>
>   <field name="id">1</field>
>   <field name="title">Parent title</field>
>   <field name="doc_id">123</field>
> </doc>
> <doc>
>   <field name="id">2</field>
>   <field name="title">Child title1</field>
>   <field name="parent_doc_id">123</field>
> </doc>
> <doc>
>   <field name="id">3</field>
>   <field name="title">Child title2</field>
>   <field name="parent_doc_id">123</field>
> </doc>
> <doc>
>   <field name="id">4</field>
>   <field name="title">Misc title2</field>
> </doc>
>
> > What I'm looking for is: if I search "title2", the result should bring back
> > the following two docs, one matching the parent and one based on a regular match.
>
> <doc>
>   <field name="id">1</field>
>   <field name="title">Parent title</field>
>   <field name="doc_id">123</field>
> </doc>
> <doc>
>   <field name="id">4</field>
>   <field name="title">Misc title2</field>
> </doc>
>
> > With block-join support, I could have used the Block Join Parent Query Parser:
> > q={!parent which="content_type:parentDocument"}title:title2
> >
> > Transforming result documents is an alternative, but it only offers the
> > reverse direction, through ChildDocTransformerFactory.
> >
> > Just wondering if there's a way to approach this query differently. Any
> > pointers will be appreciated.
>
> -Thanks,
> Shamik
>


Re: Indexing 700 docs per second

2016-04-19 Thread Jeff Wartes

I have no numbers to back this up, but I’d expect Atomic Updates to be slightly 
slower than a full update, since the atomic approach has to retrieve the fields 
you didn't specify before it can write the new (updated) document.




On 4/19/16, 11:54 AM, "Tim Robertson"  wrote:

>Hi Mark,
>
>We were putting in and updating docs of around 20-25 indexed fields (mainly
>INTs, but some Strings and multivalue fields) at >1000/sec on far lesser
>hardware and a total of 600 million docs (batch updates of course) while
>also serving live queries for a website which had about 30 concurrent users
>steady state (not all hitting SOLR though).
>
>It seems realistic with that kind of hardware in my experience, but you
>didn't mention what else was going on that might affect it (e.g. reads).
>
>HTH,
>Tim
>
>
>On Tue, Apr 19, 2016 at 7:12 PM, Erick Erickson 
>wrote:
>
>> Make very sure you batch updates though.
>> Here's a benchmark I ran:
>> https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
>>
>> NOTE: it's not entirely clear that you want to
>> put 122M docs on a single shard. Depending on the queries
>> you'll run you may want 2 or more shards, but that depends
>> on the query pattern and your SLAs. Here's the long version
>> of "you really have to load test this":
>>
>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> Best,
>> Erick
>>
>> On Tue, Apr 19, 2016 at 6:48 AM, Susheel Kumar 
>> wrote:
>> >  It sounds achievable with your machine configuration and i would suggest
>> > to try out atomic update.  Use SolrJ with multi-threaded indexing for
>> > higher indexing rate.
>> >
>> > Thanks,
>> > Susheel
>> >
>> >
>> >
>> > On Tue, Apr 19, 2016 at 9:27 AM, Tom Evans 
>> wrote:
>> >
>> >> On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson <
>> mark123lea...@gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I have a requirement to index (mainly updates) 700 docs per second.
>> >> > Suppose I have a 128GB RAM, 32 CPU machine, with each doc sized around
>> >> > 260 bytes (6 fields, of which only 2 will be updated at the above
>> >> > rate). This collection has around 122 million docs and that count is
>> >> > pretty much a constant.
>> >> >
>> >> > 1. Can I manage this update rate with a non-sharded, i.e. single, Solr
>> >> > instance setup?
>> >> > 2. Also, is an atomic update or a full update (the whole doc) of the
>> >> > changed records the better approach in this case?
>> >> >
>> >> > Could someone please share their views/experience?
>> >>
>> >> Try it and see - everyone's data/schemas are different and can affect
>> >> indexing speed. It certainly sounds achievable enough - presumably you
>> >> can at least produce the documents at that rate?
>> >>
>> >> Cheers
>> >>
>> >> Tom
>> >>
>>


Re: Indexing 700 docs per second

2016-04-19 Thread Tim Robertson
Hi Mark,

We were putting in and updating docs of around 20-25 indexed fields (mainly
INTs, but some Strings and multivalue fields) at >1000/sec on far lesser
hardware and a total of 600 million docs (batch updates of course) while
also serving live queries for a website which had about 30 concurrent users
steady state (not all hitting SOLR though).

It seems realistic with that kind of hardware in my experience, but you
didn't mention what else was going on that might affect it (e.g. reads).

HTH,
Tim


On Tue, Apr 19, 2016 at 7:12 PM, Erick Erickson 
wrote:

> Make very sure you batch updates though.
> Here's a benchmark I ran:
> https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
>
> NOTE: it's not entirely clear that you want to
> put 122M docs on a single shard. Depending on the queries
> you'll run you may want 2 or more shards, but that depends
> on the query pattern and your SLAs. Here's the long version
> of "you really have to load test this":
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Best,
> Erick
>
> On Tue, Apr 19, 2016 at 6:48 AM, Susheel Kumar 
> wrote:
> >  It sounds achievable with your machine configuration and i would suggest
> > to try out atomic update.  Use SolrJ with multi-threaded indexing for
> > higher indexing rate.
> >
> > Thanks,
> > Susheel
> >
> >
> >
> > On Tue, Apr 19, 2016 at 9:27 AM, Tom Evans 
> wrote:
> >
> >> On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson <
> mark123lea...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I have a requirement to index (mainly updates) 700 docs per second.
> >> > Suppose I have a 128GB RAM, 32 CPU machine, with each doc sized around
> >> > 260 bytes (6 fields, of which only 2 will be updated at the above
> >> > rate). This collection has around 122 million docs and that count is
> >> > pretty much a constant.
> >> >
> >> > 1. Can I manage this update rate with a non-sharded, i.e. single, Solr
> >> > instance setup?
> >> > 2. Also, is an atomic update or a full update (the whole doc) of the
> >> > changed records the better approach in this case?
> >> >
> >> > Could someone please share their views/experience?
> >>
> >> Try it and see - everyone's data/schemas are different and can affect
> >> indexing speed. It certainly sounds achievable enough - presumably you
> >> can at least produce the documents at that rate?
> >>
> >> Cheers
> >>
> >> Tom
> >>
>


Re: Overall large size in Solr across collections

2016-04-19 Thread Shawn Heisey
On 4/19/2016 9:28 AM, Zheng Lin Edwin Yeo wrote:
> Currently, the searching performance is still doing fine, but it is the
> indexing that is slowing down. Not sure if increasing the RAM or changing
> to an SSD hard disk will help with the indexing speed?

You need to figure out exactly what is slow -- is it actual indexing,
merging segments, or is it commits?

If it's commits, then the work will be to speed those up.  Reducing or
eliminating autowarming is one thing you can do.  Putting the index on
SSD might also help.
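
For example, autowarming can be turned off by setting autowarmCount to 0 on
the caches in solrconfig.xml (the sizes shown here are only placeholders):

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>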

Slow merging might be improved with SSD storage.

If it's actual indexing speed (independent of merging and commits)
that's slow, then there are a lot of potential reasons.  You'll need to
nail down exactly where the bottleneck is.  I'm not even sure what
questions to ask on the road to figuring this out.

Thanks,
Shawn



Return only parent on child query match (w/o block-join)

2016-04-19 Thread Shamik Bandopadhyay
Hi,

   I have a set of indexed documents which have a pseudo parent-child
relationship. Each child document has a reference to the parent document.
Due to document availability complexity (and the requirement to update both
parent and child documents at indexing time), I'm not able to use an
explicit block-join. Instead of a nested structure, they are all flat.
Here's an example:


<doc>
  <field name="id">1</field>
  <field name="title">Parent title</field>
  <field name="doc_id">123</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="title">Child title1</field>
  <field name="parent_doc_id">123</field>
</doc>
<doc>
  <field name="id">3</field>
  <field name="title">Child title2</field>
  <field name="parent_doc_id">123</field>
</doc>
<doc>
  <field name="id">4</field>
  <field name="title">Misc title2</field>
</doc>

What I'm looking for is: if I search "title2", the result should bring back the
following two docs, one matching the parent and one based on a regular match.


<doc>
  <field name="id">1</field>
  <field name="title">Parent title</field>
  <field name="doc_id">123</field>
</doc>
<doc>
  <field name="id">4</field>
  <field name="title">Misc title2</field>
</doc>

With block-join support, I could have used the Block Join Parent Query Parser:
q={!parent which="content_type:parentDocument"}title:title2

Transforming result documents is an alternative, but it only offers the
reverse direction, through ChildDocTransformerFactory.

Just wondering if there's a way to approach this query differently. Any
pointers will be appreciated.

-Thanks,
Shamik


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-19 Thread John Bickerstaff
When combining a load balancer with SolrCloud, the handler definitions
in solrconfig.xml should set preferLocalShards to true (which Tom
mentioned)

Thanks Shawn!  I was wondering where to set this...

Yup - my IT guy is sharp, sharp, sharp -- nice to get this confirmation
from the list...

On Tue, Apr 19, 2016 at 7:59 AM, Shawn Heisey  wrote:

> On 4/18/2016 11:22 AM, John Bickerstaff wrote:
> > So - my IT guy makes the case that we don't really need Zookeeper / Solr
> > Cloud...
> 
> > I'm biased in terms of using the most recent functionality, but I'm aware
> > that bias is not necessarily based on facts and want to do my due
> > diligence...
> >
> > Aside from the obvious benefits of spreading work across nodes (which may
> > not be a big deal in our application and which my IT guy proposes is more
> > transparently handled with a load balancer he understands) are there any
> > other considerations that would drive a choice for Solr Cloud (zookeeper
> > etc)?
>
> Erick has a point.  If your IT guy feels comfortable with a load
> balancer, he should go ahead and set that up.
>
> For a new install like you're describing, I would probably still use
> SolrCloud on the back end, even with a load balancer.
>
> As Daniel said, a non-cloud replicated setup requires configuration of
> masters and slaves.  Instead of replication, you could go with a build
> system that sends updates to each copy of the index independently.
>
> When using replication, switching master/slave roles in the event of a
> master server failure is not trivial.  SolrCloud handles all that,
> making multi-server management a LOT easier.  Initial setup is slightly
> more complicated due to zookeeper, and configuration management requires
> an "upload to zookeeper" step ... but I do not think these are not high
> hurdles considering how much easier it is to manage multiple servers.
>
> With the deployment you have described (which I trimmed out of this
> reply), I think you'd be fine running a standalone zookeeper process on
> three of your Solr servers, so you won't even need a bunch of extra
> hardware.
>
> When combining a load balancer with SolrCloud, the handler definitions
> in solrconfig.xml should set preferLocalShards to true (which Tom
> mentioned) so the load balancer target is the machine that actually
> processes the request.  Troubleshooting becomes more difficult if you
> don't do this, and avoiding the extra network hop will help performance.
>
> Thanks,
> Shawn
>
>


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-19 Thread John Bickerstaff
@Charlie

It's easy to do and wow does it save time and database resources...

I've built a Spring Boot Micro-services architecture that also registers in
Zookeeper.  One micro-service pulls from the original data source and
pushes to Kafka.  The second micro-service pulls from Kafka into SOLR.

Because they're registered in Zookeeper, the micro-services can be brought
up anywhere in the infrastructure I'm building and "rebuild" SOLR indices
from scratch.

I.e. if you lose Solr completely, just bring up a new VM copy with an empty
index, start your microservice, and rebuild the index from scratch.

We're dropping it all into AWS eventually.

It's sweet.  The original "run" to consolidate the data from various
databases takes over an hour -- IF the load on production is light. Running
out of Kafka takes less than 10 minutes and totally avoids loading
production databases.

If you're interested, ping me -- I'm happy to share what I've got...

On Tue, Apr 19, 2016 at 2:08 AM, Charlie Hull  wrote:

> On 18/04/2016 18:22, John Bickerstaff wrote:
>
>> So - my IT guy makes the case that we don't really need Zookeeper / Solr
>> Cloud...
>>
>> He may be right - we're serving static data (changes to the collection
>> occur only 2 or 3 times a year and are minor)
>>
>> We probably could have 3 or 4 Solr nodes running in non-Cloud mode -- each
>> configured the same way, behind a load balancer and do fine.
>>
>> I've got a Kafka server set up with the solr docs as topics.  It takes
>> about 10 minutes to reload a "blank" Solr Server from the Kafka topic...
>> If I target 3-4 Solr servers from my microservice instead of one, it
>> wouldn't take much longer than 10 minutes to concurrently reload all
>> 3 or 4 Solr servers from scratch...
>>
>
> This is something we've been discussing as a concept - to offload all the
> scaling stuff to Kafka (which is very good at that sort of thing) and
> simply hang Solr instances onto a Kafka topic. We've not taken it any
> further than a concept at this point but interesting to hear about others
> doing so!
>
> Charlie
>
>
>
>> I'm biased in terms of using the most recent functionality, but I'm aware
>> that bias is not necessarily based on facts and want to do my due
>> diligence...
>>
>> Aside from the obvious benefits of spreading work across nodes (which may
>> not be a big deal in our application and which my IT guy proposes is more
>> transparently handled with a load balancer he understands) are there any
>> other considerations that would drive a choice for Solr Cloud (zookeeper
>> etc)?
>>
>>
>>
>> On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans 
>> wrote:
>>
>> On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
>>>  wrote:
>>>
 Thanks all - very helpful.

 @Shawn - your reply implies that even if I'm hitting the URL for a
 single
 endpoint via HTTP - the "balancing" will still occur across the Solr

>>> Cloud
>>>
 (I understand the caveat about that single endpoint being a potential

>>> point
>>>
 of failure).  I just want to verify that I'm interpreting your response
 correctly...

 (I have been asked to provide IT with a comprehensive list of options

>>> prior
>>>
 to a design discussion - which is why I'm trying to get clear about the
 various options)

 In a nutshell, I think I understand the following:

 a. Even if hitting a single URL, the Solr Cloud will "balance" across
 all
 available nodes for searching
Caveat: That single URL represents a potential single point
 of
 failure and this should be taken into account

 b. SolrJ's CloudSolrClient API provides the ability to distribute load
 --
 based on Zookeeper's "knowledge" of all available Solr instances.
Note: This is more robust than "a" due to the fact that it
 eliminates the "single point of failure"

 c.  Use of a load balancer hitting all known Solr instances will be fine

>>> -
>>>
 although the search requests may not run on the Solr instance the load
 balancer targeted - due to "a" above.

 Corrections or refinements welcomed...

>>>
>>> With option a), although queries will be distributed across the
>>> cluster, all queries will be going through that single node. Not only
>>> is that a single point of failure, but you risk saturating the
>>> inter-node network traffic, possibly resulting in lower QPS and higher
>>> latency on your queries.
>>>
>>> With option b), as well as SolrJ, recent versions of pysolr have a
>>> ZK-aware SolrCloud client that behaves in a similar way.
>>>
>>> With option c), you can use the preferLocalShards so that shards that
>>> are local to the queried node are used in preference to distributed
>>> shards. Depending on your shard/cluster topology, this can increase
>>> performance if you are returning large amounts of data - many or large
>>> fields or many documents.
>>>
>>> Cheers
>>>
>>> Tom
>>>
>>>
>>
>
>

Re: Cannot use Phrase Queries in eDisMax and filtering

2016-04-19 Thread Antoine LE FLOC'H
Thanks Erick,

Looking at parseFieldBoostsAndSlop(), do you confirm that
pf3=1&
pf2=1&
is not valid, and that it has to be a multivalued list of fields?

Thank you.


On Tue, Apr 19, 2016 at 1:59 AM, Erick Erickson 
wrote:

> bq: I cannot find either the condition on the field analyzer to be able to
> use
> pf, pf2 and pf3.
>
> These don't apply to field analysis at all. What they translate into
> is a series of
> phrase queries against different sets of fields. So, you may have
> pf=fieldA^5 fieldB
> pf2=fieldA^3 fieldC
>
> Now a query like (without quotes) "big dog" would be
> translated into something like
> ...
> fieldA:"big dog"^5 fieldB:"big dog" fieldA:"big dog"^3 fieldC:"big dog"
>
> Having multiple pf fields allows you to query with different slop values,
> different boosts etc. on the same or different fields.
>
> Best,
> Erick
>
>
> On Mon, Apr 18, 2016 at 12:25 PM, Antoine LE FLOC'H 
> wrote:
> > Hello,
> >
> > I don't have the Solr source code handy, but is
> > pf3=1&
> > pf2=1&
> > valid? What would that do? Use the df or qf fields?
> >
> > This page:
> > https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
> > says that the value of pf2 is a multivalued list of fields? There are not
> > many examples about this in that link.
> >
> > I also cannot find the condition on the field analyzer required to be
> > able to use pf, pf2 and pf3.
> >
> > Feedback would be appreciated, thanks.
> >
> > Antoine.
> >
> >
> >
> >
> >
> > On Mon, Nov 3, 2014 at 8:29 PM, Ramzi Alqrainy  >
> > wrote:
> >
> >> I tried to reproduce your case on my machine with the queries below, but
> >> everything worked fine for me. I just want to ask you a question: what is
> >> the field type of the "tag" field?
> >>
> >> q=bmw&
> >> fl=score,*&
> >> wt=json&
> >> fq=city_id:59&
> >> qt=/query&
> >> defType=edismax&
> >> pf=title^15%20discription^5&
> >> pf3=1&
> >> pf2=1&
> >> ps=1&
> >> qroup=true&
> >> group.field=member_id&
> >> group.limit=10&
> >> sort=score desc&
> >> group.ngroups=true
> >>
> >>
> >>
> >>
> >> q=bmw&
> >> fl=score,*&
> >> wt=json&
> >> fq=city_id:59&
> >> qt=/query&
> >> defType=edismax&
> >> pf=title^15%20discription^5&
> >> pf3=1&
> >> pf2=1&
> >> ps=1&
> >> qroup=true&
> >> group.field=member_id&
> >> group.limit=10&
> >> group.ngroups=true&
> >> sort=score desc&
> >> fq=category_id:1777
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Cannot-use-Phrase-Queries-in-eDisMax-and-filtering-tp4167302p4167338.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>


Re: add field requires collection reload

2016-04-19 Thread Erick Erickson
bq: In yesterday's test run I actually had only one node

I think it's still the same issue: the update happens too
fast for the core reload. I don't know that for sure, mind
you...

A cheap solution would be to wait a bit before sending
the update. Clumsy but maybe good enough for now?

Or put in some retry logic. If the update fails, _then_ sleep
for a second and re-submit the batch. True, in that case
you'll be re-indexing all the docs that succeeded in the
batch, but that's usually OK.

Best,
Erick

On Mon, Apr 18, 2016 at 9:28 PM, Hendrik Haddorp
 wrote:
> Thanks, I knew I had seen a bug like this somewhere but could not find
> it yesterday.
>
> In yesterday's test run I actually had only one node and still got this
> problem. So I'll keep the collection reload until switching to 6.1 then.
>
> On 19/04/16 01:51, Erick Erickson wrote:
>> The key here is you say "sometimes". It takes a while for the reload
>> operation to propagate to _all_ the replicas that makeup your
>> collection. My bet is that by immediately indexing after changing the
>> data, your updates are getting to a core that hasn't reloaded yet.
>>
>> That said, https://issues.apache.org/jira/browse/SOLR-8662 addresses
>> this very issue I believe, but it's in 6.1
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 18, 2016 at 1:34 PM, Hendrik Haddorp
>>  wrote:
>>> Hi,
>>>
>>> I'm using SolrCloud 6.0 with a managed schema. When I add fields using
>>> SolrJ and immediately afterwards try to index data I sometimes get an
>>> error telling me that a field that I just added does not exist. If I do
>>> an explicit collection reload after the schema modification, things seem
>>> to work. Is that working as designed?
>>>
>>> According to https://cwiki.apache.org/confluence/display/solr/Schema+API
>>> a core reload will happen automatically when using the schema API: "When
>>> modifying the schema with the API, a core reload will automatically
>>> occur in order for the changes to be available immediately for documents
>>> indexed thereafter."
>>>
>>> regards,
>>> Hendrik
>


Re: Indexing 700 docs per second

2016-04-19 Thread Erick Erickson
Make very sure you batch updates though.
Here's a benchmark I ran:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
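
The gist, as a SolrJ sketch ('client' and 'docs' are placeholders, and the
batch size of 1000 is illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;

List<SolrInputDocument> batch = new ArrayList<>();
for (SolrInputDocument doc : docs) {
  batch.add(doc);
  if (batch.size() >= 1000) {   // send full batches rather than one doc at a time
    client.add(batch);
    batch.clear();
  }
}
if (!batch.isEmpty()) client.add(batch);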

NOTE: it's not entirely clear that you want to
put 122M docs on a single shard. Depending on the queries
you'll run you may want 2 or more shards, but that depends
on the query pattern and your SLAs. Here's the long version
of "you really have to load test this":
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Tue, Apr 19, 2016 at 6:48 AM, Susheel Kumar  wrote:
>  It sounds achievable with your machine configuration and i would suggest
> to try out atomic update.  Use SolrJ with multi-threaded indexing for
> higher indexing rate.
>
> Thanks,
> Susheel
>
>
>
> On Tue, Apr 19, 2016 at 9:27 AM, Tom Evans  wrote:
>
>> On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson 
>> wrote:
>> > Hi,
>> >
>> > I have a requirement to index (mainly updates) 700 docs per second.
>> > Suppose I have a 128GB RAM, 32 CPU machine, with each doc sized around
>> > 260 bytes (6 fields, of which only 2 will be updated at the above
>> > rate). This collection has around 122 million docs and that count is
>> > pretty much a constant.
>> >
>> > 1. Can I manage this update rate with a non-sharded, i.e. single, Solr
>> > instance setup?
>> > 2. Also, is an atomic update or a full update (the whole doc) of the
>> > changed records the better approach in this case?
>> >
>> > Could someone please share their views/experience?
>>
>> Try it and see - everyone's data/schemas are different and can affect
>> indexing speed. It certainly sounds achievable enough - presumably you
>> can at least produce the documents at that rate?
>>
>> Cheers
>>
>> Tom
>>


Re: Live Podcast on Solr 6 with Yonik and Erik Hatcher (Today, 2pm ET)

2016-04-19 Thread Doug Turnbull
Doh! Thanks Yonik. Yes, that's right. Thought I had double-checked.

On Tue, Apr 19, 2016 at 12:24 PM Yonik Seeley  wrote:

> Hey Doug,
> Not sure if the URL matters, but I thought it was this one:
>
>
> https://blab.im/matthew-l-overstreet-solr-6-is-available-find-out-about-what-s-new
>
> -Yonik
>
>
> On Tue, Apr 19, 2016 at 10:37 AM, Doug Turnbull
>  wrote:
> > Hey Solristas:
> >
> > We do a regular podcast called Search Disco
> > . Today we'll be discussing
> the
> > recent release of Solr 6 with Solr creator, Yonik Seeley and Solr
> committer
> > Erik Hatcher.
> >
> > *Subscribe to participate live*
> > <
> https://blab.im/matthew-l-overstreet-full-text-search-and-recommendation-engines
> >.
> > (*2PM ET today (19-APR)*)
> >
> > We use the blab  conversation platform, which will let
> you
> > chat with us, Yonik, and Erik. So bring your tough Solr questions!
> >
> > Look forward to seeing you there. And if you're interested in past
> > episodes, check them out .
> >
> > -Doug Turnbull
> > http://opensourceconnections.com
>


Re: Live Podcast on Solr 6 with Yonik and Erik Hatcher (Today, 2pm ET)

2016-04-19 Thread Yonik Seeley
Hey Doug,
Not sure if the URL matters, but I thought it was this one:

https://blab.im/matthew-l-overstreet-solr-6-is-available-find-out-about-what-s-new

-Yonik


On Tue, Apr 19, 2016 at 10:37 AM, Doug Turnbull
 wrote:
> Hey Solristas:
>
> We do a regular podcast called Search Disco
> . Today we'll be discussing the
> recent release of Solr 6 with Solr creator, Yonik Seeley and Solr committer
> Erik Hatcher.
>
> *Subscribe to participate live*
> .
> (*2PM ET today (19-APR)*)
>
> We use the blab  conversation platform, which will let you
> chat with us, Yonik, and Erik. So bring your tough Solr questions!
>
> Look forward to seeing you there. And if you're interested in past
> episodes, check them out .
>
> -Doug Turnbull
> http://opensourceconnections.com


Re: Overall large size in Solr across collections

2016-04-19 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Currently, the searching performance is still doing fine, but it is the
indexing that is slowing down. Not sure if increasing the RAM or changing
to an SSD hard disk will help with the indexing speed?

Regards,
Edwin


On 19 April 2016 at 21:57, Shawn Heisey  wrote:

> On 4/18/2016 8:50 PM, Zheng Lin Edwin Yeo wrote:
> > Thanks for your explanation.
> >
> > I have set my segment size to 20GB under the TieredMergePolicy
> >
> > <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >   <int name="maxMergeAtOnce">10</int>
> >   <int name="segmentsPerTier">10</int>
> >   <int name="maxMergedSegmentMB">20480</int>
> > </mergePolicy>
>
> That just controls the maximum size of a segment.  This defaults to
> 5GB.  When segments reach this size, they will not be auto-merged
> further.  If you do an optimize, the max segment size is ignored,
> and the whole index will be merged into one segment.
>
> > I do have 192GB of RAM on my server which Solr is running on.
>
> With 1TB of index data on the server, 192GB will not give you optimal
> performance, but if what you are getting is good enough for you, then
> you probably don't need to rush out and buy more memory.
>
> Thanks,
> Shawn
>
>


Re: Is there any JIRA changed the stored order of multivalued field?

2016-04-19 Thread forest_soup
Thanks! That's very helpful!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-any-JIRA-changed-the-stored-order-of-multivalued-field-tp4264325p4271312.html
Sent from the Solr - User mailing list archive at Nabble.com.


Is there any detailed condition on which the snapshot pull recovery will occur?

2016-04-19 Thread forest_soup
We have a SolrCloud with solr v5.3.2. 
collection1 contains 1 shard with 2 replicas on solr nodes: solr1 and solr2
respectively.
In solrconfig.xml, there is the following updateLog config, which has been
uploaded to ZK and is effective:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
  <int name="numRecordsToKeep">1000</int>
  <int name="maxNumLogsToKeep">100</int>
</updateLog>

We know that with these settings, if at first solr1 is down and solr2 is
active, and solr2 receives more than 1000 updates, then after solr1 is
restarted the recovery of the replica in solr1 will be a snapshot pull.

But we noticed a case with below steps:
1, At first solr1 and solr2 are active and both replicas has lots of data;
2, solr2 is shutdown;
3, update to solr1 with less than 1000 updates;
4, solr1 is shutdown;
5, the replica's data dir in solr2 is missing due to a bad device or
mis-deletion;
6, solr2 is startup;
7, update to solr2 with about 2 or 3 updates;
8, solr1 is startup;
9, we noticed both replicas in solr1 and solr2 have only the data from those
2 or 3 updates in step #7.
Lots of data was lost!

It seems the recovery in solr1 was a snapshot pull from solr2.
Our questions:
1, Is there any explanation for this case?
2, Is there any detailed condition under which the snapshot pull recovery will
occur?

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-any-detailed-condition-on-which-the-snapshot-pull-recovery-will-occur-tp4271311.html
Sent from the Solr - User mailing list archive at Nabble.com.


Live Podcast on Solr 6 with Yonik and Erik Hatcher (Today, 2pm ET)

2016-04-19 Thread Doug Turnbull
Hey Solristas:

We do a regular podcast called Search Disco
. Today we'll be discussing the
recent release of Solr 6 with Solr creator, Yonik Seeley and Solr committer
Erik Hatcher.

*Subscribe to participate live*
.
(*2PM ET today (19-APR)*)

We use the blab  conversation platform, which will let you
chat with us, Yonik, and Erik. So bring your tough Solr questions!

Look forward to seeing you there. And if you're interested in past
episodes, check them out .

-Doug Turnbull
http://opensourceconnections.com


Re: MiniSolrCloudCluster usage in solr 7.0.0

2016-04-19 Thread Shawn Heisey
On 4/19/2016 5:00 AM, Rohana Rajapakse wrote:
> Found the missing CloudSolrClient::Builder class in the master branch, and
> the code goes a bit further now. Still Solr cloud is not starting up. It is 
> failing to register Solr servers with Zookeeper.

This, combined with the earlier message about the Builder not being
found, sounds like you've got multiple versions of jars on your
classpath, so some components of Solr are using one version and other
components are using another version that is incompatible.

Thanks,
Shawn



Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-19 Thread Shawn Heisey
On 4/18/2016 11:22 AM, John Bickerstaff wrote:
> So - my IT guy makes the case that we don't really need Zookeeper / Solr
> Cloud...

> I'm biased in terms of using the most recent functionality, but I'm aware
> that bias is not necessarily based on facts and want to do my due
> diligence...
>
> Aside from the obvious benefits of spreading work across nodes (which may
> not be a big deal in our application and which my IT guy proposes is more
> transparently handled with a load balancer he understands) are there any
> other considerations that would drive a choice for Solr Cloud (zookeeper
> etc)?

Erick has a point.  If your IT guy feels comfortable with a load
balancer, he should go ahead and set that up.

For a new install like you're describing, I would probably still use
SolrCloud on the back end, even with a load balancer.

As Daniel said, a non-cloud replicated setup requires configuration of
masters and slaves.  Instead of replication, you could go with a build
system that sends updates to each copy of the index independently.

When using replication, switching master/slave roles in the event of a
master server failure is not trivial.  SolrCloud handles all that,
making multi-server management a LOT easier.  Initial setup is slightly
more complicated due to zookeeper, and configuration management requires
an "upload to zookeeper" step ... but I do not think these are not high
hurdles considering how much easier it is to manage multiple servers.

With the deployment you have described (which I trimmed out of this
reply), I think you'd be fine running a standalone zookeeper process on
three of your Solr servers, so you won't even need a bunch of extra
hardware.

When combining a load balancer with SolrCloud, the handler definitions
in solrconfig.xml should set preferLocalShards to true (which Tom
mentioned) so the load balancer target is the machine that actually
processes the request.  Troubleshooting becomes more difficult if you
don't do this, and avoiding the extra network hop will help performance.
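
In solrconfig.xml that looks something like this (the handler name is just an
example; preferLocalShards is the standard parameter):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="preferLocalShards">true</bool>
  </lst>
</requestHandler>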

Thanks,
Shawn



Re: Overall large size in Solr across collections

2016-04-19 Thread Shawn Heisey
On 4/18/2016 8:50 PM, Zheng Lin Edwin Yeo wrote:
> Thanks for your explanation.
>
> I have set my segment size to 20GB under the TieredMergePolicy
>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>   <int name="maxMergeAtOnce">10</int>
>   <int name="segmentsPerTier">10</int>
>   <int name="maxMergedSegmentMB">20480</int>
> </mergePolicy>

That just controls the maximum size of a segment.  This defaults to
5GB.  When segments reach this size, they will not be auto-merged
further.  If you do an optimize, the max segment size is ignored,
and the whole index will be merged into one segment.

> I do have 192GB of RAM on my server which Solr is running on.

With 1TB of index data on the server, 192GB will not give you optimal
performance, but if what you are getting is good enough for you, then
you probably don't need to rush out and buy more memory.

Thanks,
Shawn



Re: Indexing 700 docs per second

2016-04-19 Thread Susheel Kumar
It sounds achievable with your machine configuration, and I would suggest
trying out atomic updates.  Use SolrJ with multi-threaded indexing for a
higher indexing rate.
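
For illustration, a minimal SolrJ atomic-update sketch (the URL, collection
and field names are made up; ConcurrentUpdateSolrClient queues documents and
sends them from background threads):

import java.util.Collections;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

// queue size 1000, 4 sender threads
ConcurrentUpdateSolrClient client =
    new ConcurrentUpdateSolrClient("http://localhost:8983/solr/collection1", 1000, 4);
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
// atomic "set": only this field changes; other stored fields are preserved
doc.addField("price", Collections.singletonMap("set", 42));
client.add(doc);
client.blockUntilFinished();
client.close();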

Thanks,
Susheel



On Tue, Apr 19, 2016 at 9:27 AM, Tom Evans  wrote:

> On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson 
> wrote:
> > Hi,
> >
> > I have a requirement to index (mainly updates) 700 docs per second.
> > Suppose I have a 128GB RAM, 32 CPU machine, with each doc sized around
> > 260 bytes (6 fields, of which only 2 will be updated at the above
> > rate). This collection has around 122 million docs and that count is
> > pretty much a constant.
> >
> > 1. Can I manage this update rate with a non-sharded, i.e. single, Solr
> > instance setup?
> > 2. Also, is an atomic update or a full update (the whole doc) of the
> > changed records the better approach in this case?
> >
> > Could someone please share their views/experience?
>
> Try it and see - everyone's data/schemas are different and can affect
> indexing speed. It certainly sounds achievable enough - presumably you
> can at least produce the documents at that rate?
>
> Cheers
>
> Tom
>


Re: what is opening realtime Searcher

2016-04-19 Thread Yonik Seeley
On Mon, Apr 18, 2016 at 8:02 PM, Erick Erickson  wrote:
> This is about real-time get.

To clarify, it's used to handle real-time get type functionality in
general.  It's used internally in a couple ways, not just when a user
issues a "real-time get".

-Yonik


Re: Indexing 700 docs per second

2016-04-19 Thread Tom Evans
On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson  wrote:
> Hi,
>
> I have a requirement to index (mainly updates) 700 docs per second.
> Suppose I have a 128GB RAM, 32 CPU machine, with each doc sized around 260
> bytes (6 fields, of which only 2 will be updated at the above rate).
> This collection has around 122 million docs and that count is pretty
> much a constant.
>
> 1. Can I manage this update rate with a non-sharded, i.e. single, Solr
> instance setup?
> 2. Also, is an atomic update or a full update (the whole doc) of the
> changed records the better approach in this case?
>
> Could someone please share their views/experience?

Try it and see - everyone's data/schemas are different and can affect
indexing speed. It certainly sounds achievable enough - presumably you
can at least produce the documents at that rate?

Cheers

Tom


restore issue

2016-04-19 Thread Jan Verweij van searchXperts
Hi,
Just need to check the following.
Currently I am building a test environment with SolrCloud and 3 nodes. After
loading some data into Solr I did the following:
 1. created a backup using the replication handler, e.g.
http://localhost:8983/solr/[SHARDNAME]/replication?command=backup&location=/data/backup/solr
 2. cleared the index with a delete-by-query
 3. restored the index using the replication mechanism

So far so good, but then I tried the following:
 4. cleared the index again with a delete-by-query
 5. restored the index in the same way as in step 3 (with the same backup name
etc.)
Now the index remains empty, even after reloading or restarting the nodes.
I noticed the index location was changed from index to
/var/solr/data/products_shard8_replica2/data/restore.snapshot.products_shard8_replica2.201106072258
after running step 3, so the second restore may dump files into the same
directory while the change goes unnoticed.
Is this a bug?
I solved it by removing some files and directories, but still…
Cheers,
Jan Verweij

RE: MiniSolrCloudCluster usage in solr 7.0.0 - Got It Working!

2016-04-19 Thread Rohana Rajapakse
Thanks to Shawn Heisey and Chris Hostetter for your support. I finally got it
working. As you both pointed out, for the time being I will start with an
empty baseDir.

Best,
Rohana

-Original Message-
From: Rohana Rajapakse [mailto:rohana.rajapa...@gossinteractive.com] 
Sent: 19 April 2016 12:01
To: solr-user@lucene.apache.org
Subject: RE: MiniSolrCloudCluster usage in solr 7.0.0

Found the missing CloudSolrClient::Builder class in the master branch, and
the code goes a bit further now. Still Solr cloud is not starting up. It is 
failing to register Solr servers with Zookeeper.

Here is the stack trace:

java.lang.IllegalStateException: Solr servers failed to register with ZK. 
Current count: 0; Expected count: 1
at 
org.apache.solr.cloud.MiniSolrCloudCluster.<init>(MiniSolrCloudCluster.java:240)
at 
org.apache.solr.cloud.MiniSolrCloudCluster.<init>(MiniSolrCloudCluster.java:111)
at 
com.gossinteractive.solr.TestMiniSolrCloudCluster.setup(TestMiniSolrCloudCluster.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 

And here is the log file content:


11:51:19,688 INFO  ~ STARTING ZK TEST SERVER
11:51:19,690 INFO  ~ client port:0.0.0.0/0.0.0.0:0
11:51:19,690 INFO  ~ Starting server
11:51:19,727 INFO  ~ Server environment:zookeeper.version=3.4.8--1, built on 
02/06/2016 03:18 GMT
11:51:19,727 INFO  ~ Server environment:java.version=1.8.0_40
11:51:19,728 INFO  ~ Server 
environment:user.dir=C:\projects\gossprojects\goss-typeahead-solr6\goss-typeahead-solrhandlers
11:51:19,739 INFO  ~ Created server with tickTime 1000 minSessionTimeout 2000 
maxSessionTimeout 2 datadir 
C:\projects\gossprojects\goss-typeahead-solr6\goss-typeahead-solrhandlers\src\testdata\minicluster\zookeeper\server1\data\version-2
 snapdir 
C:\projects\gossprojects\goss-typeahead-solr6\goss-typeahead-solrhandlers\src\testdata\minicluster\zookeeper\server1\data\version-2
11:51:19,815 INFO  ~ binding to port 0.0.0.0/0.0.0.0:0
11:51:19,893 INFO  ~ start zk server on port:42470
11:51:19,901 INFO  ~ Using default ZkCredentialsProvider
11:51:19,909 INFO  ~ Client environment:zookeeper.version=3.4.8--1, built on 
02/06/2016 03:18 GMT
11:51:19,910 INFO  ~ Initiating client connection, 
connectString=127.0.0.1:42470 sessionTimeout=45000 
watcher=org.apache.solr.common.cloud.SolrZkClient$3@52d455b8
11:51:19,922 INFO  ~ Waiting for client to connect to ZooKeeper
11:51:19,924 INFO  ~ Opening socket connection to server 
127.0.0.1/127.0.0.1:42470. Will not attempt to authenticate using SASL (unknown 
error)
11:51:19,925 INFO  ~ Socket connection established to 
127.0.0.1/127.0.0.1:42470, initiating session
11:51:19,925 INFO  ~ Accepted socket connection from /127.0.0.1:42474
11:51:19,930 INFO  ~ Client attempting to establish new session at 
/127.0.0.1:42474
11:51:19,932 INFO  ~ Creating new log file: log.1
11:51:19,957 INFO  ~ Established session 0x1542e25578e with negotiated 
timeout 2 for client /127.0.0.1:42474
11:51:19,957 INFO  ~ Session establishment complete on server 
127.0.0.1/127.0.0.1:42470, sessionid = 0x1542e25578e, negotiated timeout = 
2
11:51:19,964 INFO  ~ Watcher 
org.apache.solr.common.cloud.ConnectionManager@3d1ad5cc 
name:ZooKeeperConnection Watcher:127.0.0.1:42470 got event WatchedEvent 
state:SyncConnected type:None path:null path:null type:None
11:51:19,964 INFO  ~ Client is connected to ZooKeeper
11:51:19,965 INFO  ~ Using default ZkACLProvider
11:51:19,966 INFO  ~ makePath: /solr/solr.xml
11:51:20,001 INFO  ~ Processed session termination for sessionid: 
0x1542e25578e
11:51:20,010 INFO  ~ Session: 0x1542e25578e closed
11:51:20,011 INFO  ~ Closed socket connection for client /12

Re: Storing different collection on different hard disk

2016-04-19 Thread Alexandre Rafalovitch
Have you tried setting the dataDir parameter in the core.properties file?
https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
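
For example, a minimal core.properties might look like this (note that Java
properties values take no quotes, and forward slashes also work on Windows):

name=collection1
dataDir=D:/collection1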

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 19 April 2016 at 20:43, Zheng Lin Edwin Yeo  wrote:
> Hi,
>
> I would like to find out if it is possible to store the index files of
> different collections on different hard disks.
> For example, I want to store the indexes of collection1 on Hard Disk 1,
> and the indexes of collection2 on Hard Disk 2.
>
> I am using Solr 5.4.0
>
> Regards,
> Edwin


RE: MiniSolrCloudCluster usage in solr 7.0.0

2016-04-19 Thread Rohana Rajapakse
Found the missing CloudSolrClient::Builder class in the master branch, and
the code goes a bit further now. Still Solr cloud is not starting up. It is 
failing to register Solr servers with Zookeeper.

Here is the stack trace:

java.lang.IllegalStateException: Solr servers failed to register with ZK. 
Current count: 0; Expected count: 1
at 
org.apache.solr.cloud.MiniSolrCloudCluster.<init>(MiniSolrCloudCluster.java:240)
at 
org.apache.solr.cloud.MiniSolrCloudCluster.<init>(MiniSolrCloudCluster.java:111)
at 
com.gossinteractive.solr.TestMiniSolrCloudCluster.setup(TestMiniSolrCloudCluster.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 

And here is the log file content:


11:51:19,688 INFO  ~ STARTING ZK TEST SERVER
11:51:19,690 INFO  ~ client port:0.0.0.0/0.0.0.0:0
11:51:19,690 INFO  ~ Starting server
11:51:19,727 INFO  ~ Server environment:zookeeper.version=3.4.8--1, built on 
02/06/2016 03:18 GMT
11:51:19,727 INFO  ~ Server environment:java.version=1.8.0_40
11:51:19,728 INFO  ~ Server 
environment:user.dir=C:\projects\gossprojects\goss-typeahead-solr6\goss-typeahead-solrhandlers
11:51:19,739 INFO  ~ Created server with tickTime 1000 minSessionTimeout 2000 
maxSessionTimeout 2 datadir 
C:\projects\gossprojects\goss-typeahead-solr6\goss-typeahead-solrhandlers\src\testdata\minicluster\zookeeper\server1\data\version-2
 snapdir 
C:\projects\gossprojects\goss-typeahead-solr6\goss-typeahead-solrhandlers\src\testdata\minicluster\zookeeper\server1\data\version-2
11:51:19,815 INFO  ~ binding to port 0.0.0.0/0.0.0.0:0
11:51:19,893 INFO  ~ start zk server on port:42470
11:51:19,901 INFO  ~ Using default ZkCredentialsProvider
11:51:19,909 INFO  ~ Client environment:zookeeper.version=3.4.8--1, built on 
02/06/2016 03:18 GMT
11:51:19,910 INFO  ~ Initiating client connection, 
connectString=127.0.0.1:42470 sessionTimeout=45000 
watcher=org.apache.solr.common.cloud.SolrZkClient$3@52d455b8
11:51:19,922 INFO  ~ Waiting for client to connect to ZooKeeper
11:51:19,924 INFO  ~ Opening socket connection to server 
127.0.0.1/127.0.0.1:42470. Will not attempt to authenticate using SASL (unknown 
error)
11:51:19,925 INFO  ~ Socket connection established to 
127.0.0.1/127.0.0.1:42470, initiating session
11:51:19,925 INFO  ~ Accepted socket connection from /127.0.0.1:42474
11:51:19,930 INFO  ~ Client attempting to establish new session at 
/127.0.0.1:42474
11:51:19,932 INFO  ~ Creating new log file: log.1
11:51:19,957 INFO  ~ Established session 0x1542e25578e with negotiated 
timeout 2 for client /127.0.0.1:42474
11:51:19,957 INFO  ~ Session establishment complete on server 
127.0.0.1/127.0.0.1:42470, sessionid = 0x1542e25578e, negotiated timeout = 
2
11:51:19,964 INFO  ~ Watcher 
org.apache.solr.common.cloud.ConnectionManager@3d1ad5cc 
name:ZooKeeperConnection Watcher:127.0.0.1:42470 got event WatchedEvent 
state:SyncConnected type:None path:null path:null type:None
11:51:19,964 INFO  ~ Client is connected to ZooKeeper
11:51:19,965 INFO  ~ Using default ZkACLProvider
11:51:19,966 INFO  ~ makePath: /solr/solr.xml
11:51:20,001 INFO  ~ Processed session termination for sessionid: 
0x1542e25578e
11:51:20,010 INFO  ~ Session: 0x1542e25578e closed
11:51:20,011 INFO  ~ Closed socket connection for client /127.0.0.1:42474 which 
had sessionid 0x1542e25578e
11:51:20,012 INFO  ~ EventThread shut down for session: 0x1542e25578e
11:51:20,024 INFO  ~ Logging initialized @617ms
11:51:20,102 INFO  ~ jetty-9.3.8.RC0
11:51:20,128 INFO  ~ Started 
o.e.j.s.ServletContextHandler@5ccad1a7{/solr,null,AVAILABLE}
11:51:20,135 INFO  ~ Started 
ServerConnector@5ba090a4{HTTP/1.1,[http/1.1]}{0.0.0.0:42475}
11

Storing different collection on different hard disk

2016-04-19 Thread Zheng Lin Edwin Yeo
Hi,

I would like to find out if it is possible to store the index files of
different collections on different hard disks.
For example, I want to store the indexes of collection1 on Hard Disk 1,
and the indexes of collection2 on Hard Disk 2.

I am using Solr 5.4.0

Regards,
Edwin
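
For reference, a minimal core.properties sketch of the per-core dataDir
approach (names and paths here are hypothetical; values in a Java properties
file take no surrounding quotes, and forward slashes avoid backslash-escape
trouble on Windows):

# core.properties for collection1; dataDir may point at another disk
name=collection1
dataDir=D:/solrdata/collection1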


Indexing 700 docs per second

2016-04-19 Thread Mark Robinson
Hi,

I have a requirement to index (mainly update) 700 docs per second.
Suppose I have a 128GB RAM, 32-CPU machine, with each doc around 260
bytes (6 fields, of which only 2 will be updated at the above rate).
This collection has around 122 million docs, and that count is pretty
much constant.

1. Can I manage this update rate with a non-sharded, i.e. single-instance,
Solr setup?
2. Also, is an atomic update or a full update (the whole doc) of the changed
records the better approach in this case?

Could someone please share their views/experience?

Thanks!
Mark.
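
On question 2, one consideration: atomic updates send only the changed fields,
but they require the schema to store all fields (copyField targets aside) so
Solr can reconstruct the document. A hedged SolrJ sketch of such an update,
using the pre-Builder HttpSolrClient constructor from the 5.x/6.0 line
(endpoint, id, and field names are hypothetical):

import java.util.Collections;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical core URL -- adjust to the real collection.
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-123");  // unique key of the existing document
        // "set" replaces the stored value; only the two changing fields are sent.
        doc.addField("status", Collections.singletonMap("set", "ACTIVE"));
        doc.addField("price", Collections.singletonMap("set", 42));

        client.add(doc);  // no explicit commit; leave visibility to the autoCommit settings
        client.close();
    }
}

Whether this beats sending the full 260-byte document is worth benchmarking: at
that size the transfer savings are small, and an atomic update still rewrites
the whole document internally.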


Re: Wildcard query behavior.

2016-04-19 Thread Modassar Ather
Yes! Wildcards are not analyzed. Thanks Shawn for reminding me.
Thanks Erick for your response.

Best,
Modassar

On Mon, Apr 18, 2016 at 8:53 PM, Erick Erickson 
wrote:

> Here's a blog on the subject:
>
> https://lucidworks.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
>
> bq: When validator is changed to validate, both at query time and index
> time,
> then should not validator*/validator return the same results at-least?
>
> This is one of those problems that's easy to state, but hard to solve. And
> there are so many variations that any attempt to solve it will _always_
> have lots of surprises. Simple example (and remember that the
> stemming is usually algorithmic). "validator" probably stems to "validat".
> However, "validato" (note the 'o') may not stem
> the same way at all, so searching for "validato*" wouldn't produce the
> expected response.
>
> Best,
> Erick
>
> On Mon, Apr 18, 2016 at 6:23 AM, Shawn Heisey  wrote:
> > On 4/18/2016 1:18 AM, Modassar Ather wrote:
> >> When I search for f:validator I get 80K+ documents whereas if I search
> for
> >> f:validator* I get only around 150 results.
> >>
> >> When I checked on analysis page I see that validator is changed to
> >> validate. Per my understanding in both the above cases it should
> at-least
> >> give the exact same result of around 80K+ documents.
> >
> > What Reth was trying to tell you, but did not state clearly, is that
> > when you use wildcards, your query is NOT analyzed -- none of your
> > filters, including the stemmer, are used.
> >
> > Thanks,
> > Shawn
> >
>
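
As a hedged illustration of what the linked post covers: a fieldType in
schema.xml may declare an explicit multiterm analyzer, which Solr applies to
wildcard, prefix, and fuzzy terms (these skip the normal query analyzer). That
gets you lowercasing on wildcard terms, though stemming still will not apply;
the names below are only an example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- applied to wildcard/prefix/fuzzy terms -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>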


RE: MiniSolrCloudCluster usage in solr 7.0.0

2016-04-19 Thread Rohana Rajapakse
After resolving another dependency, I now find that solr-solrj.jar is missing 
the "Builder" class in CloudSolrClient. I have checked both 
solr-solrj-6.0.0 and 7.0.0.

java.lang.NoClassDefFoundError: 
org/apache/solr/client/solrj/impl/CloudSolrClient$Builder
at 
org.apache.solr.cloud.MiniSolrCloudCluster.buildSolrClient(MiniSolrCloudCluster.java:449)
at 
org.apache.solr.cloud.MiniSolrCloudCluster.<init>(MiniSolrCloudCluster.java:248)
at 
org.apache.solr.cloud.MiniSolrCloudCluster.<init>(MiniSolrCloudCluster.java:111)
at 
com.gossinteractive.solr.TestMiniSolrCloudCluster.setup(TestMiniSolrCloudCluster.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.lang.ClassNotFoundException: 
org.apache.solr.client.solrj.impl.CloudSolrClient$Builder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 20 more
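
One plausible reading of this trace -- an assumption, not a confirmed
diagnosis -- is a version mismatch rather than a missing jar: a
MiniSolrCloudCluster from a 7.0.0-SNAPSHOT build expects CloudSolrClient's
Builder from its matching solr-solrj, so an older solrj earlier on the
classpath would produce exactly this NoClassDefFoundError. A hedged Maven
sketch for keeping the two artifacts in lockstep (version illustrative):

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>7.0.0-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-test-framework</artifactId>
  <version>7.0.0-SNAPSHOT</version>
  <scope>test</scope>
</dependency>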

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: 15 April 2016 21:01
To: solr-user@lucene.apache.org
Subject: Re: MiniSolrCloudCluster usage in solr 7.0.0

On 4/14/2016 8:32 AM, Rohana Rajapakse wrote:
> I have added a few dependency jars into my project. There are no compilation 
> errors or ClassNotFound exceptions, but Zookeeper exception " 
> KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for 
> /solr/solr.xml ". My temporary solrHome folder has a solr.xml.  No other 
> files (solrconfig.xml, schema.xml) are provided. I thought it should start 
> a SolrCloud server with defaults, but it doesn't. There are no other Solr or 
> ZooKeeper servers running on my machine. 

I looked at SolrCloudTestCase to see how MiniSolrCloudCluster should be used, 
then I wrote a little program and configured ivy to pull down 
solr-test-framework from 6.0.0 (getting ivy to work right was an adventure!).  
Based on what I found in SolrCloudTestCase, this is the code I wrote last 
evening:

import java.nio.file.Paths;

import org.apache.solr.client.solrj.embedded.JettyConfig;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.cloud.MiniSolrCloudCluster;

public class MiniSC
{
    static JettyConfig jettyConfig = null;
    static MiniSolrCloudCluster msc = null;
    static CloudSolrClient client = null;

    public static void main(String[] args) throws Exception
    {
        // Two Jetty-based Solr nodes; cluster state lives under ./testcluster
        jettyConfig = JettyConfig.builder().setContext("/solr").build();
        msc = new MiniSolrCloudCluster(2, Paths.get("testcluster"), jettyConfig);
        client = msc.getSolrClient();
        client.close();
        msc.shutdown();
    }
}

At first, I saw the same exception you got ... but after a little while I 
figured out that this is because I was running the program more than once 
without deleting everything in the baseDir -- so the zookeeper server was 
starting with an existing database already containing the solr.xml.  When 
MiniSolrCloudCluster is used in Solr tests, the baseDir is newly created for 
each test class, so this doesn't happen.

When I delete everything in "testcluster" and run my test code, I get the 
following in my logfile:

http://apaste.info/Dkw

There are no errors, only WARN and INFO logs.  At this point, I should be able 
to use the client object to upload a config to zookeeper, create a collection, 
and do other testing.

Thanks,
Shawn




Re: Can a field be an array of fields?

2016-04-19 Thread Bastien Latard - MDPI AG

Thank you Jack and Daniel, I somehow missed your answers.

Yes, I already thought about the JSON possibility, but I was more 
concerned with getting such a structure in the result:


"docs":[
   {
[...]
"authors_array":
 [  
[
"given_name":["Bastien"],
"last_name":["lastname1"]
 ],
[
"last_name":["lastname2"]
 ],
[
"given_name":["Matthieu"],
"last_name":["lastname2"]
 ],
[
"given_name":["Nino"],
"last_name":["lastname4"]
 ],
 ]
[...]


And being able to query like:
- q=authors_array.given_name:Nino
OR
- q=authors_array['given_name']:Nino

Is that possible?


Kind regards,
Bastien


On 15/04/2016 17:08, Jack Krupansky wrote:

It all depends on what your queries look like - what input data does your
application have and what data does it need to retrieve.

My recommendation is that you store first name and last name as separate,
multivalued fields if you indeed need to query by precisely a first or last
name, but also store the full name as a separate multivalued text field. If
you want to search by only first or last name, fine. If you want to search
by full name or wildcards, etc., you can use the full name field, using
phrase query. You can use an update request processor to combine first and
last name into that third field. You could also store the full name in a
fourth field as raw JSON if you really need structure in the result. The
third field might have first and last name with a special separator such as
"|", although a simple comma is typically sufficient.


-- Jack Krupansky

On Fri, Apr 15, 2016 at 10:58 AM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:


Short answer - JOINs, external query outside Solr, Elastic Search ;)
Alternatives:
   * You get back an id for each document when you query on "Nino".   You
look up the last names in some other system that has the full list.
   * You index the authors in another collection and use JOINs
   * You store the author_array as formatted, escaped JSON, stored, but not
indexed (or analyzed).   When you get the data back, you navigate the JSON
to the author_array, get the value, and parse that value as JSON.   Now you
have the full list (see the sketch just after this list).
   * This is a sweet spot for Elastic Search, to be perfectly honest.
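
A minimal sketch of that third option (field name hypothetical): declare the
blob field stored but not indexed in schema.xml, and write the serialized
author list into it at index time:

<!-- never searched, only returned; holds e.g.
     [{"given_name":"Bastien","last_name":"lastname1"},{"last_name":"lastname2"}] -->
<field name="authors_json" type="string" indexed="false" stored="true"/>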

-Original Message-
From: Bastien Latard - MDPI AG [mailto:lat...@mdpi.com.INVALID]
Sent: Friday, April 15, 2016 7:52 AM
To: solr-user@lucene.apache.org
Subject: Can a field be an array of fields?

Hi everybody!

I described a bit what I found in another thread, but I prefer to create a
new thread for this specific question... It's possible to create an array
of strings by doing (incomplete example):
- in the data-conf.xml:

<entity name="author" query="...">
    <field column="given_name" name="given_name" />
    <field column="last_name" name="last_name" />
    ...
</entity>


- in the schema.xml:

<field name="given_name" type="string" indexed="true" stored="true" multiValued="true" />
<field name="last_name" type="string" indexed="true" stored="true" multiValued="true" />

And this provides something like:

"docs":[
{
[...]
 "given_name":["Bastien",  "Matthieu",  "Nino"],
 "last_name":["lastname1", "lastname2",
  "lastname3",   "lastname4"],

[...]


Note: there can be one author with only a last_name, and then we are
unable to tell which one it is...

My goal would be to get this as a result:

"docs":[
{
[...]
 "authors_array":
  [
 [
 "given_name":["Bastien"],
 "last_name":["lastname1"]
  ],
 [
 "last_name":["lastname2"]
  ],
 [
 "given_name":["Matthieu"],
 "last_name":["lastname2"]
  ],
 [
 "given_name":["Nino"],
 "last_name":["lastname4"]
  ],
  ]
[...]


Is there any way to do this?
PS: I know that I could do 'select if(a.given_name is not null,
a.given_name, '') as given_name, [...]' but I would like to get an
array...

I tried to add something like the following to the schema.xml, but this
doesn't work (well, it might be because of the type 'array'):

<field name="authors_array" type="array" indexed="true" stored="true" multiValued="true" />

Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: what is opening realtime Searcher

2016-04-19 Thread Jaroslaw Rozanski
Hi Erick,

Thanks for the info. I was under the impression that we have the extra
setting "openSearcher" to control when searchers are opened.

From what you are saying, a searcher can be opened not only as a result of
a hard or soft commit.

What I observe, to follow your example:
T0 - everything is committed 
T1 - index document
T2 - opens realtime searchers
(time passes)
T3 - soft commit (commitScheduler)
T4 - opens searcher
(time passes)
T5 - hard commit (commitScheduler, openSearcher=false)
(time passes)
T6 - soft commit (commitScheduler)
T7 - opens searcher  

The T2 in the above example is what is unexpected. 

Having had a look at this thread
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201507.mbox/%3cd2f8751a-b16a-4736-9e03-50873711d...@gmail.com%3E
I was wondering if I had something misconfigured. 
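
For reference, a commit configuration along the lines described would look
roughly like this in solrconfig.xml (the times are placeholders, in
milliseconds):

<autoCommit>
  <maxTime>60000</maxTime>
  <!-- hard commit flushes segments and rolls the tlog, but opens no searcher -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <!-- soft commit opens the searcher that makes new docs visible -->
  <maxTime>30000</maxTime>
</autoSoftCommit>

The realtime searcher logged at T2 is a separate mechanism: as Erick's reply
below explains, it backs realtime get and sits outside both commit settings.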


Thanks,
Jarek

On Tue, 19 Apr 2016, at 01:02, Erick Erickson wrote:
> This is about real-time get. The idea is this. Suppose
> you have a doc doc1 already in your index at time T1
> and update it at time T2 and your soft commit happens
> at time T3.
> 
> If a search happens between time T1 and T2
> but the fetch happens between T2 and T3, you get
> back the updated document, not the doc that was in
> the index. So the realtime get is outside the
> soft and hard commit issues.
> 
> It's a pretty lightweight operation, no caches are invalidated
> or warmed etc.
> 
> Best,
> Erick
> 
> On Mon, Apr 18, 2016 at 9:59 AM, Jaroslaw Rozanski
>  wrote:
> > Hi,
> >
> >  What exactly triggers opening a new "realtime" searcher?
> >
> > 2016-04-18_16:28:02.33289 INFO  (qtp1038620625-13) [c:col1 s:shard1 
> > r:core_node3 x:col1_shard1_replica3] o.a.s.s.SolrIndexSearcher Opening 
> > Searcher@752e986f[col1_shard1_replica3] realtime
> >
> > I am seeing the above being triggered when adding documents to the index. The
> > frequency (from a few milliseconds to a few seconds) does not correlate with
> > maxTime of either autoCommit or autoSoftCommit (which are fixed to tens
> > of seconds).
> >
> > Client never sends commit message explicitly (and there is
> > IgnoreCommitOptimizeUpdateProcessorFactory in processor chain).
> >
> > Re: Solr 5.5.0
> >
> >
> >
> > Thanks,
> > Jarek
> >


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-19 Thread Charlie Hull

On 18/04/2016 18:22, John Bickerstaff wrote:

So - my IT guy makes the case that we don't really need Zookeeper / Solr
Cloud...

He may be right - we're serving static data (changes to the collection
occur only 2 or 3 times a year and are minor)

We probably could have 3 or 4 Solr nodes running in non-Cloud mode -- each
configured the same way, behind a load balancer and do fine.

I've got a Kafka server set up with the solr docs as topics.  It takes
about 10 minutes to reload a "blank" Solr Server from the Kafka topic...
If I target 3-4 SOLR servers from my microservice instead of one, it
wouldn't take much longer than 10 minutes to concurrently reload all 3 or 4
Solr servers from scratch...


This is something we've been discussing as a concept - to offload all 
the scaling stuff to Kafka (which is very good at that sort of thing) 
and simply hang Solr instances onto a Kafka topic. We've not taken it 
any further than a concept at this point but interesting to hear about 
others doing so!


Charlie



I'm biased in terms of using the most recent functionality, but I'm aware
that bias is not necessarily based on facts and want to do my due
diligence...

Aside from the obvious benefits of spreading work across nodes (which may
not be a big deal in our application and which my IT guy proposes is more
transparently handled with a load balancer he understands) are there any
other considerations that would drive a choice for Solr Cloud (zookeeper
etc)?



On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans  wrote:


On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
 wrote:

Thanks all - very helpful.

@Shawn - your reply implies that even if I'm hitting the URL for a single
endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
(I understand the caveat about that single endpoint being a potential point
of failure).  I just want to verify that I'm interpreting your response
correctly...

(I have been asked to provide IT with a comprehensive list of options prior
to a design discussion - which is why I'm trying to get clear about the
various options)

In a nutshell, I think I understand the following:

a. Even if hitting a single URL, the Solr Cloud will "balance" across all
available nodes for searching
   Caveat: That single URL represents a potential single point of
failure and this should be taken into account

b. SolrJ's CloudSolrClient API provides the ability to distribute load --
based on Zookeeper's "knowledge" of all available Solr instances.
   Note: This is more robust than "a" because it eliminates the
"single point of failure" (see the sketch at the end of this thread)

c.  Use of a load balancer hitting all known Solr instances will be fine -
although the search requests may not run on the Solr instance the load
balancer targeted - due to "a" above.

Corrections or refinements welcomed...


With option a), although queries will be distributed across the
cluster, all queries will be going through that single node. Not only
is that a single point of failure, but you risk saturating the
inter-node network traffic, possibly resulting in lower QPS and higher
latency on your queries.

With option b), as well as SolrJ, recent versions of pysolr have a
ZK-aware SolrCloud client that behaves in a similar way.

With option c), you can use the preferLocalShards parameter so that shards
that are local to the queried node are used in preference to distributed
shards. Depending on your shard/cluster topology, this can increase
performance if you are returning large amounts of data - many or large
fields or many documents.

Cheers

Tom






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
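
To make option (b) from this thread concrete, a hedged SolrJ sketch using the
ZooKeeper-aware client (ensemble addresses and collection name are
hypothetical; the single-string constructor matches the 5.x/6.0 line):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQuerySketch
{
    public static void main(String[] args) throws Exception
    {
        // The client watches ZK for cluster state and spreads requests over live nodes.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
        client.setDefaultCollection("collection1");

        QueryResponse rsp = client.query(new SolrQuery("*:*"));
        System.out.println("numFound=" + rsp.getResults().getNumFound());
        client.close();
    }
}

There is no single HTTP entry point here, so the single point of failure from
option (a) goes away.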


RE: MiniSolrCloudCluster usage in solr 7.0.0

2016-04-19 Thread Rohana Rajapakse
Tried again with an empty baseDir, and this time it's a different error. The 
error is thrown during the execution of the line:
msc = new MiniSolrCloudCluster(2, Paths.get("testcluster"), jettyConfig);

Here is the full stack trace:


org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: 
Could not connect to ZooKeeper 127.0.0.1:18015 within 45000 ms
at 
org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:181)
at 
org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:115)
at 
org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:105)
at 
org.apache.solr.cloud.MiniSolrCloudCluster.<init>(MiniSolrCloudCluster.java:197)
at 
org.apache.solr.cloud.MiniSolrCloudCluster.<init>(MiniSolrCloudCluster.java:111)
at 
com.gossinteractive.solr.TestMiniSolrCloudCluster.setup(TestMiniSolrCloudCluster.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.util.concurrent.TimeoutException: Could not connect to 
ZooKeeper 127.0.0.1:18015 within 45000 ms
at 
org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:228)
at 
org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:173)
... 21 more

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: 15 April 2016 21:01
To: solr-user@lucene.apache.org
Subject: Re: MiniSolrCloudCluster usage in solr 7.0.0

On 4/14/2016 8:32 AM, Rohana Rajapakse wrote:
> I have added a few dependency jars into my project. There are no compilation 
> errors or ClassNotFound exceptions, but Zookeeper exception " 
> KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for 
> /solr/solr.xml ". My temporary solrHome folder has a solr.xml.  No other 
> files (solrconfig.xml, schema.xml) are provided. I thought it should start 
> a SolrCloud server with defaults, but it doesn't. There are no other Solr or 
> ZooKeeper servers running on my machine. 

I looked at SolrCloudTestCase to see how MiniSolrCloudCluster should be used, 
then I wrote a little program and configured ivy to pull down 
solr-test-framework from 6.0.0 (getting ivy to work right was an adventure!).  
Based on what I found in SolrCloudTestCase, this is the code I wrote last 
evening:

import java.nio.file.Paths;

import org.apache.solr.client.solrj.embedded.JettyConfig;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.cloud.MiniSolrCloudCluster;

public class MiniSC
{
    static JettyConfig jettyConfig = null;
    static MiniSolrCloudCluster msc = null;
    static CloudSolrClient client = null;

    public static void main(String[] args) throws Exception
    {
        // Two Jetty-based Solr nodes; cluster state lives under ./testcluster
        jettyConfig = JettyConfig.builder().setContext("/solr").build();
        msc = new MiniSolrCloudCluster(2, Paths.get("testcluster"), jettyConfig);
        client = msc.getSolrClient();
        client.close();
        msc.shutdown();
    }
}

At first, I saw the same exception you got ... but after a little while I 
figured out that this is because I was running the program more than once 
without deleting everything in the baseDir -- so the zookeeper server was 
starting with an existing database already containing the solr.xml.  When 
MiniSolrCloudCluster is used in Solr tests, the baseDir is newly created for 
each test class, so this doesn't happen.

When I delete everything in "testcluster" and run my test code, I get the 
following in my logfile:

http://apaste.info/Dkw

There are no errors, only WARN and INFO logs.  At this point, I should be able 
to use the client object to upload a config to zookeeper, create a collection, 
and do other testing.

Thanks,
Shawn


