Re: Scaling SolrCloud

2016-01-20 Thread Erick Erickson
bq: 3 are to risky, you lost one you lost quorum

Typo? You need to lose two.

On Wed, Jan 20, 2016 at 6:25 AM, Yago Riveiro  wrote:
> Our Zookeeper cluster is an ensemble of 5 machines, is a good starting point,
> 3 are to risky, you lost one you lost quorum and with 7 sync cost increase.
>
>
>
> The ZK cluster is on machines with no other IO load and rotating HDDs (don't use
> SSDs to gain IO performance; ZooKeeper is optimized for spinning disks).
>
>
>
> The ZK cluster behaves without problems. The first deploy of ZK was on the
> same machines as the Solr cluster (ZK log on its own HDD) and that didn't
> work very well; the CPU and network IO from the Solr cluster was too much.
>
>
>
> About schema modifications.
>
> Modifying the schema to add new fields is relatively simple with the new API; in the
> past all the work was manual, uploading the schema to ZK and reloading all
> collections (indexing must be disabled or timeouts and funny errors happen).
>
> With the new Schema API this is more user friendly. Anyway, I still stop indexing
> and reload the collections (I don't know if it's necessary nowadays).
>
> About Indexing data.
>
>
>
> We have a self-made data importer; it's not Java and it doesn't perform batch indexing
> (with 500 collections, buffering data and building the batch is expensive and
> complicates error handling).
>
>
>
> We use regular HTTP POST with JSON. Our throughput is about 1000 docs/s without
> any type of optimization. Sometimes we have issues with replication: the replica
> can't keep pace with leader insertion and a full sync is requested. This is bad
> because syncing the replica again means a lot of IO wait and CPU, and with
> 100G replicas it takes an hour or more (normally when this happens, we disable
> indexing to release IO and CPU so we don't kill the node with a load of 50 or 60).
>
> In this department my advice is "keep it simple"; in the end it is an HTTP POST to
> a node of the cluster.
>
>
>
> --
>
> /Yago Riveiro
>
>> On Jan 20 2016, at 1:39 pm, Troy Edwards tedwards415...@gmail.com
> wrote:
>
>>
>
>> Thank you for sharing your experiences/ideas.
>
>>
>
>> Yago since you have 8 billion documents over 500 collections, can you share
> what/how you do index maintenance (e.g. add field)? And how are you loading
> data into the index? Any experiences around how Zookeeper ensemble behaves
> with so many collections?
>
>>
>
>> Best,
>
>>
>
>>
> On Tue, Jan 19, 2016 at 6:05 PM, Yago Riveiro yago.rive...@gmail.com
> wrote:
>
>>
>
>>  What I can say is:
> 
> 
>  * SSD (crucial for performance if the index doesn't fit in memory, and
>  it will not fit)
>  * Divide and conquer, for that volume of docs you will need more than 6
>  nodes.
>  * DocValues to not stress the java HEAP.
>  * Will you aggregate data? If yes, what is your max
>  cardinality? This question is the most important one for sizing the
>  memory needs correctly.
>  * Latency is important too, which threshold is acceptable before
>  consider a query slow?
>  At my company we are running a 12 terabytes (2 replicas) Solr cluster
> with
>  8
>  billion documents spread over 500 collections. For this we have about 12
>  machines with SSDs and 32G of RAM each (~24G for the heap).
> 
>  We don't have a strict need of speed, 30 second query to aggregate 100
>  million
>  documents with 1M of unique keys is fast enough for us, normally the
>  aggregation performance decrease as the number of unique keys increase,
>  with
>  low unique key factor, queries take less than 2 seconds if data is in OS
>  cache.
> 
>  Personal recommendations:
> 
>  * Sharding is important and smart sharding is crucial, you don't want
>  run queries on data that is not interesting (this slow down queries when
>  the dataset is big).
>  * If you want measure speed do it with about 1 billion documents to
>  simulate something real (real for 10 billion document world).
>  * Index with re-indexing in mind. with 10 billion docs, re-index data
>  takes months ... This is important if you don't use regular features of
>  Solr. In my case I configured Docvalues with disk format (not standard
>  feature in 4.x) and at some point this format was deprecated. Upgrade
> Solr
>  to 5.x was an epic 3 months battle to do it without full downtime.
>  * Solr is like your girlfriend, will demand love and care and plenty of
>  space to full-recover replicas that in some point are out of sync, happen
> a
>  lot restarting nodes (this is annoying with replicas with 100G), don't
>  underestimate this point. Free space can save your life.
> 
>  --
> 
>  /Yago Riveiro
> 
>   On Jan 19 2016, at 11:26 pm, Shawn Heisey
> <apa...@elyograg.org>
>  wrote:
> 
>  
> 
>   On 1/19/2016 1:30 PM, Troy Edwards wrote:
>  > We are currently "beta testing" a SolrCloud with 2 nodes and 2 shards with
>  > 2 replicas each. The number of documents is about 125000.
>  >
>  > We now want to scale this to about 10 billion documents.
>  >
>  > What are the steps to prototyping, 

Re: Using Solr's spatial functionality for astronomical catalog

2016-01-20 Thread david.w.smi...@gmail.com
Hello Colin,

If the spatial field you use is the SpatialRecursivePrefixTreeFieldType one
(RPT for short) with geo="true" then the circle shape (i.e. point-radius
filter) implied by the geofilt Solr QParser is on a sphere.  That is, it
uses the "great circle" distance computed using the Haversine formula by
default, though it can be configured to use the Law of Cosines formula or
Vincenty (spherical version) formula if you so choose.  Using geodist() for
spatial distance sorting/boosting also uses this.  If you use LatLonType
then geofilt & geodist() use Haversine too.
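
For concreteness, a typical RPT field definition and point-radius filter look
something like this (field names and parameter values are illustrative, assuming
Solr 5.x, where geo="true" plus distanceUnits="kilometers" makes d a distance in km):

  <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
             geo="true" distErrPct="0.025" maxDistErr="0.001"
             distanceUnits="kilometers"/>
  <field name="position" type="location_rpt" indexed="true" stored="true"/>

  fq={!geofilt sfield=position pt=48.5,-120.2 d=5}
  sfield=position&pt=48.5,-120.2&sort=geodist() asc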

If you use polygons or line strings, then it's *not* using a spherical
model; it's using a Euclidean (flat) model on plate carrée.  I am currently
working on adapting the Spatial4j library to work with Lucene's Geo3D (aka
spatial 3d) which has both a spherical model and an ellipsoidal model,
which can be configured with the characteristics specified by WGS84.  If
you are super-eager to get this yourself without waiting, then you could
write a Solr QParser that constructs a Geo3dShape wrapping a Geo3D GeoShape
object constructed from query parameters.  You might alternatively try and
use Geo3DPointField on Lucene 6 trunk.

~ David

On Tue, Jan 19, 2016 at 11:07 AM Colin Freas  wrote:

>
> Greetings!
>
> I have recently stood up an instance of Solr, indexing a catalog of about
> 100M records representing points on the celestial sphere.  All of the
> fields are strings, floats, and non-spatial types.  I’d like to convert the
> positional data to an appropriate spatial point data type supported by Solr.
>
> I have a couple of questions about indexing spatial data using Solr, since
> it seems spatial4j, and the spatial functionality in Solr generally, is
> more GIS geared.  I worry that the measurements of lat/long on the
> imperfect sphere of the Earth wouldn’t match up with the astronomical right
> ascension/declination concept of the perfectly spherical celestial sphere
> used to record the coordinates of our records.
>
> I’m also worried there might be other assumptions built into spatial4j &
> Solr based on using a real surface vs a virtual one.
>
> Does anyone have experience doing this, or is there perhaps some
> documentation specific to this use case that anyone might be able to point
> me to?
>
> Thanks in advance,
> Colin
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: schemaless vs schema based core

2016-01-20 Thread Erick Erickson
I would really avoid schemaless in _any_ situation where I know the
schema ahead of time.

bq: But in my case, I am planning to use solrj (so, no spelling mistakes)

Oh, I'm quite sure there'll be some kind of mistake sometime ;) I know
of at least one situation where a programming mistake in SolrJ
caused over 20K unique dynamic fields to be created. Admittedly, not a
spelling mistake.

But ranting aside, let's draw a clear distinction between schemaless
and managed schema on the one hand and classic on the other.

Both schemaless and managed schema use the same underlying mechanism
to change your schema file, specifically the REST API. The difference
is that _you_ need to issue the REST API commands in "managed schema"
yourself (or script or whatever). "schemaless" mode issues those REST
API commands for you whenever the update processor sees a field it
doesn't recognize, after guessing what kind of field it is.
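
For example, a minimal sketch of issuing one of those commands yourself with the
Schema API (collection and field names here are hypothetical):

  curl -X POST -H 'Content-Type: application/json' \
    http://localhost:8983/solr/mycollection/schema -d '{
      "add-field": {
        "name": "category",
        "type": "string",
        "indexed": true,
        "stored": true
      }
    }'

In managed-schema mode you (or a script) issue this call; in schemaless mode the
update processor issues it for you.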

Classic, of course, requires you to hand-edit a text file and upload
it to SolrCloud and reload collections for changes to take effect.

I tend to prefer classic when I know up-front exactly what my schema
should be. In fact, I tend to strip everything out of the schema.xml
file I know I don't need including dynamic field definitions,
copyfields and the like. Like Shawn, I want my docs to fail if they
don't conform to my schema ASAP.

Managed is ideal for situations where you have some UI front-end that
allows end users (or administrators) to define a schema and don't want
them to muck around with hand-editing files.

Schemaless is very cool, but IMO not something I'd go to production
with, especially at scale. It's way cool for starting out, but as the
scale grows you want to squeeze out all the unessential bits of the
index you can, and schemaless doesn't have the "meta-knowledge" you
have (or at least should have) about the problem space.


bq: Another thing to keep in mind is, I am pushing documents to solr
from some random/unknown source and they are not getting stored on
separate disc

This is pretty scary. How are you controlling what fields get indexed?
You mentioned SolrJ, so I'm presuming you have a mechanism to map all
the information (meta-data included) you get from those random/unknown
sources into your known schema?

FWIW,
Erick

On Wed, Jan 20, 2016 at 10:03 AM, Shawn Heisey  wrote:
> On 1/20/2016 10:17 AM, Prateek Jain J wrote:
>>
>> What all I could gather from various blogs is, defining schema stops
>> developers from accidently adding fields to solr. But in my case, I am
>> planning to use solrj (so, no spelling mistakes). My point is:
>>
>>
>> 1.   Is there any advantage like performance or anything else while
>> reading or writing or querying, if we go schema way?
>>
>> 2.   What impact it can have on maintainability of project?
>>
>> Another thing to keep in mind is, I am pushing documents to solr from some
>> random/unknown source and they are not getting stored on separate disc
>> (using solr for indexing and storing). By this what I mean is, re-indexing
>> is not an option for me.  Starting schemaless might give me a quick start
>> for project but, is there a fine print that is getting missed? Any
>> inputs/experiences/pointers are welcome.
>
>
> There is no performance difference.  With a managed schema, there is still a
> schema file in the config, it just has a different filename and can be
> changed remotely.  Internally, I am pretty sure that the java objects are
> identical.
>
> I personally would not want to have a managed schema or run in schemaless
> mode in production.  I do not want it to be possible for anybody else to
> change the config.
>
> Thanks,
> Shawn
>


collection aliasing,solrctl

2016-01-20 Thread vidya
Hi 
 
I am using Solr with the Cloudera distribution to index data from HDFS, and I am
using the "solrctl" utility for my deployment. Now I want to create a collection
alias. How can I perform the action of creating a collection alias from the
command line?

From Google I got: " /admin/collections?action=CREATE " . How do I achieve this
with solrctl, or with any other command from the terminal?

Thanks in advance
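
For reference, the Collections API action for creating an alias is CREATEALIAS
(rather than CREATE, which creates a collection), and it can be invoked with a
plain HTTP call from the terminal; the names below are hypothetical:

  curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=myalias&collections=collection1,collection2"

Whether a given solrctl version wraps this call is worth checking in the Cloudera
documentation; calling the Collections API directly works either way.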





Re: Solr trying to auto-update schema.xml

2016-01-20 Thread Bob Lawson
Thanks, I was using an invalid field type.  All is good now.  Thanks Hoss
and Eric.  You guys are the best!

On Tue, Jan 19, 2016 at 6:47 PM, Chris Hostetter 
wrote:

>
> : Thanks, very helpful.  I think I'm on the right track now, but when I do
> a
> : post now and my UpdateRequestProcessor extension tries to add a field to
> a
> : document, I get:
> :
> : RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1]
> : Error adding field 'myField'='2234543'
> :
> : The call I'm making is SolrInputDocument.addField(field name, value).  Is
> : that trying to add a field to the schema.xml?  The field (myField) is
> : already defined in schema.xml.  By calling SolrInputDocument.addField(),
> my
> : goal is to add the field to the document and give it a value.
>
> what is the full stack trace of that error in your logs?
>
> it's not indicating that it's trying to add a *schema* field named
> "myField", it's saying that it's trying to add a *document* field with the
> name 'myField' and the value '2234543' and some sort of problem is
> occurring -- it may be because the schema doesn't have that field, or
> because the FieldType of myField complained that the value wasn't valid
> for that type, etc...
>
> the stack trace has the answers.
>
>
> -Hoss
> http://www.lucidworks.com/
>
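
For context, a minimal sketch of what adding a document field from a custom
UpdateRequestProcessor usually looks like (the field name and value are
hypothetical, and the field must already exist in the schema with a type that
accepts the value):

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    // adds a field to the document being indexed, not to the schema
    doc.addField("myField", 2234543L);
    super.processAdd(cmd);
  }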


Re: Returning all documents in a collection

2016-01-20 Thread Joel Bernstein
The limitations of the /export handler should already be documented.

Lots of documentation is still to do for Solr 6 around Streaming Expressions,
and some is left to do on SQL. The SQL interface in Solr 6 can also select and
sort entire result sets as it's built on top of the Streaming API.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jan 20, 2016 at 10:37 AM, Jack Krupansky 
wrote:

> It would be nice to have an explicit section in the doc on the topic of
> "Dealing with Large Result Sets" to point people to the various approaches
> (paging, caching, export, streaming expressions, and how to select the best
> one for a given use case.)
>
> (And Joel is going to promise to update the doc for this stored field
> restriction, right?!)
>
> -- Jack Krupansky
>
> On Wed, Jan 20, 2016 at 9:38 AM, Joel Bernstein 
> wrote:
>
> > CloudSolrStream is available in Solr 5. The "search" streaming expression
> > can used or CloudSolrStream can be used in directly.
> >
> > https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> >
> > The export handler does not export stored fields though. It only exports
> > fields using DocValues caches. So you may need to re-index your data to
> use
> > this feature.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Wed, Jan 20, 2016 at 9:29 AM, Salman Ansari 
> > wrote:
> >
> > > Thanks Emir, Susheel and Jack for your responses. Just to update, I am
> > > using Solr Cloud plus I want to get the data completely without
> > pagination
> > > or cursor (I mean in one shot). Is there a way to do this in Solr?
> > >
> > > Regards,
> > > Salman
> > >
> > > On Wed, Jan 20, 2016 at 4:49 PM, Jack Krupansky <
> > jack.krupan...@gmail.com>
> > > wrote:
> > >
> > > > Yes, Exporting Results Sets is the preferred and recommended
> technique
> > > for
> > > > returning all documents in a collection, or even simply for queries
> > that
> > > > select a large number of documents, all of which are to be returned.
> It
> > > > uses efficient streaming rather than paging.
> > > >
> > > > But... this great feature currently does not have support for
> > > > distributed/SolrCloud mode:
> > > > "The initial release treats all queries as non-distributed requests.
> So
> > > the
> > > > client is responsible for making the calls to each Solr instance and
> > > > merging the results.
> > > > Using SolrJ’s CloudSolrClient as a model, developers could build
> > clients
> > > > that automatically send requests to all the shards in a collection
> (or
> > > > multiple collections) and then merge the sorted sets any way they
> > wish."
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Wed, Jan 20, 2016 at 8:41 AM, Susheel Kumar <
> susheel2...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello Salman,
> > > > >
> > > > > Please checkout the export functionality
> > > > >
> > https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
> > > > >
> > > > > Thanks,
> > > > > Susheel
> > > > >
> > > > > On Wed, Jan 20, 2016 at 6:57 AM, Emir Arnautovic <
> > > > > emir.arnauto...@sematext.com> wrote:
> > > > >
> > > > > > Hi Salman,
> > > > > > You should use cursors in order to avoid "deep paging issues".
> > Take a
> > > > > look
> > > > > > at
> > > > >
> > https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> > > .
> > > > > >
> > > > > > Regards,
> > > > > > Emir
> > > > > >
> > > > > > --
> > > > > > Monitoring * Alerting * Anomaly Detection * Centralized Log
> > > Management
> > > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 20.01.2016 12:55, Salman Ansari wrote:
> > > > > >
> > > > > >> Hi,
> > > > > >>
> > > > > >> I am looking for a way to return all documents from a
> collection.
> > > > > >> Currently, I am restricted to specifying the number of rows
> using
> > > > > Solr.NET
> > > > > >> but I am looking for a better approach to actually return all
> > > > documents.
> > > > > >> If
> > > > > >> I specify a huge number such as 1M, the processing takes a long
> > > time.
> > > > > >>
> > > > > >> Any feedback/comment will be appreciated.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Salman
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Solr UninvertingReader getNumericDocValues doesn't seem to work for fields that are not stored or indexed

2016-01-20 Thread plbarrios
Joel,

Thank you for the reply!

This approach solved my problem.

Now should I be concerned about the 32 bits that are lost in converting the
long to an int? Also, is this the intended approach when using
NumericDocValues?





Re: solr score threashold

2016-01-20 Thread Walter Underwood
The ScoresAsPercentages page is not really instructions for how to normalize 
scores. It is an explanation of why a score threshold does not do what you want.

Don’t use thresholds. If you want thresholds, you will need a search engine 
with a probabilistic model, like Verity K2. Those generally give worse results 
than a vector space model, but you can have thresholds.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 20, 2016, at 5:11 AM, Emir Arnautovic  
> wrote:
> 
> Hi Sara,
> You can use funct and frange to achive needed, but note that scores are not 
> normalized meaning score 8 does not mean it is good match - it is just best 
> match. There are examples online how to normalize score (e.g. 
> http://wiki.apache.org/lucene-java/ScoresAsPercentages).
> Other approach is to write custom component that will filter out docs below 
> some threshold.
> 
> Thanks,
> Emir
> 
> On 20.01.2016 13:58, sara hajili wrote:
>> hi all,
>> i wanna to know about solr search relevency scoreing threashold.
>> can i change it?
>> i mean immagine when i searching i get this result
>> doc1 score =8
>> doc2 score =6.4
>> doc3 score=6
>> doc8score=5.5
>> doc5 score=2
>> i wana to change solr score threashold .in this way i set threashold for
>> example >4
>> and then i didn't get doc5 as result.can i do this?if yes how?
>> and if not how i can modified search to don't get docs as a result that
>> these docs have a lot distance from doc with max score?
>> in other word i wanna to delete this gap between solr results
>> 
> 
> -- 
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
> 



FileBased Spellcheck on Solr cloud

2016-01-20 Thread Riyaz
Hi,

*Environment: *
Solr-4.10.4
tomcat6
Solr Cloud - 6 shards and 6 replicas with external zookeeper ensemble

We are configuring a file-based spellcheck component on SolrCloud. The source
file for dictionary generation has 5 million text entries. Since the Solr
configurations (including spellings_xxx.txt) are distributed by ZooKeeper to
all the nodes in the cloud, will the spellcheck dictionary be built on all the
leaders/replicas from the same source file? Can you please share whether there is
a better way to configure this?

The source file (spellings_xxx.txt) will change frequently, and the spellcheck
dictionary has to be rebuilt accordingly.

Thanks
Riyaz





Re: FileBased Spellcheck on Solr cloud

2016-01-20 Thread Binoy Dalal
One thing you could do is index your entire spellcheck file into Lucene as
string values. That way your index will be available across the cloud and
you can build your dictionary from the indexed field. This will however
mean that every time you change the spellcheck file, you will need to
reindex. If the number of updates to this dictionary file is small,
you can simply write a routine that uses atomic updates to update your
index directly and save yourself the trouble of doing a full reindex.
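
For example, an atomic update that appends one new term to a multivalued
dictionary field could look like this (collection, id and field name are
hypothetical; note that atomic updates require the document's other fields to be
stored or have docValues):

  curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/mycollection/update?commit=true' -d '
  [{"id": "dictionary-1", "spell_terms": {"add": "newterm"}}]'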

On Wed, Jan 20, 2016 at 8:47 PM Riyaz 
wrote:

> Hi,
>
> *Environment: *
> Solr-4.10.4
> tomcat6
> Solr Cloud - 6 shards and 6 replicas with external zookeeper ensemble
>
> We are configuring Filebased spellcheck component on Solr Cloud. The source
> file for dictionary generation having 5 million text entries. Since the
> solr
> configurations(including spellings_xxx.txt) are distributed by zookeeper to
> all the nodes in the cloud. Is it built the spellcheck dictionary on all
> the
> leaders/replicas using same source file?. Can you please share is there any
> better way to configure the same?.
>
> The source file(spellings_xxx.txt) will be changed frequently and have to
> build the spellcheck dictionary accordingly.
>
> Thanks
> Riyaz
>
>
>
>
-- 
Regards,
Binoy Dalal


Re: Solr UninvertingReader getNumericDocValues doesn't seem to work for fields that are not stored or indexed

2016-01-20 Thread Yonik Seeley
On Wed, Jan 20, 2016 at 10:19 AM, plbarrios  wrote:
> Joel,
>
> Thank you for the reply!
>
> This approach solved my problem.
>
> Now should I be concerned about the 32 bits that are lost in converting the
> long to an int? Also, is this the intended approach when using
> NumericDocValues?

If the value was originally a float, then no bits are lost; there
were only 32 to begin with.
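
For example, reading such a value back looks roughly like this (a sketch, assuming
the field really was indexed as a float):

  long raw = docValues.get(docId);               // NumericDocValues hands back the raw bits as a long
  float value = Float.intBitsToFloat((int) raw); // the narrowing cast is safe: a float only has 32 bits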

-Yonik


schemaless vs schema based core

2016-01-20 Thread Prateek Jain J

Hi,

I have just started to play around with Solr's capabilities. I have a basic question
(I couldn't get a clear answer by searching the internet). I am working on an
application which has very basic query requirements, like searching on uniqueID
or date range (no faceting or NLP), and I have all the information required
to define the schema file for this Solr core. Now my question: is there any
advantage to using schema-based Solr over schemaless?

What I could gather from various blogs is that defining a schema stops developers
from accidentally adding fields to Solr. But in my case, I am planning to use
SolrJ (so, no spelling mistakes). My point is:


1.   Is there any advantage like performance or anything else while reading 
or writing or querying, if we go schema way?

2.   What impact it can have on maintainability of project?

Another thing to keep in mind is that I am pushing documents to Solr from some
random/unknown source and they are not getting stored on a separate disc (I'm using
Solr for both indexing and storing). What I mean by this is that re-indexing is not an
option for me. Starting schemaless might give me a quick start for the project,
but is there some fine print that is getting missed? Any
inputs/experiences/pointers are welcome.

Regards,
Prateek




Re: solr score threashold

2016-01-20 Thread Doug Turnbull
What problem are you trying to solve?

If you're trying to cut out "bad" results, I might suggest explicitly using
filters that eliminate undesirable search items in terms that are
meaningful to how your users evaluate relevance.

For example, let's say your users only want items that have at least one
match in the title. One natural way to do this is to create a filter query
like fq={!edismax qf=title mm=1 v=$q} (where q is the user's plaintext
query). That's just an example, maybe you'd like to have some other
criteria for cutting out poor results? Use a filter query and express what
you need to trim out to Solr :)

-Doug




On Wed, Jan 20, 2016 at 7:58 AM, sara hajili  wrote:

> hi all,
> i wanna to know about solr search relevency scoreing threashold.
> can i change it?
> i mean immagine when i searching i get this result
> doc1 score =8
> doc2 score =6.4
> doc3 score=6
> doc8score=5.5
> doc5 score=2
> i wana to change solr score threashold .in this way i set threashold for
> example >4
> and then i didn't get doc5 as result.can i do this?if yes how?
> and if not how i can modified search to don't get docs as a result that
> these docs have a lot distance from doc with max score?
> in other word i wanna to delete this gap between solr results
>



-- 
Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983
Author: Relevant Search
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Solrcloud getting warning "missed update"

2016-01-20 Thread Mugeesh Husain
Hello,

I am sharing a warning image, please find/check this:
[image attachment not preserved in the archive]

Could anyone share an idea about the above warning?





Re: FieldCache

2016-01-20 Thread Yonik Seeley
On Thu, Jan 14, 2016 at 2:43 PM, Lewin Joy (TMS)  wrote:
> Thanks for the reply.
> But, the grouping on multivalued is working for me even with multiple data in 
> the multivalued field.
> I also tested this on the tutorial collection from the later solr version 
> 5.3.1 , which works as well.

Older versions of Solr would happily populate a FieldCache entry with
a multi-valued field by overwriting old values with new values while
uninverting.  Thus the FieldCache entry (used for sorting, faceting,
grouping, function queries, etc) would contain just the last/highest
value for any document.
So that sort-of explains how it was working in the past I think...
probably not how you intended.

If it works sometimes, but not other times, it may be due to details
of the request that cause one code path to be executed vs another, and
you hit a path where the check is done vs not.  The check looks at the
schema only.

For example, in StrField.java:

  @Override
  public ValueSource getValueSource(SchemaField field, QParser parser) {
field.checkFieldCacheSource(parser);
return new StrFieldSource(field.getName());
  }

There are different implementations of grouping... and only some go
through a ValueSource I believe... and those are the only ones that
would check to see if the field was single valued.  The grouping code
started in Solr, but was refactored and moved to Lucene, and I'm no
longer that familiar with it.

-Yonik


Re: Returning all documents in a collection

2016-01-20 Thread Jack Krupansky
It would be nice to have an explicit section in the doc on the topic of
"Dealing with Large Result Sets" to point people to the various approaches
(paging, caching, export, streaming expressions, and how to select the best
one for a given use case.)

(And Joel is going to promise to update the doc for this stored field
restriction, right?!)

-- Jack Krupansky

On Wed, Jan 20, 2016 at 9:38 AM, Joel Bernstein  wrote:

> CloudSolrStream is available in Solr 5. The "search" streaming expression
> can used or CloudSolrStream can be used in directly.
>
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>
> The export handler does not export stored fields though. It only exports
> fields using DocValues caches. So you may need to re-index your data to use
> this feature.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Jan 20, 2016 at 9:29 AM, Salman Ansari 
> wrote:
>
> > Thanks Emir, Susheel and Jack for your responses. Just to update, I am
> > using Solr Cloud plus I want to get the data completely without
> pagination
> > or cursor (I mean in one shot). Is there a way to do this in Solr?
> >
> > Regards,
> > Salman
> >
> > On Wed, Jan 20, 2016 at 4:49 PM, Jack Krupansky <
> jack.krupan...@gmail.com>
> > wrote:
> >
> > > Yes, Exporting Results Sets is the preferred and recommended technique
> > for
> > > returning all documents in a collection, or even simply for queries
> that
> > > select a large number of documents, all of which are to be returned. It
> > > uses efficient streaming rather than paging.
> > >
> > > But... this great feature currently does not have support for
> > > distributed/SolrCloud mode:
> > > "The initial release treats all queries as non-distributed requests. So
> > the
> > > client is responsible for making the calls to each Solr instance and
> > > merging the results.
> > > Using SolrJ’s CloudSolrClient as a model, developers could build
> clients
> > > that automatically send requests to all the shards in a collection (or
> > > multiple collections) and then merge the sorted sets any way they
> wish."
> > >
> > > -- Jack Krupansky
> > >
> > > On Wed, Jan 20, 2016 at 8:41 AM, Susheel Kumar 
> > > wrote:
> > >
> > > > Hello Salman,
> > > >
> > > > Please checkout the export functionality
> > > >
> https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
> > > >
> > > > Thanks,
> > > > Susheel
> > > >
> > > > On Wed, Jan 20, 2016 at 6:57 AM, Emir Arnautovic <
> > > > emir.arnauto...@sematext.com> wrote:
> > > >
> > > > > Hi Salman,
> > > > > You should use cursors in order to avoid "deep paging issues".
> Take a
> > > > look
> > > > > at
> > > >
> https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> > .
> > > > >
> > > > > Regards,
> > > > > Emir
> > > > >
> > > > > --
> > > > > Monitoring * Alerting * Anomaly Detection * Centralized Log
> > Management
> > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > >
> > > > >
> > > > >
> > > > > On 20.01.2016 12:55, Salman Ansari wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> I am looking for a way to return all documents from a collection.
> > > > >> Currently, I am restricted to specifying the number of rows using
> > > > Solr.NET
> > > > >> but I am looking for a better approach to actually return all
> > > documents.
> > > > >> If
> > > > >> I specify a huge number such as 1M, the processing takes a long
> > time.
> > > > >>
> > > > >> Any feedback/comment will be appreciated.
> > > > >>
> > > > >> Regards,
> > > > >> Salman
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>


Re: schemaless vs schema based core

2016-01-20 Thread Shawn Heisey

On 1/20/2016 10:17 AM, Prateek Jain J wrote:

What all I could gather from various blogs is, defining schema stops developers 
from accidently adding fields to solr. But in my case, I am planning to use 
solrj (so, no spelling mistakes). My point is:


1.   Is there any advantage like performance or anything else while reading 
or writing or querying, if we go schema way?

2.   What impact it can have on maintainability of project?

Another thing to keep in mind is, I am pushing documents to solr from some 
random/unknown source and they are not getting stored on separate disc (using 
solr for indexing and storing). By this what I mean is, re-indexing is not an 
option for me.  Starting schemaless might give me a quick start for project 
but, is there a fine print that is getting missed? Any 
inputs/experiences/pointers are welcome.


There is no performance difference.  With a managed schema, there is 
still a schema file in the config, it just has a different filename and 
can be changed remotely.  Internally, I am pretty sure that the java 
objects are identical.


I personally would not want to have a managed schema or run in 
schemaless mode in production.  I do not want it to be possible for 
anybody else to change the config.


Thanks,
Shawn



Re: ramBufferSizeMB and maxIndexingThreads

2016-01-20 Thread Emir Arnautovic
Kind of obvious/logical, but I've seen some people forget that it is per
core - if a single node hosts multiple shards, each will take 100MB.
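
For reference, these are per-core settings in the indexConfig section of
solrconfig.xml; a sketch with example values only (maxIndexingThreads is the
setting from the original question and may be deprecated or ignored in newer
releases):

  <indexConfig>
    <ramBufferSizeMB>100</ramBufferSizeMB>
    <maxIndexingThreads>8</maxIndexingThreads>
  </indexConfig>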


Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 20.01.2016 07:02, Shalin Shekhar Mangar wrote:

ramBufferSizeMB is independent of the maxIndexingThreads. If you set
it to 100MB then any lucene segment (or part of a segment) exceeding
100MB will be flushed to disk.

On Wed, Jan 20, 2016 at 3:50 AM, Angel Todorov  wrote:

hi guys,

quick question - is the ramBufferSizeMB the maximum value no matter how
maxIndexingThreads I have, or is it multiplied by the number if indexing
threads? So, if I  have ramBufferSizeMB set to 100 MB, and 8 indexing
threads, does this mean the total ram buffer will be 100 MB or 800 MB ?

Thanks
Angel







Re: Position increment in WordDelimiterFilter.

2016-01-20 Thread Alessandro Benedetti
On 19 January 2016 at 05:41, Modassar Ather  wrote:

> Thanks Shawn for your explanation.
>
> Everything else about the analysis looks
> correct to me, and the positions you see are needed for a phrase query
> to work correctly.
>
> Here the "WiFi device" will not be searched as there is a gap in between
> because Fi is at position 2. The document containing WiFi device will be
> seen as a phrase with no word in between hence it should match phrase "WiFi
> device" but it will not whereas "WiFi device"~1 will matched.
>
Let's try to summarise in detail, as this is quite confusing:

1) Index : "WiFi device"
tokenized as you described
[
WiFi    1
Wi      1
WiFi    1
Fi      2
device  3
]

2) Query time simple whitespace tokenized : "WiFi device"
[
WiFi(0)
device(1)
]

In this case, exactly what you quoted will happen.
I should take a look at an old message in the mailing list; I'm pretty sure we
had this very same discussion.
The problem with word expansion is that whatever you do, you are going to
get some side effect.
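
One common compromise (the configuration Shawn describes in the quoted message
further down) is to set preserveOriginal on the index analyzer but not the query
analyzer; a sketch with illustrative attribute values:

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>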

Cheers

> Best,
> Modassar
>
> On Mon, Jan 18, 2016 at 7:57 PM, Shawn Heisey  wrote:
>
> > On 1/18/2016 6:21 AM, Modassar Ather wrote:
> > > Can you please send us tokens you get (and positions) when you analyze
> > > *WiFi device*
> > >
> > > Tokens generated and their respective positions.
> > >
> > > WiFi1
> > > Wi  1
> > > WiFi1
> > > Fi  2
> > > device  3
> >
> > It seems very odd to me that the original value would show up twice with
> > the preserveOriginal parameter set, but I am seeing the same behavior on
> > 4.7 and 5.3.  Because both copies are at the same position, this will
> > not affect search, but will slightly affect relevance if you are not
> > specifying a sort parameter.  Everything else about the analysis looks
> > correct to me, and the positions you see are needed for a phrase query
> > to work correctly.
> >
> > I have seen working configurations where preserveOriginal is set on the
> > index analysis but NOT set on query analysis.  This is how my own schema
> > is configured.  One of the reasons for this configuration is to reduce
> > the number of terms in the query so it is faster than it would be if
> > preserveOriginal were present and generated additional terms.  The
> > preserveOriginal on the index side ensures a match whether mixed case is
> > used or not.
> >
> > Thanks,
> > Shawn
> >
> >
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Returning all documents in a collection

2016-01-20 Thread Salman Ansari
Hi,

I am looking for a way to return all documents from a collection.
Currently, I am restricted to specifying the number of rows using Solr.NET
but I am looking for a better approach to actually return all documents. If
I specify a huge number such as 1M, the processing takes a long time.

Any feedback/comment will be appreciated.

Regards,
Salman


Re: Returning all documents in a collection

2016-01-20 Thread Emir Arnautovic

Hi Salman,
You should use cursors in order to avoid "deep paging issues". Take a 
look at 
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results.
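
A minimal SolrJ-style sketch of the cursor loop, for reference (collection name
and sort field are hypothetical; the sort must include the uniqueKey field, and an
existing SolrClient is assumed):

  SolrQuery q = new SolrQuery("*:*");
  q.setRows(1000);
  q.setSort(SolrQuery.SortClause.asc("id"));
  String cursorMark = CursorMarkParams.CURSOR_MARK_START;   // "*"
  while (true) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = client.query("mycollection", q);
    // process rsp.getResults() ...
    String next = rsp.getNextCursorMark();
    if (cursorMark.equals(next)) break;                     // no new results: done
    cursorMark = next;
  }

The same cursorMark request parameter works over raw HTTP as well.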


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 20.01.2016 12:55, Salman Ansari wrote:

Hi,

I am looking for a way to return all documents from a collection.
Currently, I am restricted to specifying the number of rows using Solr.NET
but I am looking for a better approach to actually return all documents. If
I specify a huge number such as 1M, the processing takes a long time.

Any feedback/comment will be appreciated.

Regards,
Salman





solr score threashold

2016-01-20 Thread sara hajili
hi all,
I want to know about the Solr search relevancy scoring threshold.
Can I change it?
I mean, imagine when I search I get this result:
doc1 score = 8
doc2 score = 6.4
doc3 score = 6
doc8 score = 5.5
doc5 score = 2
I want to change the Solr score threshold, in this way: I set a threshold of,
for example, >4,
and then I don't get doc5 as a result. Can I do this? If yes, how?
And if not, how can I modify the search so I don't get docs as results that
are a long way from the doc with the max score?
In other words, I want to remove this gap between the Solr results.


Re: solr score threashold

2016-01-20 Thread Emir Arnautovic

Hi Sara,
You can use func and frange to achieve what you need, but note that scores are
not normalized, meaning a score of 8 does not mean it is a good match - it is
just the best match. There are examples online of how to normalize scores (e.g.
http://wiki.apache.org/lucene-java/ScoresAsPercentages).
Another approach is to write a custom component that filters out docs
below some threshold.
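
For example, a sketch of the frange approach over the score of the main query
(the cutoff of 4 is just the number from your example):

  q=<your query>
  fq={!frange l=4.0}query($q)

Because scores are not normalized, a fixed cutoff like this tends to behave
differently from one query to the next.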


Thanks,
Emir

On 20.01.2016 13:58, sara hajili wrote:

hi all,
i wanna to know about solr search relevency scoreing threashold.
can i change it?
i mean immagine when i searching i get this result
doc1 score =8
doc2 score =6.4
doc3 score=6
doc8score=5.5
doc5 score=2
i wana to change solr score threashold .in this way i set threashold for
example >4
and then i didn't get doc5 as result.can i do this?if yes how?
and if not how i can modified search to don't get docs as a result that
these docs have a lot distance from doc with max score?
in other word i wanna to delete this gap between solr results



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Returning all documents in a collection

2016-01-20 Thread Salman Ansari
Thanks Emir, Susheel and Jack for your responses. Just to update, I am
using Solr Cloud plus I want to get the data completely without pagination
or cursor (I mean in one shot). Is there a way to do this in Solr?

Regards,
Salman

On Wed, Jan 20, 2016 at 4:49 PM, Jack Krupansky 
wrote:

> Yes, Exporting Results Sets is the preferred and recommended technique for
> returning all documents in a collection, or even simply for queries that
> select a large number of documents, all of which are to be returned. It
> uses efficient streaming rather than paging.
>
> But... this great feature currently does not have support for
> distributed/SolrCloud mode:
> "The initial release treats all queries as non-distributed requests. So the
> client is responsible for making the calls to each Solr instance and
> merging the results.
> Using SolrJ’s CloudSolrClient as a model, developers could build clients
> that automatically send requests to all the shards in a collection (or
> multiple collections) and then merge the sorted sets any way they wish."
>
> -- Jack Krupansky
>
> On Wed, Jan 20, 2016 at 8:41 AM, Susheel Kumar 
> wrote:
>
> > Hello Salman,
> >
> > Please checkout the export functionality
> > https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
> >
> > Thanks,
> > Susheel
> >
> > On Wed, Jan 20, 2016 at 6:57 AM, Emir Arnautovic <
> > emir.arnauto...@sematext.com> wrote:
> >
> > > Hi Salman,
> > > You should use cursors in order to avoid "deep paging issues". Take a
> > look
> > > at
> > https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results.
> > >
> > > Regards,
> > > Emir
> > >
> > > --
> > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > Solr & Elasticsearch Support * http://sematext.com/
> > >
> > >
> > >
> > > On 20.01.2016 12:55, Salman Ansari wrote:
> > >
> > >> Hi,
> > >>
> > >> I am looking for a way to return all documents from a collection.
> > >> Currently, I am restricted to specifying the number of rows using
> > Solr.NET
> > >> but I am looking for a better approach to actually return all
> documents.
> > >> If
> > >> I specify a huge number such as 1M, the processing takes a long time.
> > >>
> > >> Any feedback/comment will be appreciated.
> > >>
> > >> Regards,
> > >> Salman
> > >>
> > >>
> > >
> >
>


Re: Scaling SolrCloud

2016-01-20 Thread Yago Riveiro
Our Zookeeper cluster is an ensemble of 5 machines, is a good starting point,
3 are to risky, you lost one you lost quorum and with 7 sync cost increase.

  

The ZK cluster is on machines with no other IO load and rotating HDDs (don't use
SSDs to gain IO performance; ZooKeeper is optimized for spinning disks).

  

The ZK cluster behaves without problems. The first deploy of ZK was on the
same machines as the Solr cluster (ZK log on its own HDD) and that didn't
work very well; the CPU and network IO from the Solr cluster was too much.

  

About schema modifications.  
  
Modifying the schema to add new fields is relatively simple with the new API; in the
past all the work was manual, uploading the schema to ZK and reloading all
collections (indexing must be disabled or timeouts and funny errors happen).

With the new Schema API this is more user friendly. Anyway, I still stop indexing
and reload the collections (I don't know if it's necessary nowadays).
  
About Indexing data.

  

We have a self-made data importer; it's not Java and it doesn't perform batch indexing
(with 500 collections, buffering data and building the batch is expensive and
complicates error handling).

  

We use regular HTTP POST with JSON. Our throughput is about 1000 docs/s without
any type of optimization. Sometimes we have issues with replication: the replica
can't keep pace with leader insertion and a full sync is requested. This is bad
because syncing the replica again means a lot of IO wait and CPU, and with
100G replicas it takes an hour or more (normally when this happens, we disable
indexing to release IO and CPU so we don't kill the node with a load of 50 or 60).

In this department my advice is "keep it simple"; in the end it is an HTTP POST to
a node of the cluster.
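
For example, a plain JSON update of that kind looks roughly like this (host,
collection and fields are hypothetical; the sketch leaves committing to the
autoCommit settings rather than sending commit=true on every request):

  curl -X POST -H 'Content-Type: application/json' \
    'http://solr-node:8983/solr/collection_2016/update' -d '
  [{"id": "doc-1", "timestamp": "2016-01-20T00:00:00Z", "value": 42}]'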

  

--

/Yago Riveiro

> On Jan 20 2016, at 1:39 pm, Troy Edwards tedwards415...@gmail.com
wrote:  

>

> Thank you for sharing your experiences/ideas.

>

> Yago since you have 8 billion documents over 500 collections, can you share  
what/how you do index maintenance (e.g. add field)? And how are you loading  
data into the index? Any experiences around how Zookeeper ensemble behaves  
with so many collections?

>

> Best,

>

>  
On Tue, Jan 19, 2016 at 6:05 PM, Yago Riveiro yago.rive...@gmail.com  
wrote:

>

>  What I can say is:  
  
  
 * SSD (crucial for performance if the index doesn't fit in memory, and
 it will not fit)
 * Divide and conquer, for that volume of docs you will need more than 6  
 nodes.  
 * DocValues to not stress the java HEAP.  
 * Will you aggregate data? If yes, what is your max
 cardinality? This question is the most important one for sizing the
 memory needs correctly.
 * Latency is important too, which threshold is acceptable before  
 consider a query slow?  
 At my company we are running a 12 terabytes (2 replicas) Solr cluster
with  
 8  
 billion documents spread over 500 collections. For this we have about 12
 machines with SSDs and 32G of RAM each (~24G for the heap).
  
 We don't have a strict need of speed, 30 second query to aggregate 100  
 million  
 documents with 1M of unique keys is fast enough for us, normally the  
 aggregation performance decrease as the number of unique keys increase,  
 with  
 low unique key factor, queries take less than 2 seconds if data is in OS  
 cache.  
  
 Personal recommendations:  
  
 * Sharding is important and smart sharding is crucial, you don't want  
 run queries on data that is not interesting (this slow down queries when  
 the dataset is big).  
 * If you want measure speed do it with about 1 billion documents to  
 simulate something real (real for 10 billion document world).  
 * Index with re-indexing in mind. with 10 billion docs, re-index data  
 takes months ... This is important if you don't use regular features of  
 Solr. In my case I configured Docvalues with disk format (not standard  
 feature in 4.x) and at some point this format was deprecated. Upgrade
Solr  
 to 5.x was an epic 3 months battle to do it without full downtime.  
 * Solr is like your girlfriend, will demand love and care and plenty of  
 space to full-recover replicas that in some point are out of sync, happen
a  
 lot restarting nodes (this is annoying with replicas with 100G), don't  
 underestimate this point. Free space can save your life.  
  
 --
  
 /Yago Riveiro  
  
  On Jan 19 2016, at 11:26 pm, Shawn Heisey
<apa...@elyograg.org>
 wrote:  
  
   
  
  On 1/19/2016 1:30 PM, Troy Edwards wrote:  
  > We are currently "beta testing" a SolrCloud with 2 nodes and 2 shards with
  > 2 replicas each. The number of documents is about 125000.
  >
  > We now want to scale this to about 10 billion documents.
  >
  > What are the steps to prototyping, hardware estimation and stress testing?
  
   
  
  There is no general information available for sizing, because there
are  
 too many factors that will affect the answers. Some of the important  
 information that you need will be 

Re: Returning all documents in a collection

2016-01-20 Thread Joel Bernstein
CloudSolrStream is available in Solr 5. The "search" streaming expression
can be used, or CloudSolrStream can be used directly.

https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions

The export handler does not export stored fields though. It only exports
fields using DocValues caches. So you may need to re-index your data to use
this feature.
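
A sketch of the "search" expression sent to the /stream handler (collection and
field names are hypothetical; the fl fields need docValues, and this assumes a
Solr 5.x release recent enough to ship the /stream handler):

  curl --data-urlencode 'expr=search(mycollection,
                                     q="*:*",
                                     fl="id,field_a",
                                     sort="id asc")' \
    http://localhost:8983/solr/mycollection/stream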

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jan 20, 2016 at 9:29 AM, Salman Ansari 
wrote:

> Thanks Emir, Susheel and Jack for your responses. Just to update, I am
> using Solr Cloud plus I want to get the data completely without pagination
> or cursor (I mean in one shot). Is there a way to do this in Solr?
>
> Regards,
> Salman
>
> On Wed, Jan 20, 2016 at 4:49 PM, Jack Krupansky 
> wrote:
>
> > Yes, Exporting Results Sets is the preferred and recommended technique
> for
> > returning all documents in a collection, or even simply for queries that
> > select a large number of documents, all of which are to be returned. It
> > uses efficient streaming rather than paging.
> >
> > But... this great feature currently does not have support for
> > distributed/SolrCloud mode:
> > "The initial release treats all queries as non-distributed requests. So
> the
> > client is responsible for making the calls to each Solr instance and
> > merging the results.
> > Using SolrJ’s CloudSolrClient as a model, developers could build clients
> > that automatically send requests to all the shards in a collection (or
> > multiple collections) and then merge the sorted sets any way they wish."
> >
> > -- Jack Krupansky
> >
> > On Wed, Jan 20, 2016 at 8:41 AM, Susheel Kumar 
> > wrote:
> >
> > > Hello Salman,
> > >
> > > Please checkout the export functionality
> > > https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
> > >
> > > Thanks,
> > > Susheel
> > >
> > > On Wed, Jan 20, 2016 at 6:57 AM, Emir Arnautovic <
> > > emir.arnauto...@sematext.com> wrote:
> > >
> > > > Hi Salman,
> > > > You should use cursors in order to avoid "deep paging issues". Take a
> > > look
> > > > at
> > > https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> .
> > > >
> > > > Regards,
> > > > Emir
> > > >
> > > > --
> > > > Monitoring * Alerting * Anomaly Detection * Centralized Log
> Management
> > > > Solr & Elasticsearch Support * http://sematext.com/
> > > >
> > > >
> > > >
> > > > On 20.01.2016 12:55, Salman Ansari wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> I am looking for a way to return all documents from a collection.
> > > >> Currently, I am restricted to specifying the number of rows using
> > > Solr.NET
> > > >> but I am looking for a better approach to actually return all
> > documents.
> > > >> If
> > > >> I specify a huge number such as 1M, the processing takes a long
> time.
> > > >>
> > > >> Any feedback/comment will be appreciated.
> > > >>
> > > >> Regards,
> > > >> Salman
> > > >>
> > > >>
> > > >
> > >
> >
>


Re: Scaling SolrCloud

2016-01-20 Thread Troy Edwards
Thank you for sharing your experiences/ideas.

Yago since you have 8 billion documents over 500 collections, can you share
what/how you do index maintenance (e.g. add field)? And how are you loading
data into the index? Any experiences around how Zookeeper ensemble behaves
with so many collections?

Best,


On Tue, Jan 19, 2016 at 6:05 PM, Yago Riveiro 
wrote:

> What I can say is:
>
>
>   * SSD (crucial for performance if the index doesn't fit in memory, and
> it will not fit)
>   * Divide and conquer; for that volume of docs you will need more than 6
> nodes.
>   * DocValues, to not stress the Java heap.
>   * Will you aggregate data? If yes, what is your max
> cardinality? This question is the most important one for sizing the
> memory needs correctly.
>   * Latency is important too; which threshold is acceptable before
> considering a query slow?
> At my company we are running a 12-terabyte (2 replicas) Solr cluster with 8
> billion documents spread over 500 collections. For this we have about 12
> machines with SSDs and 32G of RAM each (~24G for the heap).
>
> We don't have a strict need for speed; a 30-second query to aggregate 100
> million documents with 1M unique keys is fast enough for us. Normally the
> aggregation performance decreases as the number of unique keys increases;
> with a low unique key count, queries take less than 2 seconds if data is in
> the OS cache.
>
> Personal recommendations:
>
>   * Sharding is important and smart sharding is crucial; you don't want to
> run queries on data that is not interesting (this slows down queries when
> the dataset is big).
>   * If you want to measure speed, do it with about 1 billion documents to
> simulate something real (real for a 10-billion-document world).
>   * Index with re-indexing in mind: with 10 billion docs, re-indexing data
> takes months ... This is important if you don't use regular features of
> Solr. In my case I configured DocValues with the disk format (not a standard
> feature in 4.x) and at some point this format was deprecated. Upgrading Solr
> to 5.x was an epic 3-month battle to do it without full downtime.
>   * Solr is like your girlfriend: it will demand love and care, and plenty of
> space to fully recover replicas that at some point are out of sync; this happens
> a lot when restarting nodes (and is annoying with 100G replicas). Don't
> underestimate this point. Free space can save your life.
>
> --
>
> /Yago Riveiro
>
> > On Jan 19 2016, at 11:26 pm, Shawn Heisey apa...@elyograg.org
> wrote:
>
> >
>
> > On 1/19/2016 1:30 PM, Troy Edwards wrote:
>  We are currently "beta testing" a SolrCloud with 2 nodes and 2 shards
> with
>  2 replicas each. The number of documents is about 125000.
> 
>  We now want to scale this to about 10 billion documents.
> 
>  What are the steps to prototyping, hardware estimation and stress
> testing?
>
> >
>
> > There is no general information available for sizing, because there are
> too many factors that will affect the answers. Some of the important
> information that you need will be impossible to predict until you
> actually build it and subject it to a real query load.
>
> >
>
> > https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> >
>
> > With an index of 10 billion documents, you may not be able to precisely
> predict performance and hardware requirements from a small-scale
> prototype. You'll likely need to build a full-scale system on a small
> testbed, look for bottlenecks, ask for advice, and plan on a larger
> system for production.
>
> >
>
> > The hard limit for documents on a single shard is slightly less than
> Java's Integer.MAX_VALUE -- just over two billion. Because deleted
> documents count against this max, about one billion documents per shard
> is the absolute max that should be loaded in practice.
>
> >
>
> > BUT, if you actually try to put one billion documents in a single
> server, performance will likely be awful. A more reasonable limit per
> machine is 100 million ... but even this is quite large. You might need
> smaller shards, or you might be able to get good performance with larger
> shards. It all depends on things that you may not even know yet.
>
> >
>
> > Memory is always a strong driver for Solr performance, and I am speaking
> specifically of OS disk cache -- memory that has not been allocated by
> any program. With 10 billion documents, your total index size will
> likely be hundreds of gigabytes, and might even reach terabyte scale.
> Good performance with indexes this large will require a lot of total
> memory, which probably means that you will need a lot of servers and
> many shards. SSD storage is strongly recommended.
>
> >
>
> > For extreme scaling on Solr, especially if the query rate will be high,
> it is recommended to only have one shard replica per server.
>
> >
>
> > I have just added an "extreme scaling" section to the following wiki
> page, but it's mostly a placeholder right now. I would like 

Re: Returning all documents in a collection

2016-01-20 Thread Susheel Kumar
Hello Salman,

Please checkout the export functionality
https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets

Thanks,
Susheel

On Wed, Jan 20, 2016 at 6:57 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Salman,
> You should use cursors in order to avoid "deep paging issues". Take a look
> at https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results.
>
> Regards,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
> On 20.01.2016 12:55, Salman Ansari wrote:
>
>> Hi,
>>
>> I am looking for a way to return all documents from a collection.
>> Currently, I am restricted to specifying the number of rows using Solr.NET
>> but I am looking for a better approach to actually return all documents.
>> If
>> I specify a huge number such as 1M, the processing takes a long time.
>>
>> Any feedback/comment will be appreciated.
>>
>> Regards,
>> Salman
>>
>>
>


Re: Returning all documents in a collection

2016-01-20 Thread Jack Krupansky
Yes, Exporting Results Sets is the preferred and recommended technique for
returning all documents in a collection, or even simply for queries that
select a large number of documents, all of which are to be returned. It
uses efficient streaming rather than paging.

But... this great feature currently does not have support for
distributed/SolrCloud mode:
"The initial release treats all queries as non-distributed requests. So the
client is responsible for making the calls to each Solr instance and
merging the results.
Using SolrJ’s CloudSolrClient as a model, developers could build clients
that automatically send requests to all the shards in a collection (or
multiple collections) and then merge the sorted sets any way they wish."
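
For reference, a single-shard export is just an HTTP request to the /export
handler; it requires an explicit sort and an fl list of docValues fields
(collection and field names below are hypothetical):

  curl "http://localhost:8983/solr/mycollection/export?q=*:*&sort=id+asc&fl=id,field_a"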

-- Jack Krupansky

On Wed, Jan 20, 2016 at 8:41 AM, Susheel Kumar 
wrote:

> Hello Salman,
>
> Please checkout the export functionality
> https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
>
> Thanks,
> Susheel
>
> On Wed, Jan 20, 2016 at 6:57 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
>
> > Hi Salman,
> > You should use cursors in order to avoid "deep paging issues". Take a
> look
> > at
> https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results.
> >
> > Regards,
> > Emir
> >
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
> > On 20.01.2016 12:55, Salman Ansari wrote:
> >
> >> Hi,
> >>
> >> I am looking for a way to return all documents from a collection.
> >> Currently, I am restricted to specifying the number of rows using
> Solr.NET
> >> but I am looking for a better approach to actually return all documents.
> >> If
> >> I specify a huge number such as 1M, the processing takes a long time.
> >>
> >> Any feedback/comment will be appreciated.
> >>
> >> Regards,
> >> Salman
> >>
> >>
> >
>


Re: Rolling upgrade to 5.4 from 5.0 - "bug" caused by leader changes - is there a workaround?

2016-01-20 Thread Michael Joyner

Unfortunately, it really couldn't wait.

I did a rolling upgrade to the 5.4.1RC2 then downgraded everything to 
5.4.0 and so far everything seems fine.


Couldn't take the cluster down.

On 01/19/2016 05:03 PM, Anshum Gupta wrote:

If you can wait, I'd suggest to be on the bug fix release. It should be out
around the weekend.

On Tue, Jan 19, 2016 at 1:48 PM, Michael Joyner  wrote:


ok,

I just found the 5.4.1 RC2 download, it seems to work ok for a rolling
upgrade.

I will see about downgrading back to 5.4.0 afterwards to be on an official
release ...



On 01/19/2016 04:27 PM, Michael Joyner wrote:


Hello all,

I downloaded 5.4 and started doing a rolling upgrade from a 5.0 solrcloud
cluster and discovered that there seems to be a compatibility issue where
doing a rolling upgrade from pre-5.4 which causes the 5.4 to fail with
unable to determine leader errors.

Is there a work around that does not require taking the cluster down to
upgrade to 5.4? Should I just stay with 5.3 for now? I need to implement
programmatic schema changes in our collection via solrj, and based on what
I'm reading this is a very new feature and requires the latest (or near
latest) solrcloud.

Thanks!

-Mike