Re: Whether SolrCloud can support 2 TB data?

2016-09-24 Thread Erick Erickson
John:

The MapReduceIndexerTool (MRIT, in contrib) is intended for bulk indexing in
a Hadoop ecosystem. This doesn't preclude home-grown setups of course,
but it's available OOB. The only tricky bit is at the end: either you
have your Solr indexes on HDFS, in which case MRIT can merge them into
a live Solr cluster, or you have to copy them from HDFS to your
local-disk indexes (and, of course, get the shards right). It's a
pretty slick utility; it reads from ZooKeeper to understand the number
of shards required and does the whole map/reduce thing to distribute
the work.

As an aside, it uses EmbeddedSolrServer to do _exactly_ the same thing
as indexing to a Solr installation, reading the configs from ZK etc.
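
A minimal sketch of an invocation (the jar location, ZK string, morphline
file and HDFS paths are all placeholders for whatever your cluster uses;
check the flags against your version):

  hadoop jar solr-map-reduce-*.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    --zk-host zk1:2181,zk2:2181,zk3:2181/solr \
    --collection mycollection \
    --output-dir hdfs://namenode/tmp/mrit-out \
    --morphline-file morphline.conf \
    --go-live \
    hdfs://namenode/data/input

The --go-live option is the "merge into a live cluster" step mentioned
above.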

Then there's SparkSolr, a way to index from M/R jobs directly to live
Solr setups. The throughput there is limited by how many docs/second
you can process on each shard, times the number of shards.

BTW, in a highly optimized-for-updates setup I've seen 1M+ docs/second
achieved. Don't try this at home; it takes quite a bit of
infrastructure.

As Yago says, adding replicas imposes a penalty; I've typically
seen 20-30% in terms of indexing throughput. You can ameliorate this
by adding more shards, but that adds other complexities.

But I cannot over-emphasize how much "it depends" (tm). I was setting
up a stupid-simple index where all I wanted was a bunch of docs with
exactly one simple field plus the ID. On my laptop I was seeing 50K
docs/second in a single shard.

Then for another test case I was indexing an ngrammed field (mingram=2,
maxgram=32) and was seeing < 100 docs/second. There's simply no way to
translate from the raw data size to hardware specs, unfortunately.
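
For the curious, a schema sketch of the kind of field that causes this
(the field type name is made up; the filter and its attributes are stock
Solr):

  <fieldType name="text_ngram" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="32"/>
    </analyzer>
  </fieldType>

Every token gets expanded into all of its 2- to 32-character substrings,
which is why indexing throughput collapses.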

Best,
Erick

On Sat, Sep 24, 2016 at 10:48 AM, Toke Eskildsen  
wrote:
> Regarding a 12TB index:
>
> Yago Riveiro  wrote:
>
>> Our cluster is small for the data we hold (12 machines with SSD and 32G of
>> RAM), but we don't need sub-second queries; we need facets with high
>> cardinality (in worst-case scenarios we aggregate 5M unique string values)
>
>> In a peak of inserts we can handle around 25K docs per second without any
>> issue with 2 replicas, without compromising reads or putting a node under
>> stress. Nodes under stress can eject themselves from the ZooKeeper cluster
>> due to a GC pause or a lack of CPU to communicate.
>
> I am surprised that you manage to have this working on that hardware. As you 
> have replicas, it seems to me that you handle 2*12TB of index with 12*32GB of 
> RAM? This is very close to our setup (22TB of index with 320GB of RAM 
> (updated last week from 256GB) per machine), but we benefit hugely from 
> having a static index.
>
> I assume the SSDs are local? How much memory do you use for heap on each 
> machine?
>
> - Toke Eskildsen


Re: Viewing the Cache Stats [SOLR 6.1.0]

2016-09-24 Thread Tomás Fernández Löbbe
That thread is pretty old and probably talking about the old(est) admin UI
(before 4.0). The cache stats can be found by selecting the core in the
dropdown and then "Plugin/Stats".


See
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604180
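
If you prefer to pull the same numbers over HTTP, they are also exposed
by the mbeans handler (the core name here is a placeholder):

  curl "http://localhost:8983/solr/mycore/admin/mbeans?stats=true&cat=CACHE&wt=json"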

Tomás


On Sat, Sep 24, 2016 at 12:14 PM, slee  wrote:

> I'm trying to view the Cache Stats.
> After reading this thread: Cache Stats, I can't
> seem to find the Statistics page in the SOLR Admin.
>
> Should I be installing some plug-in or do some configuration?
>
>
>
>


Re: slow updates/searches

2016-09-24 Thread Erick Erickson
Hmm..

About <1>: Yep, GC is one of the "more art than science" bits of
Java/Solr. Siiigh.

About <2>: that's what autowarming is about, particularly the
filterCache and queryResultCache. My guess is that you have the
autowarm count on those two caches set to zero. Try setting it to some
modest number like 16 or 32. The whole _point_ of those parameters is
to smooth out these kinds of spikes. Additionally, the newSearcher
event (also in solrconfig.xml) is explicitly intended to allow you to
hard-code queries that fill the internal caches as well as warm the
OS's memory-mapped index data from disk; people include facets, sorts
and the like in that event. It's fired every time a new searcher is
opened (i.e. whenever you commit and open a new searcher)...
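
Something like this in solrconfig.xml (the cache sizes, facet field and
sort field are placeholders for whatever your real queries use):

  <filterCache class="solr.FastLRUCache"
               size="512" initialSize="512" autowarmCount="32"/>
  <queryResultCache class="solr.LRUCache"
                    size="512" initialSize="512" autowarmCount="32"/>

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">category</str>
      </lst>
      <lst>
        <str name="q">*:*</str>
        <str name="sort">timestamp desc</str>
      </lst>
    </arr>
  </listener>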

FirstSearcher is for restarts. The difference is that newSearcher
presumes Solr has been running for a while and the autowarm counts
have something to work from. OTOH, when you start Solr there's no
history to autowarm so firstSearcher can be quite a bit more complex
than newSearcher. Practically, most people just copy newSearcher into
firstSearcher on the assumption that restarting Solr is pretty
rare.

about <3> MMap stuff will be controlled by the OS I think. I actually
worked with a much more primitive system at one point that would be
dog-slow during off-hours. Someone wrote an equivalent of a cron job
to tickle the app upon occasion to prevent periodic slowness.
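
These days the equivalent "tickle" is a one-liner (core name and
interval are placeholders; any cheap query keeps the pages warm):

  # crontab entry: cheap query every 15 minutes to keep mmapped index pages resident
  */15 * * * * curl -s "http://localhost:8983/solr/mycore/select?q=*:*&rows=0" > /dev/null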

for a nauseating set of details about hard and soft commits, see:
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best,
Erick


On Sat, Sep 24, 2016 at 11:35 AM, Rallavagu  wrote:
>
>
> On 9/22/16 5:59 AM, Shawn Heisey wrote:
>>
>> On 9/22/2016 5:46 AM, Muhammad Zahid Iqbal wrote:
>>>
>>> Did you find any solution to slow searches? As far as I know the jetty
>>> container default configuration is a bit slow for large production
>>> environments.
>>
>>
>> This might be true for the default configuration that comes with a
>> completely stock jetty downloaded from eclipse.org, but the jetty
>> configuration that *Solr* ships with is adequate for just about any Solr
>> installation.  The Solr configuration may require adjustment as the
>> query load increases, but the jetty configuration usually doesn't.
>>
>> Thanks,
>> Shawn
>>
>
> It turned out to be a "sequence of performance testing sessions" in order to
> locate the slowness. Though I am not completely done with it, here are my
> findings so far. We are using an NRT configuration (warmup count of 0 for
> caches and NRTCachingDirectoryFactory for the index directory).
>
> 1. Essentially, Solr searches (particularly with edismax and relevance)
> generate a lot of "garbage", which makes GC activity kick in more often. This
> becomes even more pronounced when facets are included. This has a huge impact
> on QTimes (I have a 12g heap and configured 6g for NewSize).
>
> 2. After a fresh restart (or core reload) when searches are performed, Solr
> initially "populates" mmap entries and this adds to total QTimes (I have made
> sure that index files are cached at the filesystem layer using vmtouch -
> https://hoytech.com/vmtouch). When the same test is run again with the mmap
> entries populated from previous tests, it shows improved QTimes relative to
> the previous test.
>
> 3. It seems the populated mmap entries are flushed away after a certain idle
> time (not sure if it is controlled by Solr or the underlying OS). This makes
> subsequent searches fetch from "disk" (even though the disk items are cached
> by the OS).
>
> So, what I am going to try next is to tune the field(s) used for facets to
> reduce the index size if possible. Though I am not sure it will have an
> impact, I would also attempt to change the "caches" even though they will be
> invalidated after a softCommit (every 10 minutes in my case).
>
> Any other tips/clues/suggestions are welcome. Thanks.
>


Re: Viewing the Cache Stats [SOLR 6.1.0]

2016-09-24 Thread Erick Erickson
Solr evolves pretty quickly. The link you reference is from 2006,
almost 10 years ago; nothing about that link is relevant at this
point.

Go to http://host:port/solr. Then select a core from the drop-down.
From there, there should be a plugins/stats choice, then the "cache"
section.

Best,
Erick

On Sat, Sep 24, 2016 at 12:14 PM, slee  wrote:
> I'm trying to view the Cache Stats.
> After reading this thread: Cache Stats, I can't
> seem to find the Statistics page in the SOLR Admin.
>
> Should I be installing some plug-in or do some configuration?
>
>
>


Viewing the Cache Stats [SOLR 6.1.0]

2016-09-24 Thread slee
I'm trying to view the Cache Stats.
After reading this thread: Cache Stats, I can't
seem to find the Statistics page in the SOLR Admin.

Should I be installing some plug-in or do some configuration?





Challenges with new Solrcloud Backup/Restore functionality

2016-09-24 Thread Stephen Weiss
Hi everyone,

We're very excited about SolrCloud's new backup / restore collection APIs, 
which should introduce some major new efficiencies into our indexing workflow.  
Unfortunately, we've run into some snags with it that are preventing us from 
moving into production.  I was hoping someone on the list could help.

1) Data inconsistencies

There seems to be a problem getting all the data consistently.  Sometimes, the 
backup will contain all of the data in the collection, and sometimes, large 
portions of the collection (as much as 40%) will be missing.  We haven't quite 
figured out what might cause this yet, although one thing I've noticed is the 
chances of success are greater when we are only backing up one collection at a 
time.  Unfortunately, for our workflow, it will be difficult to make that work, 
and there still doesn't seem to be a guarantee of success either way.

2) Shards are not distributed

To make matters worse, for some reason, any type of restore operation always 
seems to put all shards of the collection on the same node.  We've tried 
setting maxShardsPerNode to 1 in the restore command, but this has no effect.  
We are seeing the same behavior on both 6.1 and 6.2.1.  No matter what we do, 
all the shards always go to the same node - and it's not even the node that we 
execute the restore request on, but oddly enough, a totally different node, and 
always the same one (the 4th one).  It should be noted that all nodes of our 8 
node cloud are up and totally functional when this happens.
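
For reference, the restore call we're issuing is roughly this (collection,
backup name and location are placeholders; the parameters themselves are
the stock Collections API ones):

  curl "http://node1:8983/solr/admin/collections?action=RESTORE&name=mybackup&collection=mycollection&location=/mnt/backups&maxShardsPerNode=1"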

To work around this, we wrote up a quick script to create an empty collection, 
which always distributes itself across the cloud quite well (another indication 
that there's nothing wrong with the nodes themselves), and then we rsync the 
individual shards' data into the empty shards and reload the collection.  This 
works fine, however, because of the data inconsistencies mentioned above, we 
can't really move forward anyway.
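
In case it helps anyone, the workaround is roughly this (names, shard
count and paths are illustrative, and the backup directory layout may
differ by version, so treat it as a sketch rather than a recipe):

  # 1. create an empty collection; this distributes shards across the cloud normally
  curl "http://node1:8983/solr/admin/collections?action=CREATE&name=restored&numShards=8&maxShardsPerNode=1&collection.configName=myconf"

  # 2. rsync each backed-up shard's index into the matching empty core
  rsync -a /backups/mybackup/snapshot.shard1/ node2:/var/solr/data/restored_shard1_replica1/data/index/

  # 3. reload so the cores pick up the copied segments
  curl "http://node1:8983/solr/admin/collections?action=RELOAD&name=restored"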


Problem #2, we have a reasonable workaround for, but problem #1 we do not.  If 
anyone has any thoughts about either of these problems, I would be very 
grateful.  Thanks!

--
Steve





Re: slow updates/searches

2016-09-24 Thread Rallavagu



On 9/22/16 5:59 AM, Shawn Heisey wrote:
> On 9/22/2016 5:46 AM, Muhammad Zahid Iqbal wrote:
>> Did you find any solution to slow searches? As far as I know the jetty
>> container default configuration is a bit slow for large production
>> environments.
>
> This might be true for the default configuration that comes with a
> completely stock jetty downloaded from eclipse.org, but the jetty
> configuration that *Solr* ships with is adequate for just about any Solr
> installation.  The Solr configuration may require adjustment as the
> query load increases, but the jetty configuration usually doesn't.
>
> Thanks,
> Shawn



It turned out to be a "sequence of performance testing sessions" in
order to locate the slowness. Though I am not completely done with it,
here are my findings so far. We are using an NRT configuration (warmup
count of 0 for caches and NRTCachingDirectoryFactory for the index
directory).


1. Essentially, Solr searches (particularly with edismax and relevance)
generate a lot of "garbage", which makes GC activity kick in more often.
This becomes even more pronounced when facets are included. This has a
huge impact on QTimes (I have a 12g heap and configured 6g for NewSize).
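
For reference, in solr.in.sh that corresponds to something like this (a
sketch of my current settings, not a recommendation):

  SOLR_HEAP="12g"
  GC_TUNE="-XX:NewSize=6g -XX:MaxNewSize=6g"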


2. After a fresh restart (or core reload) when searches are performed,
Solr initially "populates" mmap entries and this adds to total
QTimes (I have made sure that index files are cached at the filesystem
layer using vmtouch - https://hoytech.com/vmtouch). When the same test
is run again with the mmap entries populated from previous tests, it
shows improved QTimes relative to the previous test.
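
The vmtouch invocation is simple (the index path is a placeholder for
wherever your core keeps its data):

  # load all index files into the page cache and report residency
  vmtouch -t /var/solr/data/mycore/data/index
  # later: check how much of the index is still resident
  vmtouch /var/solr/data/mycore/data/index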


3. It seems the populated mmap entries are flushed away after a certain
idle time (not sure if it is controlled by Solr or the underlying OS).
This makes subsequent searches fetch from "disk" (even though the disk
items are cached by the OS).


So, what I am going to try next is to tune the field(s) used for facets
to reduce the index size if possible. Though I am not sure it will have
an impact, I would also attempt to change the "caches" even though they
will be invalidated after a softCommit (every 10 minutes in my case).


Any other tips/clues/suggestions are welcome. Thanks.



Re: Whether SolrCloud can support 2 TB data?

2016-09-24 Thread Toke Eskildsen
Regarding a 12TB index:

Yago Riveiro  wrote:

> Our cluster is small for the data we hold (12 machines with SSD and 32G of
> RAM), but we don't need sub-second queries; we need facets with high
> cardinality (in worst-case scenarios we aggregate 5M unique string values)

> In a peak of inserts we can handle around 25K docs per second without any
> issue with 2 replicas, without compromising reads or putting a node under
> stress. Nodes under stress can eject themselves from the ZooKeeper cluster
> due to a GC pause or a lack of CPU to communicate.

I am surprised that you manage to have this working on that hardware. As you 
have replicas, it seems to me that you handle 2*12TB of index with 12*32GB of 
RAM? This is very close to our setup (22TB of index with 320GB of RAM (updated 
last week from 256GB) per machine), but we benefit hugely from having a static 
index.

I assume the SSDs are local? How much memory do you use for heap on each 
machine?

- Toke Eskildsen


Re: Whether SolrCloud can support 2 TB data?

2016-09-24 Thread Toke Eskildsen
John Bickerstaff  wrote:
> As an aside - I just spoke with someone the other day who is using Hadoop
> for re-indexing in order to save a lot of time.

If you control which documents go into which shards, then that is certainly a 
possibility. We have a collection with a long re-indexing time (about 20 CPU-core 
years), but we are able to build the shards independently of each other, so it 
scales near-perfectly with more hardware. The cheat is that our documents are 
never updated, so everything is always new and just appended to the latest 
shard being built. We don't use Hadoop, but the principle is the same.

- Toke Eskildsen


Re: Whether SolrCloud can support 2 TB data?

2016-09-24 Thread John Bickerstaff
As an aside - I just spoke with someone the other day who is using Hadoop
for re-indexing in order to save a lot of time.  I don't know the details, but
I assume they're using Hadoop to call Lucene code and index documents using
the map-reduce approach...

This was made in their own shop - I don't think the code is available as
open source, but it works for them as a way to really cut down re-indexing
time for extremely large data sets.

On Sat, Sep 24, 2016 at 8:15 AM, Yago Riveiro 
wrote:

>  "LucidWorks achieved 150k docs/second"
>
>
>
> This is only valid is you don't have replication, I don't know your use
> case,
> but a realistic use case normally use some type of redundancy to not lost
> data
> in a hardware failure, at least 2 replicas, more implicates a reduction of
> throughput. Also don't forget that in an realistic use case you should
> handle
> reads too.
>
> Our cluster is small for the data we hold (12 machines with SSD and 32G of
> RAM), but we don't need sub-second queries, we need facet with high
> cardinality (in worst case scenarios we aggregate 5M unique string values)
>
> As Shawn probably told you, sizing your cluster is a try and error path.
> Our
> cluster is optimize to handle a low rate of reads, facet queries and a high
> rate of inserts.
>
> In a peak of inserts we can handle around 25K docs per second without any
> issue with 2 replicas and without compromise reads or put a node in stress.
> Nodes in stress can eject him selfs from the Zookepeer cluster due a GC or
> a
> lack of CPU to communicate.
>
> If you want accuracy data you need to do test.
>
> Keep in mind the most important thing about solr in my opinion, in a
> terabyte
> scale any field type schema change or LuceneCodec change will force you to
> do
> a full reindex. Each time I need to update Solr to a major release it's a
> pain
> in the ass to convert the segments if are not compatible with newer
> version.
> This can take months, will not ensure your data will be equal that a clean
> index (voodoo magic thing can happen, thrust me), and it will drain a huge
> amount of hardware resources to do it without downtime.
>
>
> \--
>
>
>
> /Yago Riveiro
>
> ![](https://link.nylas.com/open/m7fkqw0yim04itb62itnp7r9/local-277ee09e-
> 1aee?r=c29sci11c2VyQGx1Y2VuZS5hcGFjaGUub3Jn)
>
>
> On Sep 24 2016, at 7:48 am, S G  wrote:
>
> > Hey Yago,
>
> >
>
> > 12 T is very impressive.
>
> >
>
> > Can you also share some numbers about the shards, replicas, machine
> count/specs and docs/second for your case?
> I think you would not be having a single index of 12 TB too. So some
> details on that would be really helpful too.
>
> >
>
> > https://lucidworks.com/blog/2014/06/03/introducing-the-
> solr-scale-toolkit/
> is a good post how LucidWorks achieved 150k docs/second.
> If you have any such similar blog, that would be quite useful and popular
> too.
>
> >
>
> > \--SG
>
> >
>
> > On Fri, Sep 23, 2016 at 5:00 PM, Yago Riveiro 
> wrote:
>
> >
>
> > > In my company we have a SolrCloud cluster with 12T.
> >
> > My advices:
> >
> > Be nice with CPU you will needed in some point (very important if you
> have
> > not control over the kind of queries to the cluster, clients are greedy,
> > the want all results at the same time)
> >
> > SSD and memory (as many as you can afford if you will do facets)
> >
> > Full recoveries are a pain, network it's important and should be as fast
> > as possible, never less than 1Gbit.
> >
> > Divide and conquer, but too much can drive you to an expensive overhead,
> > data travels over the network. Find the sweet point (only testing you use
> > case you will know)
> >
> > \--
> >
> > /Yago Riveiro
> >
> > On 23 Sep 2016, 23:44 +0100, Pushkar Raste ,
> > wrote:
> > > Solr is RAM hungry. Make sure that you have enough RAM to have most if
> > the
> > > index of a core in the RAM itself.
> > >
> > > You should also consider using really good SSDs.
> > >
> > > That would be a good start. Like others said, test and verify your
> setup.
> > >
> > > \--Pushkar Raste
> > >
> > > On Sep 23, 2016 4:58 PM, "Jeffery Yuan"  wrote:
> > >
> > > Thanks so much for your prompt reply.
> > >
> > > We are definitely going to use SolrCloud.
> > >
> > > I am just wondering whether SolrCloud can scale even at TB data level
> and
> > > what kind of hardware configuration it should be.
> > >
> > > Thanks.
> > >
> > >
> > >
> > > \--
> > > View this message in context: [http://lucene.472066.n3.](htt
> p://lucene.472
> 066.n3.=c29sci11c2VyQGx1Y2VuZS5hcGFjaGUub3Jn)
> > > nabble.com/Whether-solr-can-support-2-TB-data-tp4297790p4297800.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>


Re: Unsubscribe from mailing list

2016-09-24 Thread Customer

LOL, and you are a senior engineer?


On 23/09/16 23:00, Khalid Galal wrote:

Please, I need to unsubscribe from this mailing list. Thanks.





Re: Whether solr can support 2 TB data?

2016-09-24 Thread Toke Eskildsen
Jeffery Yuan  wrote:
>  In our application, every day there is about 800mb of raw data; we are going
> to store this data for 5 years, so it's about 1 or 2 TB of data.

>  I am wondering whether solr can support this much data?

Yes it can.

Or rather: You could probably construct a scenario where it is not feasible, 
but you would have to be very creative.

>  Usually, how much data do we store per node, how many nodes can we have in
> a SolrCloud, and what hardware configuration should each node have?

As Shawn states, it is very hard to give advice on hardware (and I applaud him 
for refraining from giving the usual "free RAM == index size" advice). 
However, we love to guesstimate; to do that you really need to provide more 
details.


2TB of index that has hundreds of concurrent users, thousands of updates per 
seconds and heavy aggregations (grouping, faceting, streaming...) is a task 
that takes experimentation and beefy hardware.

2TB of index that is rarely updated and accessed by a few people at a time, 
which are okay with multi-second response times, can be handled by a 
desktop-class machine with SSDs.


Tell us about query types, update rates, latency requirements, document types 
and concurrent users. Then we can begin to guess.

- Toke Eskildsen


Re: Whether SolrCloud can support 2 TB data?

2016-09-24 Thread Yago Riveiro
 "LucidWorks achieved 150k docs/second"

  

This is only valid is you don't have replication, I don't know your use case,
but a realistic use case normally use some type of redundancy to not lost data
in a hardware failure, at least 2 replicas, more implicates a reduction of
throughput. Also don't forget that in an realistic use case you should handle
reads too.  
  
Our cluster is small for the data we hold (12 machines with SSD and 32G of
RAM), but we don't need sub-second queries, we need facet with high
cardinality (in worst case scenarios we aggregate 5M unique string values)  
  
As Shawn probably told you, sizing your cluster is a try and error path. Our
cluster is optimize to handle a low rate of reads, facet queries and a high
rate of inserts.  
  
In a peak of inserts we can handle around 25K docs per second without any
issue with 2 replicas and without compromise reads or put a node in stress.
Nodes in stress can eject him selfs from the Zookepeer cluster due a GC or a
lack of CPU to communicate.  
  
If you want accuracy data you need to do test.  
  
Keep in mind the most important thing about solr in my opinion, in a terabyte
scale any field type schema change or LuceneCodec change will force you to do
a full reindex. Each time I need to update Solr to a major release it's a pain
in the ass to convert the segments if are not compatible with newer version.
This can take months, will not ensure your data will be equal that a clean
index (voodoo magic thing can happen, thrust me), and it will drain a huge
amount of hardware resources to do it without downtime.

  
--

/Yago Riveiro

  
On Sep 24 2016, at 7:48 am, S G  wrote:

> Hey Yago,
>
> 12 T is very impressive.
>
> Can you also share some numbers about the shards, replicas, machine
> count/specs and docs/second for your case?
> I think you would not be having a single index of 12 TB too. So some
> details on that would be really helpful too.
>
> https://lucidworks.com/blog/2014/06/03/introducing-the-solr-scale-toolkit/
> is a good post on how LucidWorks achieved 150k docs/second.
> If you have any such similar blog, that would be quite useful and popular
> too.
>
> --SG
>
> On Fri, Sep 23, 2016 at 5:00 PM, Yago Riveiro  wrote:
>
> > In my company we have a SolrCloud cluster with 12T.
> >
> > My advice:
> >
> > Be nice with CPU; you will need it at some point (very important if you
> > have no control over the kind of queries to the cluster; clients are
> > greedy, they want all results at the same time)
> >
> > SSD and memory (as much as you can afford if you will do facets)
> >
> > Full recoveries are a pain; the network is important and should be as
> > fast as possible, never less than 1Gbit.
> >
> > Divide and conquer, but too much of it can drive you to an expensive
> > overhead; data travels over the network. Find the sweet spot (only by
> > testing your use case will you know)
> >
> > --
> >
> > /Yago Riveiro
> >
> > On 23 Sep 2016, 23:44 +0100, Pushkar Raste , wrote:
> > > Solr is RAM hungry. Make sure that you have enough RAM to hold most of
> > > the index of a core in RAM itself.
> > >
> > > You should also consider using really good SSDs.
> > >
> > > That would be a good start. Like others said, test and verify your
> > > setup.
> > >
> > > --Pushkar Raste
> > >
> > > On Sep 23, 2016 4:58 PM, "Jeffery Yuan"  wrote:
> > >
> > > Thanks so much for your prompt reply.
> > >
> > > We are definitely going to use SolrCloud.
> > >
> > > I am just wondering whether SolrCloud can scale even at TB data level
> > > and what kind of hardware configuration it should be.
> > >
> > > Thanks.



Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

2016-09-24 Thread Alexandre Rafalovitch
Yes, SWAP will switch which core the name points to, for a non-Cloud setup.

Just remember that your directory name does not get renamed when you
delete the old one; just the core name in the core.properties file changes.
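
For reference, the CoreAdmin call looks like this (core names are
placeholders):

  curl "http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=staging"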

Regards,
   Alex

On 24 Sep 2016 10:28 AM, "slee"  wrote:

Erick / Alex,

I want to thank you both. Your hints got me to understand SOLR a bit better. I
ended up with the reversed wildcard filter, and it speeds up performance a
lot. That's what I'm expecting from SOLR...  I also no longer experience the
huge memory hog.

The only downside I can think of is that you need to re-index when you change
the schema. But I can live with that, since I'll have 2 machines where one
is for reading and the other one is for indexing... I'll swap when the indexing
is done. I presume that's what the swap in the Admin UI is for, right?
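
P.S. For anyone searching later, the schema piece in question is essentially
the stock ReversedWildcardFilterFactory example (the type name here is made
up):

  <fieldType name="text_rev" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
              maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>

Each token is also indexed in reversed form, so a leading-wildcard query like
*foo can be rewritten into a fast prefix query over the reversed terms.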





Re: Whether SolrCloud can support 2 TB data?

2016-09-24 Thread S G
Hey Yago,

12 T is very impressive.

Can you also share some numbers about the shards, replicas, machine
count/specs and docs/second for your case?
I think you would not be having a single index of 12 TB too. So some
details on that would be really helpful too.

https://lucidworks.com/blog/2014/06/03/introducing-the-solr-scale-toolkit/
is a good post on how LucidWorks achieved 150k docs/second.
If you have any such similar blog, that would be quite useful and popular
too.

--SG

On Fri, Sep 23, 2016 at 5:00 PM, Yago Riveiro 
wrote:

> In my company we have a SolrCloud cluster with 12T.
>
> My advice:
>
> Be nice with CPU; you will need it at some point (very important if you have
> no control over the kind of queries to the cluster; clients are greedy,
> they want all results at the same time)
>
> SSD and memory (as much as you can afford if you will do facets)
>
> Full recoveries are a pain; the network is important and should be as fast
> as possible, never less than 1Gbit.
>
> Divide and conquer, but too much of it can drive you to an expensive
> overhead; data travels over the network. Find the sweet spot (only by
> testing your use case will you know)
>
> --
>
> /Yago Riveiro
>
> On 23 Sep 2016, 23:44 +0100, Pushkar Raste ,
> wrote:
> > Solr is RAM hungry. Make sure that you have enough RAM to hold most of
> > the index of a core in RAM itself.
> >
> > You should also consider using really good SSDs.
> >
> > That would be a good start. Like others said, test and verify your setup.
> >
> > --Pushkar Raste
> >
> > On Sep 23, 2016 4:58 PM, "Jeffery Yuan"  wrote:
> >
> > Thanks so much for your prompt reply.
> >
> > We are definitely going to use SolrCloud.
> >
> > I am just wondering whether SolrCloud can scale even at TB data level and
> > what kind of hardware configuration it should be.
> >
> > Thanks.
> >
> >
> >
>