Re: Avoid re-indexing
You do not want to add a new shard. First, you want your docs evenly spread; second, they are spread using hash ranges, and to add more capacity you spread out those hash ranges using shard splitting. Adding a new shard doesn't really make any sense here, unless you go for implicit routing, where you decide for yourself which shard a doc goes into, but it seems too late to make that decision in your case.

Upayavira

On Sun, Aug 2, 2015, at 12:40 AM, Nagasharath wrote:

Yes, shard splitting will only help in managing large clusters and improving query performance. In my case the index is fully grown (no capacity left in the existing shards across the collection), so adding a new shard will help, and for that I have to re-index.

On 01-Aug-2015, at 6:34 pm, Upayavira u...@odoko.co.uk wrote:

Erm, that doesn't seem to make sense. Seems like you are talking about *merging* shards. Say you had two shards, 3m docs each:

shard1: 3m docs
shard2: 3m docs

If you split shard1, you would have:

shard1_0: 1.5m docs
shard1_1: 1.5m docs
shard2: 3m docs

You could, of course, then split shard2. You could also split shard1 into three parts instead, if you preferred:

shard1_0: 1m docs
shard1_1: 1m docs
shard1_2: 1m docs
shard2: 3m docs

Upayavira

On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:

If my current shard is holding 3 million documents, will the new subshard after splitting also be able to hold 3 million documents? If that is the case, then after shard splitting the subshards should together hold 6 million documents when a shard is split in two. Am I right?

On 01-Aug-2015, at 5:43 pm, Upayavira u...@odoko.co.uk wrote:

On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:

I am using SolrJ to index documents. I agree with you regarding the index update, but I should not see any deleted documents as it is a fresh index. Can we actually identify what those deleted documents are?

If you post doc 1234, then you post doc 1234 a second time, you will see a deletion in your index.
If you don't want deletions to show in your index, be sure NEVER to update a document; only add new ones with absolutely distinct document IDs. You cannot see (via Solr) which docs are deleted. You could, I suppose, introspect the Lucene index, but that would most definitely be an expert task.

If there is no option of adding shards to an existing collection, I do not like the idea of re-indexing the whole data set (worth hours). We had gone with a good number of shards, but there has been a rapid increase in data size over the past few days. Do you think it is worth logging a ticket?

You can split a shard. See the collections API:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3

What would you want to log a ticket for? I'm not sure that there's anything that would require that.

Upayavira
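The hash-range routing described above can be sketched in a few lines. This is a simplified illustration, not Solr's actual implementation (SolrCloud's compositeId router hashes ids with MurmurHash3 over the 32-bit hash space; a generic hash stands in here), but the splitting idea is the same: each shard owns a contiguous hash range, and SPLITSHARD divides one shard's range in two, leaving every other shard untouched.

```java
public class HashRangeSketch {
    // A shard owns a contiguous slice of the 32-bit hash space.
    static class Shard {
        final String name;
        final long min, max; // inclusive range within [0, 2^32)
        Shard(String name, long min, long max) { this.name = name; this.min = min; this.max = max; }
        boolean owns(long hash) { return hash >= min && hash <= max; }
        // Splitting replaces one range with its two halves; other shards are untouched.
        Shard[] split() {
            long mid = min + (max - min) / 2;
            return new Shard[] { new Shard(name + "_0", min, mid), new Shard(name + "_1", mid + 1, max) };
        }
    }

    // Stand-in for Solr's MurmurHash3-based router: any stable hash works for the sketch.
    static long hashOf(String docId) {
        return docId.hashCode() & 0xffffffffL; // unsigned 32-bit view
    }

    static String route(String docId, Shard[] shards) {
        long h = hashOf(docId);
        for (Shard s : shards) {
            if (s.owns(h)) return s.name;
        }
        throw new IllegalStateException("uncovered hash: " + h);
    }

    public static void main(String[] args) {
        long top = (1L << 32) - 1;
        Shard shard1 = new Shard("shard1", 0, top / 2);
        Shard shard2 = new Shard("shard2", top / 2 + 1, top);
        Shard[] before = { shard1, shard2 };

        Shard[] halves = shard1.split();
        Shard[] after = { halves[0], halves[1], shard2 };

        // Invariant: a doc routed to shard1 lands in a shard1_* subshard; shard2 docs stay put.
        for (int i = 0; i < 1000; i++) {
            String id = "doc-" + i;
            String b = route(id, before);
            String a = route(id, after);
            if (b.equals("shard1") && !a.startsWith("shard1_")) throw new AssertionError(id);
            if (b.equals("shard2") && !a.equals("shard2")) throw new AssertionError(id);
        }
        System.out.println("every doc stays inside its original hash range after the split");
    }
}
```

This is also why "adding a new shard" makes no sense for hash routing: there is no free hash range for it to own, so only splitting an existing range adds capacity.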
Re: solr multicore vs sharding vs 1 big collection
The document contains around 30 fields, and about 15 of them have stored set to true. These stored fields are queried and updated all the time. You will notice that the deleted documents are almost 30% of the docs, and that percentage has stayed there and has not come down. I did try optimize, but that was disruptive as it caused search errors. I have been playing with the merge factor to see if that helps with deleted documents or not; it is currently set to 5.

The server has 24 GB of memory, of which consumption is normally around 23 GB, and the JVM is set to 6 GB. I have noticed that the available memory on the server drops to 100 MB at times during the day. All the updates are run through DIH.

Every day, at least once, I see the following error, which results in search errors on the front end of the site:

ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts; my timeout is set to 30 seconds and I can't set it to a higher number. I was thinking that the high memory usage sometimes leads to bad performance/errors.

My objective is to stop the errors; adding more memory to the server is not a good scaling strategy. That is why I was thinking maybe there is an issue with the way things are set up and it needs to be revisited.

Thanks

On Sat, Aug 1, 2015 at 7:06 PM, Shawn Heisey apa...@elyograg.org wrote:

On 8/1/2015 6:49 PM, Jay Potharaju wrote:

I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing. The data in the collection is an amalgamation of more than 1000 customer records. The number of documents per customer is around 100,000 records on average. That being said, I'm trying to get a handle on the growing deleted-document size.
Because of the growing index size, both disk space and memory are being used up, and I would like to reduce it to a manageable size. I have been thinking of splitting the data into multiple cores, one for each customer. This would allow me to manage the smaller collections easily and to create/update them quickly. My concern is that the number of collections might become an issue. Any suggestions on how to address this problem? What are my other alternatives to moving to multiple collections?

Solr: 4.9
Index size: 25 GB
Max doc: 40 million
Doc count: 29 million
Replication: 4
4 servers in SolrCloud.

Creating 1000+ collections in SolrCloud is definitely problematic. If you need to choose between a lot of shards and a lot of collections, I would definitely go with a lot of shards. I would also want a lot of servers for an index with that many pieces.

https://issues.apache.org/jira/browse/SOLR-7191

I don't think it would matter how many collections or shards you have when it comes to how many deleted documents are in your index. If you want to clean up a large number of deletes in an index, the best option is an optimize. An optimize requires a large amount of disk I/O, so it can be extremely disruptive if the query volume is high. It should be done when the query volume is at its lowest. For the index you describe, a nightly or weekly optimize seems like a good option.

Aside from having a lot of deleted documents in your index, what kind of problems are you trying to solve?

Thanks,
Shawn

--
Thanks
Jay Potharaju
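The deleted-document ratio discussed in this thread can be computed from two numbers Solr already reports per index: maxDoc (live plus deleted documents) and numDocs (live documents only). A quick sketch of the arithmetic, using the figures quoted above:

```java
public class DeletedDocs {
    // maxDoc counts live + deleted documents; numDocs counts live documents only.
    static double deletedPercent(long maxDoc, long numDocs) {
        return 100.0 * (maxDoc - numDocs) / maxDoc;
    }

    public static void main(String[] args) {
        // Thread figures: Max doc 40 million, Doc count 29 million.
        double pct = deletedPercent(40_000_000L, 29_000_000L);
        System.out.printf("%.1f%% deleted%n", pct); // prints "27.5% deleted"
    }
}
```

27.5% is consistent with the "almost 30%" observation earlier in the thread; an optimize rewrites the index with only the 29 million live documents.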
Are Solr releases predictable? Every 2 months?
When is 5.3 coming out? When is SOLR-6273 https://issues.apache.org/jira/browse/SOLR-6273 (Cross Data Center Replication) going to be released? Any way to tell?
Re: solr multicore vs sharding vs 1 big collection
On 8/2/2015 8:29 AM, Jay Potharaju wrote:

The document contains around 30 fields, and about 15 of them have stored set to true. These stored fields are queried and updated all the time. You will notice that the deleted documents are almost 30% of the docs, and that percentage has stayed there and has not come down. I did try optimize, but that was disruptive as it caused search errors. I have been playing with the merge factor to see if that helps with deleted documents or not; it is currently set to 5. The server has 24 GB of memory, of which consumption is normally around 23 GB, and the JVM is set to 6 GB. I have noticed that the available memory on the server drops to 100 MB at times during the day. All the updates are run through DIH.

Using all available memory is completely normal operation for ANY operating system. If you hold up Windows as an example of one that doesn't ... it lies to you about available memory. All modern operating systems will utilize memory that is not explicitly allocated as the OS disk cache. The disk cache will instantly give up any of the memory it is using to programs that request it. Linux doesn't try to hide the disk cache from you, but older versions of Windows do. In the newer versions of Windows that have the Resource Monitor, you can go there to see the actual memory usage, including the cache.

Every day at least once I see the following error, which results in search errors on the front end of the site:

ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts; my timeout is set to 30 seconds and I can't set it to a higher number. I was thinking that high memory usage sometimes leads to bad performance/errors.

Although this error can be caused by timeouts, it has a specific meaning.
It means that the client disconnected before Solr responded to the request, so when Solr tried to respond (through Jetty), it found a closed TCP connection. Client timeouts need to either be completely removed, or set to a value much longer than any request will take. Five minutes is a good starting value.

If your client timeout is set to 30 seconds and you are seeing EofExceptions, that means that your requests are taking longer than 30 seconds, and you likely have some performance issues. It's also possible that some of your client timeouts are set a lot shorter than 30 seconds.

My objective is to stop the errors; adding more memory to the server is not a good scaling strategy. That is why I was thinking maybe there is an issue with the way things are set up and it needs to be revisited.

You're right that adding more memory to the servers is not a good scaling strategy for the general case ... but in this situation, I think it might be prudent. For your index and heap sizes, I would want the company to pay for at least 32GB of RAM. Having said that ... I've seen Solr installs work well with a LOT less memory than the ideal. I don't know that adding more memory is necessary, unless your system (CPU, storage, and memory speeds) is particularly slow. Based on your document count and index size, your documents are quite small, so I think your memory size is probably good -- if the CPU, memory bus, and storage are very fast. If one or more of those subsystems aren't fast, then make up the difference with lots of memory.

Some light reading, where you will learn why I think 32GB is an ideal memory size for your system:
https://wiki.apache.org/solr/SolrPerformanceProblems

It is possible that your 6GB heap is not quite big enough for good performance, or that your GC is not well-tuned. These topics are also discussed on that wiki page.
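The 32GB recommendation reduces to back-of-the-envelope arithmetic. A sketch of the rule of thumb from that wiki page (an ideal target, not a hard requirement): total RAM should cover the JVM heap plus enough OS disk cache to hold the whole index.

```java
public class MemorySizing {
    // Rule of thumb: RAM ~ JVM heap + enough OS disk cache to hold the index.
    static double idealRamGb(double heapGb, double indexGb) {
        return heapGb + indexGb;
    }

    public static void main(String[] args) {
        // Thread figures: 6 GB heap, 25 GB index -> 31 GB, hence "at least 32GB of RAM".
        System.out.println(idealRamGb(6, 25) + " GB");
    }
}
```

Solr can run well below this ideal when CPU, memory bus, and storage are fast, as noted above; the rule describes the point where the index is fully cacheable and disk reads mostly disappear.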
If you increase your heap size, then the likelihood of needing more memory in the system becomes greater, because there will be less memory available for the disk cache.

Thanks,
Shawn
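The client-timeout advice above looks like this with plain java.net (a sketch; the endpoint URL is hypothetical, and SolrJ clients have their own timeout configuration):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr query endpoint; openConnection() performs no network I/O yet.
        URL url = new URL("http://localhost:8983/solr/collection1/select?q=*:*");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        conn.setConnectTimeout(15_000);  // fail fast if the server is unreachable
        conn.setReadTimeout(300_000);    // five minutes, per the advice above

        System.out.println(conn.getReadTimeout()); // prints 300000
    }
}
```

With a read timeout well above any realistic query time, an EofException in the Solr log then points at a genuinely dropped connection rather than an impatient client.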
Re: Are Solr releases predictable? Every 2 months?
They are not that predictable. Somebody has to volunteer to be a release manager, and then there is a flurry of cleanups, release candidates, etc. You can see all that on the Lucene dev mailing list. For example, a 5.3 release was proposed (as an idea) on July 30th, but not much has happened since. Still, it must be fairly close. The specific JIRA is so far trunk-only and not backported to 5_x, so it is very unlikely to make 5.3, in my mind.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 2 August 2015 at 06:37, Gili Nachum gilinac...@gmail.com wrote:

When is 5.3 coming out? When is SOLR-6273 https://issues.apache.org/jira/browse/SOLR-6273 (Cross Data Center Replication) going to be released? Any way to tell?
How to use BitDocSet within a PostFilter
Hi everyone,

I'm trying to write a PostFilter for Solr 5.1.0 that crawls through grandchild documents during a search over the parents and filters out documents based on statistics gathered by aggregating the grandchildren together. I've been successful in getting the logic correct, but it does not perform well: I'm grabbing too many documents from the index along the way. I'm trying to filter out grandchild documents which are not relevant to the statistics I'm collecting, in order to reduce the number of document objects pulled from the IndexReader. I've implemented the following in my DelegatingCollector.collect:

    if (inStockSkusBitSet == null) {
        SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS; // cast from IndexSearcher to expose getDocSet
        inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
        inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet; // cast from DocSet to expose getBits
        inStockSkusBitSet = inStockSkusBitDocSet.getBits();
    }

My BitDocSet reports a size which matches a standard query for the more limited set of grandchildren, and the FixedBitSet (inStockSkusBitSet) also reports this same cardinality. Based on that, the getDocSet call itself must be working properly and returning the right number of documents. However, when I try to filter out grandchild documents using either BitDocSet.exists or BitSet.get (skipping any grandchild document which doesn't exist in the BitDocSet or return true from the bit set), I get about 1/3 fewer results than I'm supposed to. Many documents that should match the filter are being excluded, and documents which should not match the filter are being included.
I'm trying to use it in either of these ways:

    if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
    if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;

The currentChildDocNumber is simply the doc number passed to DelegatingCollector.collect, decremented until I hit a document that doesn't belong to the parent document. I can't seem to figure out a way to actually use the BitDocSet (or its derivatives) to quickly eliminate document IDs, yet it seems like this is how it's supposed to be used. What am I getting wrong?

Sorry if this is a newbie question; I've never written a PostFilter before, and frankly, the documentation out there is a little sketchy (mostly for version 4). So many classes have changed names, and so many of the better-documented techniques are deprecated or removed now, that it's tough to follow what the current best practice actually is. I'm using the block join functionality heavily, so I'm trying to stay more current than that. I would be happy to send along the full source privately if it would help figure this out, and I plan to write up some more elaborate instructions (updated for Solr 5) for the next person who decides to write a PostFilter and work with block joins, if I ever manage to get this performing well enough.

Thanks for any pointers! Totally open to doing this an entirely different way. I read DocValues might be a more elegant approach, but currently that would require reindexing, so I'm trying to avoid that.

Also, I've been wondering whether the query above would read from the filter cache or not. The query is constructed like this:

    private Term inStockTrueTerm = new Term("sku_history.is_in_stock", "T");
    private Term objectTypeSkuHistoryTerm = new Term("object_type", "sku_history");
    ...
    inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
    objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
    inStockSkusQuery = new BooleanQuery();
    inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
    inStockSkusQuery.add(objectTypeSkuHistoryTermQuery, BooleanClause.Occur.MUST);

--
Steve
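One thing worth checking here (an assumption on my part, not something established in the thread): a DocSet obtained from SolrIndexSearcher.getDocSet holds top-level (whole-index) document ids, while a collector's collect(int doc) receives segment-relative ids, so a lookup against the bit set needs the current leaf's docBase added first. Mixing the two id spaces produces exactly this kind of partially-wrong filtering. A self-contained sketch of the offset arithmetic with java.util.BitSet standing in for the FixedBitSet:

```java
import java.util.BitSet;

public class DocBaseSketch {
    public static void main(String[] args) {
        // Pretend index: two segments of 100 docs each; global ids run 0..199.
        int[] docBase = { 0, 100 };
        BitSet globalMatches = new BitSet(200);
        globalMatches.set(150); // a match living in segment 1, local id 50

        int segment = 1, localDoc = 50;

        // Wrong: testing the segment-local id against a global bitset misses the match.
        System.out.println(globalMatches.get(localDoc));                    // prints false
        // Right: offset by the segment's docBase before the lookup.
        System.out.println(globalMatches.get(docBase[segment] + localDoc)); // prints true
    }
}
```

In a real PostFilter the docBase would come from the LeafReaderContext handed to the delegating collector when each segment starts; the hypothetical numbers above only exist to show the offset.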
Collection APIs to create collection and custom cores naming
How do I use 'property.name=value' in the API example [1] to set the core.properties value of 'name'?

While creating the collection with the query below [2], the core names become 'aggregator_shard1_replica1' and 'aggregator_shard2_replica1'. I wanted a specific/custom name for each of these cores. I tried passing the params as property.name=name&name=aggregator_s1, but it did not work. Editing the core.properties key/value pair name=aggregator_s1 after the collection is created works, but I was looking for a way to set this property with the create request itself.

[2] http://example.com:8983/solr/admin/collections?action=CREATE&name=aggregator&numShards=1&replicationFactor=2&maxShardsPerNode=1&collection.configName=aggregator_config&property.name=name&name=aggregator_s1

[1] https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1
Re: solr multicore vs sharding vs 1 big collection
Shawn,

Thanks for the feedback. I agree that increasing the timeout might alleviate the timeout issue. The main problem with increasing the timeout is the detrimental effect it would have on the user experience, so I can't increase it. I have looked at the queries that threw errors; the next time I try them, everything seems to work fine, so I am not sure how to reproduce the error. My concern with increasing the memory to 32GB is what happens when the index size grows over the next few months.

One of the other solutions I have been thinking about is to rebuild the index weekly, create a new collection, and use that. Are there any good references for doing that?

Thanks
Jay

On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey apa...@elyograg.org wrote:

<snip>

--
Thanks
Jay Potharaju
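The rebuild-then-swap pattern Jay asks about is usually done with collection aliases: queries always go to a fixed alias, a fresh collection is built each week, and the Collections API's CREATEALIAS action atomically repoints the alias once indexing finishes. A sketch of the requests involved (host and collection names below are hypothetical):

```java
public class AliasSwapSketch {
    static String collectionsApi(String host, String params) {
        return host + "/solr/admin/collections?" + params;
    }

    public static void main(String[] args) {
        String host = "http://localhost:8983"; // hypothetical SolrCloud node

        // 1. Build this week's collection alongside the live one.
        String create = collectionsApi(host,
            "action=CREATE&name=products_20150809&numShards=2&replicationFactor=2"
            + "&collection.configName=products_config");
        // 2. Index into products_20150809, then atomically repoint the query alias.
        String swap = collectionsApi(host,
            "action=CREATEALIAS&name=products&collections=products_20150809");
        // 3. Once traffic on the alias is confirmed healthy, drop last week's collection.
        String cleanup = collectionsApi(host, "action=DELETE&name=products_20150802");

        System.out.println(create + "\n" + swap + "\n" + cleanup);
    }
}
```

Because clients only ever see the alias, the swap needs no client changes, and a bad rebuild can be rolled back by pointing the alias at the previous collection before deleting it.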