RE: Using Multiple collections with streaming expressions
Many thanks for the info Joel --ufuk
Re: Using Multiple collections with streaming expressions
The multiple collection syntax has been implemented for only a few stream sources: search, timeseries, facet and stats. Eventually it will be implemented for all stream sources.

Joel Bernstein
http://joelsolr.blogspot.com/
RE: Using Multiple collections with streaming expressions
Thanks again Erick, that’s a good idea!

Alternatively, I use an alias covering multiple collections in these situations, but there may be too many combinations of collections, so it’s not always suitable.

Merged significantTerms streams will have meaningless scores in their tuples, I think; it would be comparing apples and oranges. But in this case I’m only interested in getting the foreground counts, so that’s another day’s problem.

What seemed strange to me was that the source code for the streams appeared to be handling this case.
Re: Using Multiple collections with streaming expressions
You need to open multiple streams, one to each collection, then combine them. For instance, open a significantTerms stream to collection1, another to collection2, and wrap both in a merge stream.

Best,
Erick
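A sketch of that shape, reusing the significantTerms parameters from the original question (the `on` ordering is an assumption; merge requires a sort order that both substreams share, so you may need to wrap each source in a sort() first):

```
merge(
  significantTerms(collection1, q="body:Solr", field="author",
                   limit="50", minDocFreq="10", maxDocFreq=".20", minTermLength="5"),
  significantTerms(collection2, q="body:Solr", field="author",
                   limit="50", minDocFreq="10", maxDocFreq=".20", minTermLength="5"),
  on="term asc"
)
```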
Using Multiple collections with streaming expressions
For example, the streaming expression significantTerms:

https://lucene.apache.org/solr/guide/8_4/stream-source-reference.html#significantterms

significantTerms(collection1,
    q="body:Solr",
    field="author",
    limit="50",
    minDocFreq="10",
    maxDocFreq=".20",
    minTermLength="5")

Solr supports querying multiple collections at once, but I can’t figure out how to do that with streaming expressions. When I try enclosing them in quotes, like:

significantTerms("collection1, collection2",
    q="body:Solr",
    field="author",
    limit="50",
    minDocFreq="10",
    maxDocFreq=".20",
    minTermLength="5")

it gives the error: "EXCEPTION":"java.io.IOException: Slices not found for \"collection1, collection2\"". I think Solr treats the quotes as part of the collection names, hence it can’t find slices for them.

When I just use it without quotes:

significantTerms(collection1, collection2, …

it gives the error: "EXCEPTION":"invalid expression significantTerms(collection1, collection2, …

I tried single quotes and escaping the quotation marks, but nothing works.

Any ideas?

Best, ufuk
fetch streaming expression multiple collections problem
Hello all,

When I try to use the "search" streaming expression with multiple collections, it works without any problems, like:

search(
    "collection1,collection2",
    q="*:*",
    fl="field1,field2",
    qt="/export",
    sort="field1 desc"
)

but when I try to use the "fetch" expression similarly:

fetch("collection1,collection2"

it gives me an error saying:

"EXCEPTION": "java.io.IOException: Slices not found for \"collection1,collection2\""

When I use it without quotes that problem is resolved, but another one arises:

fetch(collection1,collection2

fetches fields only from collection1 and returns empty for documents residing in collection2.

I took a look at the source code of the fetch and search expressions; they both get the collection parameter exactly the same way, using:

String collectionName = factory.getValueOperand(expression, 0);

I'm lost. When I use an alias in place of the multiple collections it works as desired, but we have many collections and queries are generated dynamically, so we would need many combinations of aliases.

Need help. Regards

-- uyilmaz
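Since fetch() behaves with a single alias, one workaround for dynamically generated queries is to create a short-lived alias per combination through the Collections API before running the expression. A minimal sketch of building that request (the host, the alias naming scheme, and the collection names are illustrative assumptions):

```python
from urllib.parse import urlencode

def create_alias_url(solr_base, alias, collections):
    """Build a Collections API CREATEALIAS URL for an ad-hoc
    combination of collections (e.g. one alias per dynamic query)."""
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{solr_base}/admin/collections?{params}"

url = create_alias_url("http://localhost:8983/solr", "combo_c1_c2",
                       ["collection1", "collection2"])
print(url)
```

The alias can be deleted with `action=DELETEALIAS` once the expression has run.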
Re: Multiple Collections in a Alias.
There may be other ways, but the easiest is to write a script that gets the cluster status; for each replica of each collection you will have details like:

"collections":{
  "collection1":{
    "pullReplicas":"0",
    "replicationFactor":"1",
    "shards":{
      "shard1":{
        "range":"8000-8ccb",
        "state":"active",
        "replicas":{"core_node33":{
          "core":"collection1_shard1_replica_n30",
          "base_url":"http://host:port/solr",
          "node_name":"host:port",
          "state":"active",
          "type":"NRT",
          "force_set_state":"false",
          "leader":"true"}}},

For each replica of each shard, make a localized call for the number of records:

base_url/core/select?q=*:*&shards=shardX&distrib=false&rows=0

If you have replicas that disagree with each other on the number of records per shard, then you have an issue with replicas not being in sync for a collection. This is what I meant when I said "replicas out of sync".

Your situation was actually very simple :) one of your collections has less data. You seem to have a sync requirement between collections, which is interesting, but that's beyond Solr. Your inter-collection sync script most likely needs some debugging :)
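The per-replica check described above can be scripted: read CLUSTERSTATUS, query each replica's core with distrib=false&rows=0, and compare the counts per shard. A sketch of the comparison step, with the HTTP calls left out and made-up counts:

```python
def out_of_sync_shards(counts):
    """counts maps shard -> {replica_core: numFound from a
    distrib=false query}.  Return the shards whose replicas
    disagree on the number of records."""
    return sorted(shard for shard, replicas in counts.items()
                  if len(set(replicas.values())) > 1)

counts = {"shard1": {"core_node33": 1000, "core_node34": 1000},
          "shard2": {"core_node35": 500, "core_node36": 498}}
print(out_of_sync_shards(counts))  # → ['shard2']
```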
Re: Multiple Collections in a Alias.
Glad you nailed the out-of-sync one :)
Re: Multiple Collections in a Alias.
I found the root cause. I have 3 collections assigned to an alias, and one of them is NOT synched by the alias.
Re: Multiple Collections in a Alias.
Different absolute scores from different collections are OK, because the exact values depend on the number of deleted documents. For the set of documents that are in different orders from different collections, are the scores of that set identical? If they are, then it is normal to have a different order from different collections.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
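Walter's test can be made mechanical: two per-collection rankings are consistent if they contain the same (id, score) pairs and differ in order only where scores tie. A small sketch with invented ids and scores:

```python
def differs_only_by_ties(a, b):
    """a and b are ranked result lists of (doc_id, score) tuples.
    True when both hold the same (doc_id, score) pairs and every
    position where the doc ids differ is a score tie."""
    return (sorted(a) == sorted(b) and
            all(x[1] == y[1] for x, y in zip(a, b) if x[0] != y[0]))

a = [("d1", 3.2), ("d2", 2.5), ("d3", 2.5)]
b = [("d1", 3.2), ("d3", 2.5), ("d2", 2.5)]
print(differs_only_by_ties(a, b))  # → True
```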
Re: Multiple Collections in a Alias.
Good question. How can I validate if the replicas are all synched?
Re: Multiple Collections in a Alias.
numFound is the same, but the scores are different.
Re: Multiple Collections in a Alias.
Try a simple test of querying each collection 5 times in a row; if numFound differs for a single collection within those 5 calls then you have it. Please try it, because what you may think is sync'd may actually not be. How do you validate correct sync?
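That repeat-query test is easy to script: collect numFound from 5 identical queries per collection and flag any collection whose counts vary (the collection names and counts below are made up):

```python
def unstable_collections(samples):
    """samples maps collection -> [numFound from 5 repeated identical
    queries].  Return the collections whose counts varied across the
    repeats -- a sign that replicas behind them are out of sync."""
    return sorted(c for c, counts in samples.items()
                  if len(set(counts)) > 1)

samples = {"col1": [100, 100, 100, 100, 100],
           "col2": [100, 98, 100, 98, 100]}
print(unstable_collections(samples))  # → ['col2']
```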
Re: Multiple Collections in a Alias.
Are the scores the same for the documents that are ordered differently?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Re: Multiple Collections in a Alias.
The replicas are all synched, and there were no updates while I was testing.
Re: Multiple Collections in a Alias.
Most likely you have 1 or more collections behind the alias that have replicas out of sync :)

Try querying each collection to find the one out of sync.
Multiple Collections in a Alias.
I have 10 collections in a single alias, and I am getting different result sets every time with the same query.

Is this by design, or am I missing something?

The configuration and schema for all 10 collections are identical.

Thanks,
Jae
Re: Reload synonyms without reloading the multiple collections
Sorry, I see that it may have been confusing. My webapp calls the reload of all the affected Collections (about a dozen of them) sequentially, using the Collections API.

Ideally I would be able to write some QueryTimeSynonymFilterFactory that would, periodically or when told, reload the synonyms file from ZK, which is the file the system edits when a user changes some synonyms. I understand that a Collection needs to be reloaded if the synonyms were used at indexing time, but this is not my case.

The managed API is in the same situation; basically it does what I am doing on my own right now. In the end, there has to be a reload of the affected Collections.

Regards,
Simón
Re: Reload synonyms without reloading the multiple collections
On 12/29/2018 5:55 AM, Simón de Frosterus Pokrzywnicki wrote:
> The problem is that when the user changes the synonyms, it automatically
> triggers a sequential reload of all the Collections.

What exactly is being done when you say "the user changes the synonyms"? Just uploading a new synonyms definition file to ZooKeeper would *NOT* result in a reload of *ANY* collection. As far as I am aware, collection reloads only happen when they are explicitly requested. Usage of the managed APIs to change aspects of the schema could cause a reload, but it's only going to happen on the collection where the API is used, not all collections.

Basically, I cannot imagine any situation that would cause a reload of all collections, other than explicitly asking Solr to do those reloads.

Thanks,
Shawn
Reload synonyms without reloading the multiple collections
Hello,

I have a SolrCloud setup with multiple Collections based on the same configset. One of the features I have is that the user can define their own synonyms in order to improve their search experience, which has worked fine until recently. Lately the platform has grown, and the user has several dozen Collections, most of them with 200k or more documents of non-trivial size.

The problem is that when the user changes the synonyms, it automatically triggers a sequential reload of all the Collections. This is now always causing problems, to the point where the platform becomes unstable and may need a restart of Solr, which means we have to access the platform and manually stabilize it.

The synonyms are only used at query time, so there is no need to reindex anything, and it seems like overkill to reload the Collections just to change the synonyms.

I have tried creating my own CustomSynonymGraphFilter and having it call the loadSynonyms() method as needed, but this causes some weird behavior where queries sometimes have the newly added synonyms working fine, but sometimes not. I get the impression that there may be N "threads" handling the queries but I only change the SynonymMap for one of them, so when the query hits the right "thread" it works, but in most cases it does not.

My custom fieldType looks like this:

I would like to know if there is some other class I can redefine to make sure the new SynonymMap is used in all cases.

Thanks,
Simón

PS: I have upgraded to Solr 7.6.
Re: Query to multiple collections
Hi,

This was kind of one of the problems I was facing recently. In my use case I am supposed to show collated spellcheck suggestions from two different collections. To mention it as well: both collections use the same schema, but they need to be segregated because of the business purposes they serve. I considered the aliasing approach too, but was a little unsure whether it would work for me.

Weirdly, the standard select URL itself is trouble for me, and I run into the following exception in my browser:

http://:8983/solr/products.1,products.3/select?q=*:*

{
  "responseHeader": {
    "zkConnected": true,
    "status": 500,
    "QTime": 24,
    "params": {
      "q": "*:*"
    }
  },
  "error": {
    "trace": "java.lang.NullPointerException
      at org.apache.solr.handler.component.QueryComponent.unmarshalSortValues(QueryComponent.java:1034)
      at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:885)
      at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:585)
      at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:564)
      at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:423)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
      at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
      at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
      ... (Jetty request-handling frames omitted) ...
      at java.lang.Thread.run(Thread.java:748)",
    "code": 500
  }
}

I would really appreciate it if someone could tell me what might be happening.

Thanks,
Atita
Re: Query to multiple collections
Thanks Shawn for the update. I am going ahead with the standard aliases approach; it suits my use case.

Regards,
Rohan Kasat
Re: Query to multiple collections
On 10/22/2018 1:26 PM, Chris Ulicny wrote:
> There weren't any particular problems we ran into, since the client that
> makes the queries to multiple collections previously would query multiple
> cores using the 'shards' parameter before we moved to SolrCloud. We didn't
> have any complicated sorting or scoring requirements, fortunately.
>
> The one thing I remember looking into was what Solr would do when two
> documents with the same id were found in both collections. I believe it
> just non-deterministically picked one, probably the one that came in first
> or last.

Yes, that is how it works. I do not know whether it is the first one to respond or the last one to respond that ends up in the results. Solr is designed to work with data where the uniqueKey field really is unique across everything that is being queried. Results can vary when you have the same uniqueKey value in more than one place and you query both of them at once.

> Depending on how many collections you need to query simultaneously, it's
> worth looking into using aliases for lists of collections as Alex
> mentioned.
>
> Unfortunately, in our use case, it wasn't worth the headache of managing
> aliases for every possible combination of collections that needed to be
> queried, but we would have preferred to use aliases.

Aliases are the cleanest option. This syntax also works; sorta blew my mind when somebody told me about it:

http://host:port/solr/current,archive2,archive4/select?q=*:*

If you're using a Solr client library, it might not be possible to control the URL like that, but if you're building URLs yourself, you could use it.

I recently filed an issue related to alias handling, some unexpected behavior:

https://issues.apache.org/jira/browse/SOLR-12849

Thanks,
Shawn
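If the same uniqueKey can occur in more than one of the collections you query together, you can make the outcome deterministic by deduplicating on the client side, since which copy Solr itself returns is unspecified. A sketch (field names are illustrative):

```python
def dedupe_by_unique_key(docs, key="id"):
    """Merge docs returned from several collections, keeping the first
    doc seen for each uniqueKey value so the surviving copy is
    predictable on the client side."""
    seen, merged = set(), []
    for doc in docs:
        if doc[key] not in seen:
            seen.add(doc[key])
            merged.append(doc)
    return merged

docs = [{"id": "1", "src": "current"}, {"id": "2", "src": "current"},
        {"id": "1", "src": "archive2"}]
merged = dedupe_by_unique_key(docs)
print(merged)
```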
Re: Query to multiple collections
Thanks Chris. This helps.

Regards,
Rohan
Re: Query to multiple collections
There weren't any particular problems we ran into, since the client that makes the queries to multiple collections previously queried multiple cores using the 'shards' parameter before we moved to SolrCloud. We didn't have any complicated sorting or scoring requirements, fortunately.

The one thing I remember looking into was what Solr would do when two documents with the same id were found in both collections. I believe it just non-deterministically picked one, probably the one that came in first or last.

Depending on how many collections you need to query simultaneously, it's worth looking into using aliases for lists of collections as Alex mentioned.

Unfortunately, in our use case, it wasn't worth the headache of managing aliases for every possible combination of collections that needed to be queried, but we would have preferred to use aliases.
Re: Query to multiple collections
Thanks Alex. I checked aliases but didn't focus on them much; I will try to relate them to my use case and have another look. I guess specifying the collection in the query should be useful.

Regards,
Rohan Kasat
Re: Query to multiple collections
Thanks Chris for the update. I was thinking along the same lines; I just wanted to check whether you had faced any specific issues.

Regards,
Rohan Kasat
Re: Query to multiple collections
Have you tried using aliases?

http://lucene.apache.org/solr/guide/7_5/collections-api.html#collections-api

You can also, I think, specify a set of shards/collections directly in the query, but there may be edge cases with that (not sure).

Regards,
Alex.
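For reference, an alias covering several collections is created with the Collections API's CREATEALIAS action, as documented at the link above. A minimal sketch of composing that request URL (host, alias name, and collection names are placeholders):

```python
from urllib.parse import urlencode

def create_alias_url(base, alias, collections):
    """Compose a Collections API CREATEALIAS request.

    Queries against the alias then span every member collection.
    """
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{base}/solr/admin/collections?{urlencode(params)}"
```

As noted later in the thread, maintaining an alias per possible combination of collections may not scale if the combinations are many.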
Re: Query to multiple collections
Rohan,

I do not remember where I came across it or what restrictions exist on it, but it works for our use case of querying multiple archived collections with identical schemas in the same SolrCloud cluster. The queries have the following form:

http://host:port/solr/current/select?collection=current,archive2,archive4&q=...

It seems like it might work for your use case, but you might need to tread carefully depending on your requirements for the returned results. Sorting and duplicate unique keys come to mind.

Best,
Chris
Query to multiple collections
Hi all,

I have a SolrCloud setup with multiple collections. I created, say, two collections because their data sources are different and I therefore wanted to store them separately. There is a use case where I need to query both collections and show unified search results. The fields in the schema are the same (say: title, description, date).

Is there a specific way I can do this directly, with the Collections API or something like that? Or do I need to write a federator that queries the respective collections and then unifies the results?

--
*Regards, Rohan*
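If a client-side federator turns out to be necessary, the core of it is a merge-and-dedup step over the per-collection result lists. This is a naive sketch with hypothetical document dicts; note the earlier caveat that relevance scores from different collections are not directly comparable, so sorting on a shared field such as date is usually safer.

```python
def federate(results_a, results_b, key="date", rows=10):
    """Merge two collections' result lists into one unified page.

    Duplicate ids are dropped (first occurrence after sorting wins),
    then the merged list is re-sorted descending on a shared field.
    """
    seen, merged = set(), []
    for doc in sorted(results_a + results_b, key=lambda d: d[key], reverse=True):
        if doc["id"] not in seen:
            seen.add(doc["id"])
            merged.append(doc)
    return merged[:rows]
```

For deep paging across collections this naive approach over-fetches: each collection must return `rows` candidates so the merged top `rows` is correct.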
Re: Multiple collections for a write-alias
We are actually very close to doing what Shawn has suggested. Emir has a good point about new collections failing on deletes/updates of older documents which are not present in the new collection. But even if this feature can only be implemented for an append-only log, it would make a good feature IMO.

The use case for re-indexing everything is generally an attribute change, like enabling "indexed" or "docValues" on a field, or adding a new field to the schema. While the reading client code sits behind a flag until it can start using the new attribute/field, we have to re-index all the data without stopping reads in the older format.

Currently, we have to do dual writes to the new collections or play catch-up-after-a-bootstrap. Note that catch-up-after-a-bootstrap is not easy either (it is very similar to the process described by Shawn). If this special place is Kafka or some table in the DB, then we have to do dual writes to the regular source of truth and to this special place. Dual writes to the DB and Kafka are transaction-less (and thus lack consistency), while dual writes to the DB increase the load on the DB. Having created_date/modified_date fields and querying the DB to find live-traffic documents has its own problems and taxes the DB again.

Dual writes directly to multiple Solr collections are the simplest thing for a client to implement, and that is exactly what this new feature could be. With a dual-write collection alias, the client need not implement any of the above if the alias does the following:

- Deletes of documents missing from the new collection are simply ignored.
- Incremental (atomic) updates throw an error, as unsupported on a multi-write collection alias.
- Regular updates (i.e., delete-then-insert) work fine, because they treat the document as a brand-new one, and versioning strategies can take care of out-of-order updates.
SG
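The three bullet-point semantics proposed above can be made concrete with a small in-memory model. This is not a real Solr feature; the class below only illustrates, under the stated assumptions, how a dual-write alias would behave for deletes, atomic updates, and full-document updates.

```python
class DualWriteAlias:
    """Model of the proposed dual-write alias semantics (hypothetical).

    Each member collection is modeled as a dict mapping id -> document.
    """

    def __init__(self, *collections):
        self.collections = collections

    def add(self, doc):
        # Delete-then-insert semantics: the doc overwrites in every member.
        for c in self.collections:
            c[doc["id"]] = doc

    def delete(self, doc_id):
        # An id missing from a (newer, partially built) member is ignored.
        for c in self.collections:
            c.pop(doc_id, None)

    def atomic_update(self, doc_id, fields):
        # Incremental updates cannot be applied consistently across members.
        raise NotImplementedError(
            "atomic updates unsupported on a dual-write alias")
```

Under these rules, out-of-order full updates are left to a versioning strategy, exactly as the proposal sketches.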
Re: Multiple collections for a write-alias
This approach could work only for an append-only index. If you have updates/deletes, you have to process them in order; otherwise you will get incorrect results. I suspect that is one of the reasons it is not supported: not too useful.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
Re: Multiple collections for a write-alias
On 11/9/2017 11:09 AM, S G wrote:
> However, re-ingestion takes several hours to complete and during that time,
> the customer has to write to both the collections - previous collection and
> the one being bootstrapped.
> This dual-write is harder to do from the client side (because client needs
> to have a retry logic to ensure any update does not succeed in one
> collection and fails in another - consistency problem) and it would be a
> real welcome addition if collection aliasing can support this.

Let me explain how I handle this situation. I'm not running in cloud mode, but I use the "swap" feature of CoreAdmin to do much the same thing you're describing with collection aliases.

My source data (a MySQL database) has a way to track the last new document that was added, as well as which deletes have been applied and which documents need to be reinserted. I use these pointers to decide what data to retrieve on each indexing cycle, and then I update them to new positions when the indexing cycle completes successfully.

When I do a full rebuild, I grab the current positions for new docs, deletes, and reinserts, and store that information in a special place. Then I start building indexes in the "build" cores. In the meantime, I continue to update all the "live" cores, so users are unaware that anything special is happening.

When the rebuild finishes (which can take a day or more), I go to that special place where I stored all the position information and proceed to run a "catchup" indexing process on the build cores: all the deletes, new documents, and reinserts that happened since the full rebuild started. When that completes, I swap the build cores with the live cores and resume normal operation.

Doing it this way, I do not need to worry about the normal indexing cycle handling writes to both the old index and the new index; the ongoing cycle just updates the current live cores.
> Proposal:
> If can enhance the write alias to point to multiple collections such that
> any update to the alias is written to all the collections it points to, it
> would help the client to avoid dual writes and also issue just a single
> http call from the client instead of multiple. It would also reduce the
> retry logic inside the client code used to keep the collections consistent.

Imagine an index with time-series data, where there is an alias called "today" that includes up to 24 hourly collections. If you were to write to that alias with the idea you've proposed, the data would end up in the wrong places and would in fact be incorrectly duplicated many times. The way it currently works, writes only go to the FIRST collection in the alias, which can be arranged to always be the "current" collection.

Your proposal is an interesting idea, but it would require some development work. Errors during indexing could be a major source of headaches, especially errors that don't affect all collections in the alias equally. So as not to change how users expect Solr to work currently, aliases would need a special flag indicating that writes *should* be duplicated to all collections in the alias, or maybe there would need to be two different kinds of aliases.

Since such a feature is probably not going to happen quickly even if it is something we agree to work on, would you be able to use something like the method I outlined above?

Thanks,
Shawn
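The snapshot-rebuild-catchup-swap procedure described above can be sketched end to end. Everything here is an in-memory stand-in: the change log plays the role of the MySQL position tracking, and the dicts play the role of the build and live cores.

```python
class RebuildCoordinator:
    """Sketch of a swap-based full rebuild with a catch-up pass.

    changelog is an ordered stream of documents; positions into it stand
    in for the stored new-doc/delete/reinsert pointers described above.
    """

    def __init__(self):
        self.live, self.build = {}, {}
        self.changelog = []

    def index_into(self, core, docs):
        for doc in docs:
            core[doc["id"]] = doc

    def full_rebuild(self, updates_during_rebuild):
        snapshot = len(self.changelog)                  # store positions first
        self.index_into(self.build, self.changelog[:snapshot])  # long rebuild
        # Meanwhile the live core keeps receiving the ongoing updates...
        self.changelog.extend(updates_during_rebuild)
        self.index_into(self.live, updates_during_rebuild)
        # Catch-up: replay everything logged since the snapshot into build.
        self.index_into(self.build, self.changelog[snapshot:])
        # Swap build and live; users never saw a partially built index.
        self.live, self.build = self.build, {}
```

The key property is that the ordinary indexing cycle only ever touches the live core; the build core is reconciled once, just before the swap.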
Re: Multiple collections for a write-alias
Aliases can already point to multiple collections; have you tried that? I'm not totally sure what the behavior would be, but nothing you've written indicates you tried, so I thought I'd point it out.

It's not clear to me how useful this is, though, or what failure messages are returned. Or how you'd figure out which collection failed. Or how you'd take remedial action.

Best,
Erick
Multiple collections for a write-alias
Hi,

We have a use case: re-create a Solr collection by re-ingesting everything, without tolerating downtime while that happens.

We use the collection alias feature to point to the new collection once it has been fully re-ingested.

However, re-ingestion takes several hours to complete, and during that time the customer has to write to both collections: the previous collection and the one being bootstrapped. This dual write is harder to do from the client side (the client needs retry logic to ensure no update succeeds in one collection but fails in the other: a consistency problem), and it would be a real welcome addition if collection aliasing could support this.

Proposal: if we can enhance the write alias to point to multiple collections, such that any update to the alias is written to all the collections it points to, it would help the client avoid dual writes and issue just a single HTTP call instead of multiple. It would also reduce the retry logic inside the client code used to keep the collections consistent.

Thanks,
SG
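The client-side retry logic the proposal wants to eliminate looks roughly like this. It is a sketch only: `clients` is a hypothetical mapping from collection name to an indexing callable, standing in for whatever Solr client library is in use.

```python
def dual_write(doc, clients, retries=3):
    """Write the same doc to every collection, retrying per collection.

    Returns a dict of collection -> final exception; an empty dict means
    the collections stayed consistent. A non-empty dict is the consistency
    problem the proposal describes: the doc landed in some collections
    but not others.
    """
    failed = {}
    for name, send in clients.items():
        for attempt in range(retries):
            try:
                send(doc)
                break
            except Exception as exc:
                if attempt == retries - 1:
                    failed[name] = exc
    return failed
```

A server-side multi-collection write alias would collapse this into one HTTP call and move the partial-failure handling into Solr.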
Re: Multiple collections vs multiple shards for multitenancy
Well, it's not either/or. And you haven't said how many tenants we're talking about here. Solr startup times for a single _instance_ of Solr when there are thousands of collections can be slow. But note what I am talking about here: A single Solr on a single node where there are hundreds and hundreds of collections (or replicas for that matter). I know of very large installations with 100s of thousands of _replicas_ that run. Admittedly with a lot of care and feeding... Sharding a single large collection and using custom routing to push tenants to a single shard will be an administrative problem for you. I'm assuming you have the typical multi-tenant problems, a bunch of tenants have around N docs, some smaller percentage have 3N and a few have 100N. Now you're having to keep track of how many docs are on each shard, do the routing yourself, etc. Plus you can't commit individually, a commit on one will _still_ commit on all so you're right back where you started. I've seen people use a hybrid approach: experiment with how many _documents_ you can have in a collection (however you partition that up) and use the multi-tenant approach. So you have N collections and each collection has a (varying) number of tenants. This also tends to flatten out the update process on the assumption that your smaller tenants also don't update their data as often. However, I really have to question one of your basic statements: "This works fine with aggressive autowarming, but I have a need to reduce my NRT search capabilities to seconds as opposed to the minutes it is at now,"... The implication here is that your autowarming takes minutes. Very often people severely overdo the warmup by setting their autowarm counts to 100s or 1000s. This is rarely necessary, especially if you use docValues fields appropriately. Very often much of autowarming is "uninverting" fields (look in your Solr log). Essentially for any field you see this, use docValues and loading will be much faster. 
You also haven't said how many documents you have in a shard at present. This is actually the metric I use most often to size hardware. I claim you can find a sweet spot where minimal autowarming will give you good enough performance, and that number is what you should design to. Of course YMMV.

Finally: push back really hard on how aggressive NRT support needs to be. Often "requirements" like this are made without much thought, as in "faster is better, let's make it 1 second!". There are situations where that's true, but it comes at a cost. Users may be better served by a predictable but fast system than one that's fast but unpredictable. "Documents may take up to 5 minutes to appear and searches will usually take less than a second" is nice and concise; I have my expectations. "Documents are searchable in 1 second, but the results may not come back for between 1 and 10 seconds" is much more frustrating.

FWIW,
Erick

On Sat, May 6, 2017 at 5:12 AM, Chris Troullis <cptroul...@gmail.com> wrote:
> Hi,
>
> I use Solr to serve multiple tenants and currently all tenants' data
> resides in one large collection, and queries have a tenant identifier. This
> works fine with aggressive autowarming, but I have a need to reduce my NRT
> search capabilities to seconds as opposed to the minutes it is at now,
> which will mean drastically reducing if not eliminating my autowarming. As
> such I am considering splitting my index out by tenant so that when one
> tenant modifies their data it doesn't blow away all of the searcher-based
> caches for all tenants on soft commit.
>
> I have done a lot of research on the subject and it seems like SolrCloud
> can have problems handling large numbers of collections. I'm obviously
> going to have to run some tests to see how it performs, but my main
> question is this: are there pros and cons to splitting the index into
> multiple collections vs having 1 collection but splitting into multiple
> shards?
> In my case I would have a shard per tenant and use implicit routing
> to route to that specific shard. As I understand it a shard is basically
> its own Lucene index, so I would still be eating that overhead with either
> approach. What I don't know is if there are any other overheads involved
> WRT collections vs shards, routing, zookeeper, etc.
>
> Thanks,
>
> Chris
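The hybrid approach Erick describes upthread — N collections, each holding a varying number of tenants, with a cap on documents per collection — is essentially a bin-packing problem. A rough sketch in Python; the tenant names, document counts, and the per-collection cap are all hypothetical:

```python
def pack_tenants(tenant_docs, max_docs_per_collection):
    """Greedily assign tenants (largest first) to collection buckets,
    keeping each bucket's total document count under the cap."""
    buckets = []  # each bucket: [doc_count, [tenant, ...]]
    for tenant, docs in sorted(tenant_docs.items(), key=lambda kv: -kv[1]):
        for bucket in buckets:
            if bucket[0] + docs <= max_docs_per_collection:
                bucket[0] += docs
                bucket[1].append(tenant)
                break
        else:
            # no existing collection has room: start a new one
            buckets.append([docs, [tenant]])
    return [b[1] for b in buckets]
```

This also illustrates Erick's point that the big tenants (100N) dominate: they end up alone in their own collections, while the many small tenants share.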
Multiple collections vs multiple shards for multitenancy
Hi,

I use Solr to serve multiple tenants and currently all tenants' data resides in one large collection, and queries have a tenant identifier. This works fine with aggressive autowarming, but I have a need to reduce my NRT search capabilities to seconds as opposed to the minutes it is at now, which will mean drastically reducing if not eliminating my autowarming. As such I am considering splitting my index out by tenant so that when one tenant modifies their data it doesn't blow away all of the searcher-based caches for all tenants on soft commit.

I have done a lot of research on the subject and it seems like SolrCloud can have problems handling large numbers of collections. I'm obviously going to have to run some tests to see how it performs, but my main question is this: are there pros and cons to splitting the index into multiple collections vs having 1 collection but splitting into multiple shards? In my case I would have a shard per tenant and use implicit routing to route to that specific shard. As I understand it a shard is basically its own Lucene index, so I would still be eating that overhead with either approach. What I don't know is if there are any other overheads involved WRT collections vs shards, routing, zookeeper, etc.

Thanks,
Chris
Re: Is SolrCloudClient Singleton Pattern possible with multiple collections?
The thing is that back in Solr 4.8, when I was using Solr standalone and I had to make a distributed query among multiple shards, I found that for each shard in the param "shards" it makes a request (which is the correct behaviour, I know), but when I put just one shard in the "shards" param it makes two identical requests. So, now that I'm using SolrCloud, I replaced "shards" with the "collection" param and I was wondering if it would have the same erratic behaviour. Now I tried and I found that it has the correct behaviour. Thanks, and sorry for asking before testing it.

2016-07-14 15:26 GMT-03:00 Erick Erickson <erickerick...@gmail.com>:
> bq: if using the param "collection" is the same
>
> Did you just try it? If so what happened?
>
> Not sure what you're asking here. It's the name of the
> collection you want to query against. It's only
> necessary when you want to go against a
> collection that isn't the default which you can set with
> setDefaultCollection()
>
> Best,
> Erick
>
> On Thu, Jul 14, 2016 at 10:51 AM, Pablo Anzorena
> <anzorena.f...@gmail.com> wrote:
> > I was using
> > public QueryResponse query(ModifiableSolrParams params, METHOD method)
> >
> > And my actual code is parsing that object. I can change it to your method,
> > but before that let me ask you if using the param "collection" is the same.
> >
> > Actually I am using the param "collection" only when I need to request to
> > multiple collections.
> >
> > Thanks.
> >
> > 2016-07-14 14:15 GMT-03:00 Erick Erickson <erickerick...@gmail.com>:
> >
> >> Just use the
> >>
> >> public NamedList request(SolrRequest request, String collection)
> >>
> >> method on the SolrCloudClient?
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Jul 14, 2016 at 9:18 AM, Pablo Anzorena <anzorena.f...@gmail.com>
> >> wrote:
> >> > Hey,
> >> > So the question is quite simple, Is it possible to use Singleton Pattern
> >> > with SolrCloudClient instantiation and then reuse that instance to handle
> >> > multiple requests concurrently accessing different collections?
> >> >
> >> > Thanks.
Re: Is SolrCloudClient Singleton Pattern possible with multiple collections?
bq: if using the param "collection" is the same

Did you just try it? If so what happened?

Not sure what you're asking here. It's the name of the collection you want to query against. It's only necessary when you want to go against a collection that isn't the default, which you can set with setDefaultCollection().

Best,
Erick

On Thu, Jul 14, 2016 at 10:51 AM, Pablo Anzorena <anzorena.f...@gmail.com> wrote:
> I was using
> public QueryResponse query(ModifiableSolrParams params, METHOD method)
>
> And my actual code is parsing that object. I can change it to your method,
> but before that let me ask you if using the param "collection" is the same.
>
> Actually I am using the param "collection" only when I need to request to
> multiple collections.
>
> Thanks.
>
> 2016-07-14 14:15 GMT-03:00 Erick Erickson <erickerick...@gmail.com>:
>
>> Just use the
>>
>> public NamedList request(SolrRequest request, String collection)
>>
>> method on the SolrCloudClient?
>>
>> Best,
>> Erick
>>
>> On Thu, Jul 14, 2016 at 9:18 AM, Pablo Anzorena <anzorena.f...@gmail.com>
>> wrote:
>> > Hey,
>> > So the question is quite simple, Is it possible to use Singleton Pattern
>> > with SolrCloudClient instantiation and then reuse that instance to handle
>> > multiple requests concurrently accessing different collections?
>> >
>> > Thanks.
Re: Is SolrCloudClient Singleton Pattern possible with multiple collections?
I was using

public QueryResponse query(ModifiableSolrParams params, METHOD method)

And my actual code is parsing that object. I can change it to your method, but before that let me ask you if using the param "collection" is the same.

Actually I am using the param "collection" only when I need to request to multiple collections.

Thanks.

2016-07-14 14:15 GMT-03:00 Erick Erickson <erickerick...@gmail.com>:
> Just use the
>
> public NamedList request(SolrRequest request, String collection)
>
> method on the SolrCloudClient?
>
> Best,
> Erick
>
> On Thu, Jul 14, 2016 at 9:18 AM, Pablo Anzorena <anzorena.f...@gmail.com>
> wrote:
> > Hey,
> > So the question is quite simple, Is it possible to use Singleton Pattern
> > with SolrCloudClient instantiation and then reuse that instance to handle
> > multiple requests concurrently accessing different collections?
> >
> > Thanks.
Re: Is SolrCloudClient Singleton Pattern possible with multiple collections?
Just use the

public NamedList request(SolrRequest request, String collection)

method on the SolrCloudClient?

Best,
Erick

On Thu, Jul 14, 2016 at 9:18 AM, Pablo Anzorena wrote:
> Hey,
> So the question is quite simple, Is it possible to use Singleton Pattern
> with SolrCloudClient instantiation and then reuse that instance to handle
> multiple requests concurrently accessing different collections?
>
> Thanks.
Is SolrCloudClient Singleton Pattern possible with multiple collections?
Hey,

So the question is quite simple: is it possible to use the Singleton Pattern with SolrCloudClient instantiation and then reuse that instance to handle multiple requests concurrently, accessing different collections?

Thanks.
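The thread's answer is effectively yes: one shared client instance, with the collection chosen per request. The lazily-initialized shared instance can be sketched language-agnostically; here in Python, with `factory` standing in for constructing the SolrJ client (the construction details are hypothetical):

```python
import threading

# One shared client, created lazily and reused by every thread.
_client = None
_lock = threading.Lock()

def get_client(factory):
    global _client
    if _client is None:          # fast path once initialized, no lock taken
        with _lock:
            if _client is None:  # double-checked under the lock
                _client = factory()
    return _client
```

Every caller then passes the target collection on the request itself (the `request(SolrRequest, String collection)` overload Erick points at) rather than creating a new client per collection.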
Re: SolrCloud multiple collections each with unique schema via SolrJ?
Got it! I now use uploadConfig to load the default config for each new collection I create, and then modify the schema. Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-multiple-collections-each-with-unique-schema-via-SolrJ-tp4277397p4277406.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud multiple collections each with unique schema via SolrJ?
On 5/17/2016 7:00 PM, Boman wrote:
> I load the default config using scripts/cloud-scripts/zkcli.sh -cmd upconfig
> after which collections are created programmatically and the schema modified
> as per each collection's requirements.
>
> I now notice that it is the SAME "default" original schema that holds ALL
> the modifications (new fields). What I really want is that during collection
> creation time (using SolrJ) as follows:
>
> CollectionAdminRequest.Create createRequest = new
> CollectionAdminRequest.Create();
> createRequest.setConfigName("default-config");
>
> the new collection would "inherit" a copy of the default schema, and
> following any updates to that schema, it should remain collection-specific.
>
> Any suggestions on how to achieve this programmatically? Thanks.

If you want a different config/schema combo for each collection, you need to upload a different configset for every collection. When your collections are all using the same config, any change that you make for one of them will affect them all (after reload). You can't share just part of the configset -- it's a cohesive unit covering the solrconfig.xml, the schema, and all the other files in the configset.

Thanks,
Shawn
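Shawn's point — a separate configset per collection so schema changes stay isolated — can be scripted. A sketch that generates one zkcli `upconfig` invocation per collection; the ZK host, config directory layout, and `-config` naming convention here are hypothetical, while the zkcli.sh flags (`-zkhost`, `-cmd upconfig`, `-confdir`, `-confname`) are the ones shipped with Solr:

```python
def upconfig_commands(zkhost, base_confdir, collections):
    """Build one zkcli upconfig command per collection, each uploading a
    dedicated copy of the config under a per-collection confname."""
    return [
        f"scripts/cloud-scripts/zkcli.sh -zkhost {zkhost} "
        f"-cmd upconfig -confdir {base_confdir}/{c} -confname {c}-config"
        for c in collections
    ]
```

Each collection is then created with `collection.configName` pointing at its own confname, so modifying one schema no longer touches the others.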
SolrCloud multiple collections each with unique schema via SolrJ?
I load the default config using scripts/cloud-scripts/zkcli.sh -cmd upconfig, after which collections are created programmatically and the schema modified as per each collection's requirements.

I now notice that it is the SAME "default" original schema that holds ALL the modifications (new fields). What I really want is that during collection creation time (using SolrJ) as follows:

CollectionAdminRequest.Create createRequest = new CollectionAdminRequest.Create();
createRequest.setConfigName("default-config");

the new collection would "inherit" a copy of the default schema, and following any updates to that schema, it should remain collection-specific.

Any suggestions on how to achieve this programmatically? Thanks.

--Boman.
Re: Solrcloud: 1 server, 1 configset, multiple collections, multiple schemas
F*ck. I switched from normal Solr to SolrCloud thanks to the feature that allows creating cores (collections) on-the-fly with the API, without having to tell Solr where to find a schema.xml / a solrconfig.xml and letting it create them itself from a pre-defined configset.

If I understand well, there is actually no way to create a core or a collection from the API, with a defined-at-once configset, without having to run some CLI commands on the remote server?

Thanks for your reply,
Ben
Re: Solrcloud: 1 server, 1 configset, multiple collections, multiple schemas
On 12/7/2015 9:46 AM, bengates wrote:
> If I understand well, there is actually no way to create a core or a
> collection from the API, with a defined-at-once configset, without having to
> do some CLI commands on the remote server?

With SolrCloud, the only step that requires the command line is uploading the configuration to zookeeper, which is done with the zkcli script included with Solr. This script talks to zookeeper over the TCP network socket, so it can be run from anywhere with network access to the zookeeper servers. You do not need to run it directly on the remote Solr server.

With a zookeeper client that's not Solr-specific, you may be able to have even more control, but it won't be as easy as zkcli. I've used the zookeeper plugin for eclipse, but their website seems to be broken. Here's the URL, I hope it starts working at some point:

http://www.massedynamic.org/mediawiki/index.php?title=Eclipse_Plug-in_for_ZooKeeper

Thanks,
Shawn
Re: Solrcloud: 1 server, 1 configset, multiple collections, multiple schemas
You have to upload the different configset with the zookeeper client (this is done for you when you do the examples) using zkcli; see the "upconfig" command here: https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities

Similarly, you need to make changes locally (perhaps after doing a "downconfig") and push them back up. The new Admin UI does allow you to manipulate schemas from the UI, but you have to both have them be "managed" and do the initial upconfig(s) yourself. Now, apart from this step, the rest of the collection operations are available through the API.

Best,
Erick

On Sat, Dec 5, 2015 at 12:56 AM, bengates <benga...@aliceadsl.fr> wrote:
> I understand.
>
> How to do this via the API?
Re: Solrcloud: 1 server, 1 configset, multiple collections, multiple schemas
I understand.

How to do this via the API?
Solrcloud: 1 server, 1 configset, multiple collections, multiple schemas
Hello,

I'm having usage issues with SolrCloud.

What I want to do:
- Manage a Solr server only with the API (create / reload / delete collections, create / replace / delete fields, etc).
- A new collection should start with pre-defined default fields, fieldTypes and copyFields (let's say, field1 and field2 for fields).
- Each collection must have its own schema.

What I've set up yet:
- Installed Solr 5.3.1 in /opt/solr on an Ubuntu 14.04 server
- Installed Zookeeper 3.4.6 in /opt/zookeeper as described in the Solr wiki
- Added line "server.1=127.0.0.1:2888:3888" in /opt/zookeeper/conf/zoo.cfg
- Added line "127.0.0.1:2181" in /var/solr/data/solr.xml
- Told Solr or Zookeeper somewhere (don't remember where I set this up) to use /home/me/configSet/managed-schema.xml and /home/me/configSet/solrconfig.xml for the configSet
- Run Solr on port 8983

My /home/me/configSet/managed-schema.xml contains field1 and field2.

Now let's create a collection:
http://my.remote.addr:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1
- collection1 is created, with field1 and field2. Perfect.

Let's create another collection:
http://my.remote.addr:8983/solr/admin/collections?action=CREATE&name=collection2&numShards=1
- collection2 is created, with field1 and field2. Perfect.

Now, if I add some fields to collection1 by POSTing to http://my.remote.addr:8983/solr/collection1/schema:
- field3 and field4 are successfully added to collection1
- ... but they are also added to collection2 (verified by GETting http://my.remote.addr:8983/solr/collection2/schema/fields)

How to prevent this behavior, since my collections have different kinds of data, and may have the same field names but not the same types?

Thanks,
Ben
Re: Solrcloud: 1 server, 1 configset, multiple collections, multiple schemas
If you want two different collections to have two different schemas, those collections need to reference two different configsets. So you need another copy of your config available using a different name, and to reference that other name when you create the second collection.

On 12/4/15, 6:26 AM, "bengates" <benga...@aliceadsl.fr> wrote:

>Hello,
>
>I'm having usage issues with SolrCloud.
>
>What I want to do:
>- Manage a Solr server only with the API (create / reload / delete
>collections, create / replace / delete fields, etc).
>- A new collection should start with pre-defined default fields, fieldTypes
>and copyFields (let's say, field1 and field2 for fields).
>- Each collection must have its own schema.
>
>What I've set up yet:
>- Installed Solr 5.3.1 in /opt/solr on an Ubuntu 14.04 server
>- Installed Zookeeper 3.4.6 in /opt/zookeeper as described in the solr wiki
>- Added line "server.1=127.0.0.1:2888:3888" in /opt/zookeeper/conf/zoo.cfg
>- Added line "127.0.0.1:2181" in /var/solr/data/solr.xml
>- Told solr or zookeeper somewhere (don't remember where I setup this) to
>use /home/me/configSet/managed-schema.xml and
>/home/me/configSet/solrconfig.xml for configSet
>- Run solr on port 8983
>
>My /home/me/configSet/managed-schema.xml contains field1 and field2.
>
>Now let's create a collection:
>http://my.remote.addr:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1
>- collection1 is created, with field1 and field2. Perfect.
>
>Let's create another collection:
>http://my.remote.addr:8983/solr/admin/collections?action=CREATE&name=collection2&numShards=1
>- collection2 is created, with field1 and field2. Perfect.
>
>Now, if I add some fields to collection1 by POSTing to
>http://my.remote.addr:8983/solr/collection1/schema:
>
>- field3 and field4 are successfully added to collection1
>- ... but they are also added to collection2 (verified by GETting
>http://my.remote.addr:8983/solr/collection2/schema/fields)
>
>How to prevent this behavior, since my collections have different kinds of
>data, and may have the same field names but not the same types?
>
>Thanks,
>Ben
Solrcloud - How to merge multiple collections to a single collection
Is it possible to merge multiple collections into a single collection in SolrCloud 5.x? Say we index daily logs to a collection per day and merge 7 day-collections into a week collection.
Spellcheck across multiple collections
Hi,

Is there a way to collate the spellcheck across different collections? I understand that for a select query, this can be done by setting collection=collection1,collection2 in the query. However, when I do that for spellcheck, Solr does not return me any result on the spellcheck when I enter a wrong spelling in the query. I can only get results when I search on a single collection. I'm using Solr 5.1.

Regards,
Edwin
Query multiple collections together
Hi,

Would like to check, is there a way to query multiple collections together in a single query and return the results in one result set? For example, I have 2 collections and I want to search for records with the word 'solr' in both of the collections. Is there a query to do that, or must I query both collections separately, and get two different result sets?

Regards,
Edwin
Re: Query multiple collections together
You can query multiple collections by specifying the list of collections, e.g.:

http://hostname:port/solr/gettingstarted/select?q=test&collection=collection1,collection2,collection3

On Sun, May 10, 2015 at 11:49 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
> Hi,
>
> Would like to check, is there a way to query multiple collections together
> in a single query and return the results in one result set? For example, I
> have 2 collections and I want to search for records with the word 'solr' in
> both of the collections. Is there a query to do that, or must I query both
> collections separately, and get two different result sets?
>
> Regards,
> Edwin

--
Anshum Gupta
Re: Query multiple collections together
Thank you for the query. Just to confirm, for the 'gettingstarted' in the query, does it matter which collection name I put?

Regards,
Edwin

On 11 May 2015 15:51, Anshum Gupta ans...@anshumgupta.net wrote:
> You can query multiple collections by specifying the list of collections, e.g.:
> http://hostname:port/solr/gettingstarted/select?q=test&collection=collection1,collection2,collection3
>
> On Sun, May 10, 2015 at 11:49 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
> > Hi,
> > Would like to check, is there a way to query multiple collections together in a single query and return the results in one result set? For example, I have 2 collections and I want to search for records with the word 'solr' in both of the collections. Is there a query to do that, or must I query both collections separately, and get two different result sets?
> > Regards,
> > Edwin

--
Anshum Gupta
Re: Query multiple collections together
FWIR, you just need to make sure that it's a valid collection. It doesn't have to be one from the list of collections that you want to query, but the collection name you use in the URL should exist. E.g., assuming you have 2 collections foo (10 docs) and bar (5 docs):

/solr/foo/select?q=*:*&collection=bar  #results: 5
/solr/xyz/select?q=*:*&collection=bar  will lead to a HTTP 404 response
/solr/foo/select?q=*:*  #results: 10

On Mon, May 11, 2015 at 12:59 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
> Thank you for the query. Just to confirm, for the 'gettingstarted' in the query, does it matter which collection name I put?
>
> Regards,
> Edwin
>
> On 11 May 2015 15:51, Anshum Gupta ans...@anshumgupta.net wrote:
> > You can query multiple collections by specifying the list of collections, e.g.:
> > http://hostname:port/solr/gettingstarted/select?q=test&collection=collection1,collection2,collection3
> >
> > On Sun, May 10, 2015 at 11:49 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
> > > Hi,
> > > Would like to check, is there a way to query multiple collections together in a single query and return the results in one result set? For example, I have 2 collections and I want to search for records with the word 'solr' in both of the collections. Is there a query to do that, or must I query both collections separately, and get two different result sets?
> > > Regards,
> > > Edwin

--
Anshum Gupta
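The rule Anshum describes — the collection in the URL path must exist, while the `collection` parameter decides what is actually searched — reduces to simple URL construction. A sketch; the host, port, and collection names are hypothetical:

```python
from urllib.parse import urlencode

def multi_collection_query(base, path_collection, q, collections):
    """Build a select URL that searches the given collections, using
    path_collection (which must exist) only as the request path."""
    params = urlencode({"q": q, "collection": ",".join(collections)})
    return f"{base}/solr/{path_collection}/select?{params}"
```

For example, `multi_collection_query("http://localhost:8983", "foo", "*:*", ["foo", "bar"])` searches both foo and bar even though the path names only foo.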
Re: Query multiple collections together
Ok, thank you so much.

Regards,
Edwin

On 11 May 2015 16:15, Anshum Gupta ans...@anshumgupta.net wrote:
> FWIR, you just need to make sure that it's a valid collection. It doesn't have to be one from the list of collections that you want to query, but the collection name you use in the URL should exist. E.g., assuming you have 2 collections foo (10 docs) and bar (5 docs):
>
> /solr/foo/select?q=*:*&collection=bar  #results: 5
> /solr/xyz/select?q=*:*&collection=bar  will lead to a HTTP 404 response
> /solr/foo/select?q=*:*  #results: 10
>
> On Mon, May 11, 2015 at 12:59 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
> > Thank you for the query. Just to confirm, for the 'gettingstarted' in the query, does it matter which collection name I put?
> >
> > Regards,
> > Edwin
> >
> > On 11 May 2015 15:51, Anshum Gupta ans...@anshumgupta.net wrote:
> > > You can query multiple collections by specifying the list of collections, e.g.:
> > > http://hostname:port/solr/gettingstarted/select?q=test&collection=collection1,collection2,collection3
> > >
> > > On Sun, May 10, 2015 at 11:49 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
> > > > Hi,
> > > > Would like to check, is there a way to query multiple collections together in a single query and return the results in one result set? For example, I have 2 collections and I want to search for records with the word 'solr' in both of the collections. Is there a query to do that, or must I query both collections separately, and get two different result sets?
> > > > Regards,
> > > > Edwin

--
Anshum Gupta
Re: Unable to setup solr cloud with multiple collections.
You're still mixing master/slave with SolrCloud. Do _not_ reconfigure the replication. If you want your core (we call them replicas in SolrCloud) to appear on various nodes in your cluster, either create the collection with the nodes specified (createNodeSet) or, once the collection is created on any node (or set of nodes), do an ADDREPLICA (again with the Collections API) where you want replicas to appear.

The rest is automatic, i.e. the replica's index will be copied from the leader, all updates will be forwarded, etc., without you doing any other configuration. I think you're shooting yourself in the foot by trying to fiddle with replication. Or I misunderstand your problem entirely.

Best,
Erick

On Tue, Mar 24, 2015 at 8:09 PM, sthita sthit...@gmail.com wrote:
> Thanks Erick for your reply. I am trying to create a new core, i.e. dict_cn,
> which is totally different in terms of index data, configs etc. from the
> existing core abc. The core is created successfully on my master (i.e. mail)
> and I can run Solr queries on this newly created core. All the config files
> (schema.xml and solrconfig.xml) are on the mail server and ZooKeeper helps
> me share all config files with the other collections. I did the similar
> setup on the other collection, so that the newly created core should be
> available to all the collections, but it is still showing down.
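Erick's ADDREPLICA suggestion is just a Collections API call. A sketch of building that request URL; the host, collection, shard, and node names are hypothetical, while `action`, `collection`, `shard`, and `node` are the documented parameter names:

```python
def addreplica_url(base, collection, shard, node=None):
    """Build a Collections API ADDREPLICA URL; if node is given, the
    replica is placed on that specific node, otherwise Solr chooses."""
    url = (f"{base}/solr/admin/collections?action=ADDREPLICA"
           f"&collection={collection}&shard={shard}")
    if node:
        url += f"&node={node}"
    return url
```

After this call, the new replica's index is copied from the leader automatically, which is exactly the behavior the hand-configured master/slave replication was (harmfully) duplicating.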
Re: Unable to setup solr cloud with multiple collections.
Why are you doing this in the first place? SolrCloud and master/slave are fundamentally different. When running in SolrCloud mode, there is no need whatsoever to configure replication as per the Wiki link you've outlined above; that's for the older-style master/slave setups. Just change it back and watch the magic would be my advice.

So if you'd tell us why you thought this was necessary, perhaps we can suggest alternatives, because from a quick glance it looks unnecessary, and in fact harmful.

Best,
Erick

On Mon, Mar 23, 2015 at 10:08 PM, sthita sthit...@gmail.com wrote:
> I have newly created a new collection and activated the replication for 4
> nodes (including masters). After doing the config changes as suggested on
> http://wiki.apache.org/solr/SolrReplication the nodes of the newly created
> collection are down on Solr Cloud. We are not able to add or remove any
> document on the newly created core, i.e. dict_cn in our case. All the
> configuration files look ok on Solr Cloud:
> http://lucene.472066.n3.nabble.com/file/n4194833/solr_issue.png
>
> This is my replication change in solrconfig.xml:
>
> <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy">
>   <lst name="master">
>     <str name="replicateAfter">commit</str>
>     <str name="replicateAfter">startup</str>
>     <str name="confFiles">solrconfig_cn.xml,schema_cn.xml</str>
>   </lst>
>   <lst name="slave">
>     <str name="masterUrl">http://mail:8983/solr/dict_cn</str>
>   </lst>
> </requestHandler>
>
> Note: I am using Solr 4.4.0, zookeeper-3.4.5. Can anyone help me on this?
Re: Unable to setup solr cloud with multiple collections.
Thanks Erick for your reply. I am trying to create a new core, i.e. dict_cn, which is totally different in terms of index data, configs etc. from the existing core abc. The core is created successfully on my master (i.e. mail) and I can run Solr queries on this newly created core.

All the config files (schema.xml and solrconfig.xml) are on the mail server and ZooKeeper helps me share all config files with the other collections. I did the similar setup on the other collection, so that the newly created core should be available to all the collections, but it is still showing down.
Unable to setup solr cloud with multiple collections.
I have newly created a new collection and activated the replication for 4 nodes (including masters). After doing the config changes as suggested on http://wiki.apache.org/solr/SolrReplication the nodes of the newly created collection are down on Solr Cloud. We are not able to add or remove any document on the newly created core, i.e. dict_cn in our case. All the configuration files look ok on Solr Cloud:

http://lucene.472066.n3.nabble.com/file/n4194833/solr_issue.png

This is my replication change in solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">solrconfig_cn.xml,schema_cn.xml</str>
  </lst>
  <lst name="slave">
    <str name="masterUrl">http://mail:8983/solr/dict_cn</str>
  </lst>
</requestHandler>

Note: I am using Solr 4.4.0, zookeeper-3.4.5.

Can anyone help me on this?
Re: Can a single SolrServer instance update multiple collections?
@Shawn, I can definitely upgrade to SolrJ 4.x and would prefer that, so as to target 4.x cores as well. I'm already on Java 7. One attempt I made was this:

UpdateRequest updateRequest = new UpdateRequest();
updateRequest.setParam("collection", collectionName);
updateRequest.setMethod(SolrRequest.METHOD.POST);
updateRequest.add(solrdoc);
UpdateResponse updateResponse = updateRequest.process(solrServer);

but I kept getting Bad Request, which I suspect was a SOLR/SolrJ version conflict. I'm all ears!

Dan
Can a single SolrServer instance update multiple collections?
I have a SolrJ application that reads from a Redis queue and updates different collections based on the message content. New collections are added without my knowledge, so I am creating SolrServer objects on the fly as follows:

def solrHost = "http://myhost/solr/" (defined at startup)
def solrTarget = solrHost + collectionName
SolrServer solrServer = new CommonsHttpSolrServer(solrTarget)
updateResponse = solrServer.add(solrdoc)

This does work but obviously creates a new CommonsHttpSolrServer instance for each message. I assume GC will eliminate these, but is there a way to do this with a single SolrServer object? The SOLR host is version 3.5 and I am using the 3.5 jars for my application (not sure if that is necessary).
Re: Can a single SolrServer instance update multiple collections?
On 3/11/2015 12:23 PM, tuxedomoon wrote:
> I have a SolrJ application that reads from a Redis queue and updates different collections based on the message content. New collections are added without my knowledge, so I am creating SolrServer objects on the fly as follows:
>
> def solrHost = "http://myhost/solr/"  // defined at startup
> def solrTarget = solrHost + collectionName
> SolrServer solrServer = new CommonsHttpSolrServer(solrTarget)
> updateResponse = solrServer.add(solrdoc)
>
> This does work but obviously creates a new CommonsHttpSolrServer instance for each message. I assume GC will eliminate these, but is there a way to do this with a single SolrServer object? The Solr host is version 3.5 and I am using the 3.5 jars for my application (not sure if that is necessary).

What you want to accomplish should be possible, with some attention to how SolrJ code is used. We won't talk about SolrCloud, since you're not running Solr 4.x or 5.0. Upgrading the server side is generally more involved than upgrading the client side, and switching to SolrCloud can be a fairly major conceptual leap. To do what I'm thinking about, you will need to upgrade SolrJ. When SolrCloud is not involved, cross-version compatibility between Solr and SolrJ is pretty good, although there can be some hiccups when crossing the 3.x/4.x barrier relating to the update handlers. Those hiccups are normally easy to fix, but they are something you need to be aware of. Once you've decided whether you're upgrading Solr and which version of SolrJ you will upgrade to, we can get down to the actual Java code you'll need. Note that recent 4.x and 5.0 versions require Java 7, so if you're still on Java 6, you'll be limited to version 4.7.2.
It might even be possible to do this with SolrJ 3.5, but I am already pretty familiar with how you can do it using new features in 4.x, and since you're going to need to change the source code anyway, you might as well take advantage of more modern client functionality that will make the code easier to understand. Just FYI, there are changes coming (currently planned for SolrJ 5.1) that will make this VERY easy. Thanks, Shawn
Re: Can a single SolrServer instance update multiple collections?
On 3/11/2015 3:35 PM, tuxedomoon wrote:
> I can definitely upgrade to SolrJ 4.x and would prefer that, so as to target 4.x cores as well. I'm already on Java 7. One attempt I made was this:
>
> UpdateRequest updateRequest = new UpdateRequest();
> updateRequest.setParam("collection", collectionName);
> updateRequest.setMethod(SolrRequest.METHOD.POST);
> updateRequest.add(solrdoc);
> UpdateResponse updateResponse = updateRequest.process(solrServer);
>
> but I kept getting "Bad Request", which I suspect was a Solr/SolrJ version conflict. I'm all ears!

Can you share the full stacktrace? If you can't see it on the client, grab it from the server log. The "collection" request parameter is only useful if you're running SolrCloud. The 3.x versions, and 4.x/5.x in non-cloud mode, should ignore it. UpdateRequest objects are created by default with a POST method, so you don't need to set that. When I have some time to actually work on the code, I'm going to write it using 4.x classes because that's what I have immediate access to, but if you use 5.x, SolrServer becomes SolrClient, and HttpSolrServer becomes HttpSolrClient. I think everything else will be the same. If I'm wrong about that, it very likely will not be very hard to fix. Thanks, Shawn
Re: Can a single SolrServer instance update multiple collections?
@Shawn I'm getting the Bad Request again, with the original code snippet I posted; it appears to be an 'illegal' string field.

SOLR log -

INFO: {add=[mgid:arc:content:jokers.com:694d5bf8-ecfd-11e0-aca6-0026b9414f30]} 0 7
Mar 12, 2015 12:15:09 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=mgid:arc:content:jokers.com:694d5bf8-ecfd-11e0-aca6-0026b9414f30] multiple values encountered for non multiValued field image_url_s: [mgid:file:gsp:movie-assets:/movie-assets/cc/images/shows/miami-beach/episode-thumbnails/specials/iamstupid-the-movie_4x3.jpg, mgid:file:gsp:movie-assets:/movie-assets/cc/images/shows/miami-beach/episode-thumbnails/specials/iamstupid-the-movie_4x3.jpg]
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:158)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)

SolrJ log shows the doc being sent (this is the offending field only):

<field name="image_url_s"></field>

I will investigate on the feeds side; the existing SolrJ code is not the culprit. But I'd still like a more elegant solution. If a SolrJ 5 client can talk to a 3.5 host I'm willing to go there. I know I'm not the only one who would like to address collections on the fly. thx Dan
Re: Can a single SolrServer instance update multiple collections?
On 3/11/2015 4:28 PM, Shawn Heisey wrote: When I have some time to actually work on the code, I'm going to write it using 4.x classes because that's what I have immediate access to, but if you do 5.x, SolrServer becomes SolrClient, and HttpSolrServer becomes HttpSolrClient. At the URL below is the code I came up with. It shows how to do an add, a commit, and a query where the Solr core (collection) is specified as part of the request, rather than the server connection: http://apaste.info/lRi I did test this code successfully, although there was one difference in that code (/update instead of /update/javabin) because my dev Solr server is running 4.9.1, not 3.5. The code I've shared uses SolrJ 4.x, but is tailored to a server running 3.x with a typical 3.x config. I hope this code will work as-is ... and if it doesn't, that it will be easy for you to figure out what I did wrong. If you want to figure out how to use SolrRequest to implement a query with a specific handler path, you could probably implement all of this in SolrJ 3.5, where SolrQuery#setRequestHandler does not exist. I'm sure that if you look at the SolrQuery class and the CommonsHttpSolrServer#query method from the 3.5 source code, you could piece together how to do this. It might be a good idea to abstract these procedures for add, commit, and query into your own local methods that include the collection parameter. If you need it, you can also implement UpdateRequest.ACTION.OPTIMIZE in a similar manner to the way that I used UpdateRequest.ACTION.COMMIT. See the following issue for the recent work that will go into a new 5.x version (probably 5.1), which adds the capability you are seeking directly to HttpSolrClient, implementing abstract methods from SolrClient: https://issues.apache.org/jira/browse/SOLR-7201 Thanks, Shawn
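The pasted code at that URL is no longer retrievable, but the technique Shawn describes — naming the core/collection as part of each request rather than baking it into the server connection — boils down to building a per-request handler path against one shared base URL. A toy sketch of that path construction, in Python for illustration (the handler names shown are common Solr defaults, not taken from the lost paste):

```python
# One shared base connection to the Solr root; the target collection is
# chosen per request by prefixing the handler path with its name.
def handler_path(collection, handler):
    # e.g. handler_path("collection1", "/update") -> "/collection1/update"
    return "/" + collection.strip("/") + handler

def full_url(base, collection, handler):
    return base.rstrip("/") + handler_path(collection, handler)

print(full_url("http://localhost:8983/solr", "collection1", "/update"))
print(full_url("http://localhost:8983/solr", "collection1", "/select"))
```

In SolrJ terms, this is what passing a request path such as `/collection1/update` on each request accomplishes: one client object, any number of collections.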
Use multiple collections having different configuration
Hello, I have a scenario where I want to create/use 2 collections in the same Solr, named collection1 and collection2. I want to use distributed servers. Each collection has multiple shards, and each collection has a different configuration (solrconfig.xml and schema.xml). How can I do this? And if I later want to re-configure one collection, how do I do that? As I understand it, with a single collection having multiple shards we use the upconfig command:

example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir example/solr/collection1/conf -confname default

and restart all the nodes. With 2 collections in the same Solr, how can I re-configure?
Re: Use multiple collections having different configuration
On 2/20/2015 4:06 AM, Nitin Solanki wrote:
> I have a scenario where I want to create/use 2 collections in the same Solr, named collection1 and collection2. I want to use distributed servers. Each collection has multiple shards, and each collection has a different configuration (solrconfig.xml and schema.xml). How can I do this? And if I later want to re-configure one collection, how do I do that? As I understand it, with a single collection having multiple shards we use the upconfig command:
>
> example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir example/solr/collection1/conf -confname default
>
> and restart all the nodes. With 2 collections in the same Solr, how can I re-configure?

First, upload your two different configurations with zkcli upconfig, using two different names. Create your collections with the Collections API, and tell each one to use a different collection.configName. If the collection already exists, use the zkcli linkconfig command and reload the collection. If you need to change a config, edit the config on disk, re-do the zkcli upconfig, and then reload the collection with the Collections API. Alternately, you could upload a whole new config and then link it to the existing collection. The Collections API is not yet exposed in the admin interface, so you will need to make those calls yourself. If you're doing this with SolrJ, there are some objects inside CollectionAdminRequest that let you do all the API actions. Thanks, Shawn
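Shawn's recipe can be written down as a concrete command sequence. A sketch that assembles the calls, in Python for illustration (the config names conf1/conf2, the confdir paths, and the hosts are all hypothetical placeholders):

```python
# Two collections, two config sets: upload each config under its own
# name, then point each CREATE at its config via collection.configName.
# After editing a config, re-run upconfig and RELOAD the collection.
def upconfig_cmd(zkhost, confdir, confname):
    return ("zkcli.sh -zkhost %s -cmd upconfig -confdir %s -confname %s"
            % (zkhost, confdir, confname))

def create_url(solr, name, num_shards, confname):
    return ("%s/admin/collections?action=CREATE&name=%s"
            "&numShards=%d&collection.configName=%s"
            % (solr, name, num_shards, confname))

def reload_url(solr, name):
    return "%s/admin/collections?action=RELOAD&name=%s" % (solr, name)

solr = "http://localhost:8983/solr"
print(upconfig_cmd("localhost:9983", "conf/collection1", "conf1"))
print(upconfig_cmd("localhost:9983", "conf/collection2", "conf2"))
print(create_url(solr, "collection1", 2, "conf1"))
print(create_url(solr, "collection2", 2, "conf2"))
print(reload_url(solr, "collection1"))
```

This is only the shape of the sequence; the zkcli invocation and Collections API parameters should be checked against the Solr version actually in use.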
Re: Use multiple collections having different configuration
Thanks Shawn.. On Fri, Feb 20, 2015 at 7:53 PM, Shawn Heisey apa...@elyograg.org wrote: On 2/20/2015 4:06 AM, Nitin Solanki wrote: I have scenario where I want to create/use 2 collection into same Solr named as collection1 and collection2. I want to use distributed servers. Each collection has multiple shards. Each collection contains different configurations(solrconfig.xml and schema.xml). How can I do? In between, If I want to re-configure any collection then how to do that? As I know, If we use single collection which having multiple shards then we need to use this upconfig link - * example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir example/solr/collection1/conf -confname default * and restart all the nodes. For 2 collections into same solr. How can I do re-configure? First, upload your two different configurations with zkcli upconfig using two different names. Create your collections with the Collections API, and tell each one to use a different collection.configName. If the collection already exists, use the zkcli linkconfig command, and reload the collection. If you need to change a config, edit the config on disk and re-do the zkcli upconfig. Then reload the collection with the Collections API. Alternately you could upload a whole new config and then link it to the existing collection. The Collections API is not yet exposed in the admin interface, you will need to do those calls yourself. If you're doing this with SolrJ, there are some objects inside CollectionAdminRequest that let you do all the API actions. Thanks, Shawn
Re: Bootstrapping SolrCloud cluster with multiple collections in different sharding/replication setup
Hi, got a nice talk on IRC about this. The right thing to do is to start with a clean Solr cluster (no cores) and then create all the proper collections with the Collections API. Ugo

On Thu, Mar 20, 2014 at 7:26 PM, Jeff Wartes jwar...@whitepages.com wrote:

Please note that although the article talks about the ADDREPLICA command, that feature is coming in Solr 4.8, so don't be confused if you can't find it yet. See https://issues.apache.org/jira/browse/SOLR-5130

On 3/20/14, 7:45 AM, Erick Erickson erickerick...@gmail.com wrote:

You might find this useful: http://heliosearch.org/solrcloud-assigning-nodes-machines/ It uses the collections API to create your collection with zero nodes, then shows how to assign your leaders to specific machines (well, at least specify the nodes the leaders will be created on; it doesn't show how to assign, for instance, shard1 to nodeX). It also shows a way to assign specific replicas on specific nodes to specific shards, although as Mark says this is a transitional technique. I know there's an addreplica command in the works for the collections API that should make this easier, but that's not released yet. Best, Erick

On Thu, Mar 20, 2014 at 7:23 AM, Ugo Matrangolo ugo.matrang...@gmail.com wrote:

Hi, I would like some advice about the best way to bootstrap from scratch a SolrCloud cluster housing at least two collections with different sharding/replication setups. Going through the docs and the 'Solr In Action' book, what I have seen so far is that there is a way to bootstrap a SolrCloud cluster with a sharding configuration using -DnumShards=2, but this (afaik) works only for a single collection. What I need is a way to deploy from scratch a SolrCloud cluster housing (e.g.) two collections Foo and Bar, where Foo has only one shard and is replicated everywhere, while Bar has three shards and, again, is replicated.
I can't find a config file in which to put this sharding plan, and I'm starting to think that the only way to do this is after the deploy, using the Collections API. Is there a best-practice way to do this? Ugo
Bootstrapping SolrCloud cluster with multiple collections in different sharding/replication setup
Hi, I would like some advice about the best way to bootstrap from scratch a SolrCloud cluster housing at least two collections with different sharding/replication setups. Going through the docs and the 'Solr In Action' book, what I have seen so far is that there is a way to bootstrap a SolrCloud cluster with a sharding configuration using:

-DnumShards=2

but this (afaik) works only for a single collection. What I need is a way to deploy from scratch a SolrCloud cluster housing (e.g.) two collections Foo and Bar, where Foo has only one shard and is replicated everywhere, while Bar has three shards and, again, is replicated. I can't find a config file in which to put this sharding plan, and I'm starting to think that the only way to do this is after the deploy, using the Collections API. Is there a best-practice way to do this? Ugo
Re: Bootstrapping SolrCloud cluster with multiple collections in different sharding/replication setup
Honestly, the best approach is to start with no collections defined and use the Collections API. If you want to preconfigure (which has its warts and will likely go away as an option), it's tricky to do with different numShards, as that is a global property per node. You would basically set -DnumShards=1 and start your cluster with Foo defined. Then you stop the cluster, define Bar, and start with -DnumShards=3. The ability to preconfigure and bootstrap like this was kind of a transitional system, meant to help people who knew Solr pre-SolrCloud get something up quickly, back before we had a Collections API. The Collections API is much better if you want multiple collections, and it's the future. -- Mark Miller about.me/markrmiller

On March 20, 2014 at 10:24:18 AM, Ugo Matrangolo (ugo.matrang...@gmail.com) wrote:

Hi, I would like some advice about the best way to bootstrap from scratch a SolrCloud cluster housing at least two collections with different sharding/replication setups. Going through the docs and the 'Solr In Action' book, what I have seen so far is that there is a way to bootstrap a SolrCloud cluster with a sharding configuration using -DnumShards=2, but this (afaik) works only for a single collection. What I need is a way to deploy from scratch a SolrCloud cluster housing (e.g.) two collections Foo and Bar, where Foo has only one shard and is replicated everywhere, while Bar has three shards and, again, is replicated. I can't find a config file in which to put this sharding plan, and I'm starting to think that the only way to do this is after the deploy, using the Collections API. Is there a best-practice way to do this? Ugo
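Concretely, the Collections API lets each collection carry its own shard and replica counts in the CREATE request itself, which is exactly what the global -DnumShards bootstrap cannot express. A sketch building the two requests for the Foo/Bar example (Python for illustration; the host and the replicationFactor values are hypothetical):

```python
from urllib.parse import urlencode

def create_collection_url(base, name, num_shards, replication_factor):
    # Each collection gets its own sharding/replication in the request
    # itself -- no global per-node numShards property is involved.
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    })
    return f"{base}/admin/collections?{params}"

foo = create_collection_url("http://localhost:8983/solr", "Foo", 1, 4)
bar = create_collection_url("http://localhost:8983/solr", "Bar", 3, 2)
print(foo)
print(bar)
```

Issuing these two calls against a clean cluster yields the mixed layout the original question asks for, with no stop/start dance between collections.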
Re: Bootstrapping SolrCloud cluster with multiple collections in different sharding/replication setup
You might find this useful: http://heliosearch.org/solrcloud-assigning-nodes-machines/ It uses the collections API to create your collection with zero nodes, then shows how to assign your leaders to specific machines (well, at least specify the nodes the leaders will be created on; it doesn't show how to assign, for instance, shard1 to nodeX). It also shows a way to assign specific replicas on specific nodes to specific shards, although as Mark says this is a transitional technique. I know there's an addreplica command in the works for the collections API that should make this easier, but that's not released yet. Best, Erick

On Thu, Mar 20, 2014 at 7:23 AM, Ugo Matrangolo ugo.matrang...@gmail.com wrote:

Hi, I would like some advice about the best way to bootstrap from scratch a SolrCloud cluster housing at least two collections with different sharding/replication setups. Going through the docs and the 'Solr In Action' book, what I have seen so far is that there is a way to bootstrap a SolrCloud cluster with a sharding configuration using -DnumShards=2, but this (afaik) works only for a single collection. What I need is a way to deploy from scratch a SolrCloud cluster housing (e.g.) two collections Foo and Bar, where Foo has only one shard and is replicated everywhere, while Bar has three shards and, again, is replicated. I can't find a config file in which to put this sharding plan, and I'm starting to think that the only way to do this is after the deploy, using the Collections API. Is there a best-practice way to do this? Ugo
Re: Bootstrapping SolrCloud cluster with multiple collections in different sharding/replication setup
Please note that although the article talks about the ADDREPLICA command, that feature is coming in Solr 4.8, so don't be confused if you can't find it yet. See https://issues.apache.org/jira/browse/SOLR-5130

On 3/20/14, 7:45 AM, Erick Erickson erickerick...@gmail.com wrote:

You might find this useful: http://heliosearch.org/solrcloud-assigning-nodes-machines/ It uses the collections API to create your collection with zero nodes, then shows how to assign your leaders to specific machines (well, at least specify the nodes the leaders will be created on; it doesn't show how to assign, for instance, shard1 to nodeX). It also shows a way to assign specific replicas on specific nodes to specific shards, although as Mark says this is a transitional technique. I know there's an addreplica command in the works for the collections API that should make this easier, but that's not released yet. Best, Erick

On Thu, Mar 20, 2014 at 7:23 AM, Ugo Matrangolo ugo.matrang...@gmail.com wrote:

Hi, I would like some advice about the best way to bootstrap from scratch a SolrCloud cluster housing at least two collections with different sharding/replication setups. Going through the docs and the 'Solr In Action' book, what I have seen so far is that there is a way to bootstrap a SolrCloud cluster with a sharding configuration using -DnumShards=2, but this (afaik) works only for a single collection. What I need is a way to deploy from scratch a SolrCloud cluster housing (e.g.) two collections Foo and Bar, where Foo has only one shard and is replicated everywhere, while Bar has three shards and, again, is replicated. I can't find a config file in which to put this sharding plan, and I'm starting to think that the only way to do this is after the deploy, using the Collections API. Is there a best-practice way to do this? Ugo
Re: SolrCloud: Programmatically create multiple collections?
Hey Shawn, thanks for your reply. I just want to access the base_url easily via a short instanceDir name.
Re: SolrCloud: Programmatically create multiple collections?
Thank you Ani.
Re: SolrCloud: Programmatically create multiple collections?
On 8/14/2013 12:34 AM, xinwu wrote: Hey Shawn .Thanks for your reply. I just want to access the base_url easily by a short instanceDir name. For index updates and queries, you *can* access it by the /solr/mycollection name. Although there may be no core by that name, the base URL will work. Just now, I also tried /solr/mycollection/admin/system, which I expected would NOT work because I have the collection_shardN_replicaN core names. On my 4.2.1 production cloud, this DOES work. Your email had given me the idea of filing a feature request to allow this shortcut, but it appears that it's already a feature. In situations where maxShardsPerNode is used, you wouldn't be able to use that shortcut to get all the info, but you could get most of it. I can think of a workaround for the maxShardsPerNode limitation: If you access /solr/admin/cores on a machine before asking for further info, your program will know what cores exist on that machine, so you'd be able to get ALL info. Thanks, Shawn
Re: SolrCloud: Programmatically create multiple collections?
Hi, Mark. When I manage collections via the Collections API, how can I set the 'instanceDir' name? e.g.: http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=4 My instanceDir is 'mycollection_shard2_replica1'. How can I change it to 'mycollection'?
Re: SolrCloud: Programmatically create multiple collections?
On 8/13/2013 3:07 AM, xinwu wrote:
> When I manage collections via the Collections API, how can I set the 'instanceDir' name? e.g.: http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=4 My instanceDir is 'mycollection_shard2_replica1'. How can I change it to 'mycollection'?

I don't think the Collections API can do this, and to be honest, I don't know why you would want to. It would make it impossible to have more than one shard per Solr node, a capability that many people require. The question of "why would you want to?" is something I'm genuinely asking here. Admin URLs accessed directly by client programs are the only logical reason I can think of. For querying and updating the index, you can use /solr/mycollection as a base URL to access your index, even though the shard names are different. As for the admin URLs that let you access system information, SOLR-4943 will make most of that available without a core name in Solr 4.5. To access core-specific information, you need to use the actual core name, but it should be possible to gather information about which machine has which core in an automated way.

That said, if you create your collection a different way, you should be able to do exactly what you want. What you would want to do is use the zkcli command linkconfig to link a new collection with an already uploaded config set, and then create the individual cores in your collection using the CoreAdmin API instead of the Collections API.

http://wiki.apache.org/solr/SolrCloud#Command_Line_Util
http://wiki.apache.org/solr/SolrCloud#Creating_cores_via_CoreAdmin

Thanks, Shawn
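A sketch of the alternative Shawn outlines: after linking the config set, create each core with an explicit name via the CoreAdmin API. In Python for illustration only; the node address, core name, and shard value are hypothetical, and the exact parameter set should be checked against the CoreAdmin documentation for your Solr version:

```python
from urllib.parse import urlencode

def coreadmin_create_url(node, core_name, collection, shard):
    # CoreAdmin CREATE lets you pick the core name yourself, unlike the
    # Collections API, which generates names like
    # mycollection_shard2_replica1.
    params = urlencode({
        "action": "CREATE",
        "name": core_name,
        "collection": collection,
        "shard": shard,
    })
    return f"{node}/solr/admin/cores?{params}"

url = coreadmin_create_url("http://node1:8983", "mycollection",
                           "mycollection", "shard1")
print(url)
```

As Shawn notes, this trades away the Collections API's automatic placement, so you would issue one such call per core, per node, yourself.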
Re: SolrCloud: Programmatically create multiple collections?
At this point you would need a higher-level service sitting on top of Solr clusters, which also talks to your zk setup, in order to create custom collections on the fly. It's not super difficult, but it seems out of scope for SolrCloud now. Let me know if others have a different opinion. thanks, Ani

On Tue, Aug 13, 2013 at 9:52 AM, Shawn Heisey s...@elyograg.org wrote:

On 8/13/2013 3:07 AM, xinwu wrote:
> When I manage collections via the Collections API, how can I set the 'instanceDir' name? e.g.: http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=4 My instanceDir is 'mycollection_shard2_replica1'. How can I change it to 'mycollection'?

I don't think the Collections API can do this, and to be honest, I don't know why you would want to. It would make it impossible to have more than one shard per Solr node, a capability that many people require. The question of "why would you want to?" is something I'm genuinely asking here. Admin URLs accessed directly by client programs are the only logical reason I can think of. For querying and updating the index, you can use /solr/mycollection as a base URL to access your index, even though the shard names are different. As for the admin URLs that let you access system information, SOLR-4943 will make most of that available without a core name in Solr 4.5. To access core-specific information, you need to use the actual core name, but it should be possible to gather information about which machine has which core in an automated way. That said, if you create your collection a different way, you should be able to do exactly what you want. What you would want to do is use the zkcli command linkconfig to link a new collection with an already uploaded config set, and then create the individual cores in your collection using the CoreAdmin API instead of the Collections API.
http://wiki.apache.org/solr/SolrCloud#Command_Line_Util
http://wiki.apache.org/solr/SolrCloud#Creating_cores_via_CoreAdmin

Thanks, Shawn

-- Anirudha P. Jadhav
Re: Querying multiple collections in SolrCloud
I'd _guess_ that this is unsupported across collections if for no other reason than scores really aren't comparable across collections and the default ordering within groups is score. This is really a federated search type problem. But if it makes sense to use N collections for other reasons, it's really the same thing as grouping functionally, you just send a separate request to each collection and combine the results of those N requests rather than from N groups in a single query. If the collections are hosted on different machines for instance, you might get quicker overall response by firing off parallel queries, It Depends (tm)... Best Erick On Wed, Jun 26, 2013 at 1:46 PM, Chris Toomey ctoo...@gmail.com wrote: Thanks Erick, that's a very helpful answer. Regarding the grouping option, does that require all the docs to be put into a single collection, or could it be done with across N collections (assuming each collection had a common type field for grouping on)? Chris On Wed, Jun 26, 2013 at 7:01 AM, Erick Erickson erickerick...@gmail.com wrote: bq: Would the above setup qualify as multiple compatible collections No. While there may be enough fields in common to form a single query, the TF/IDF calculations will not be compatible and the scores from the various collections will NOT be comparable. So simply getting the list of top N docs will probably be dominated by the docs from a single type. bq: How does SolrCloud combine the query results from multiple collections? It doesn't. SolrCloud sorts the results from multiple nodes in the _same_ collection according to whatever sort criteria are specified, defaulting to score. Say you ask for the top 20 docs. A node from each shard returns the top 20 docs for that shard. The node processing them just merges all the returned lists and only keeps the top 20. I don't think your last two questions are really relevant, SolrCloud isn't built to query multiple collections and return the results coherently. 
The root problem here is that you're trying to compare docs from different collections for goodness to return the top N. This isn't actually hard _except_ when goodness is the score, then it just doesn't work. You can't even compare scores from different queries on the _same_ collection, much less different ones. Consider two collections, books and songs. One consists of lots and lots of text and the ter frequency and inverse doc freq (TF/IDF) will be hugely different than songs. Not to mention field length normalization. Now, all that aside there's an option. Index all the docs in a single collection and use grouping (aka field collapsing) to get a single response that has the top N docs from each type (they'll be in different sections of the original response) and present them to the user however makes sense. You'll get hands on experience in why this isn't something that's easy to do automatically if you try to sort these into a single list by relevance G... Best Erick On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey ctoo...@gmail.com wrote: Thanks Jack for the alternatives. The first is interesting but has the downside of requiring multiple queries to get the full matching docs. The second is interesting and very simple, but has the downside of not being modular and being difficult to configure field boosting when the collections have overlapping field names with different boosts being needed for the same field in different document types. I'd still like to know about the viability of my original approach though too. Chris On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky j...@basetechnology.com wrote: One simple scenario to consider: N+1 collections - one collection per document type with detailed fields for that document type, and one common collection that indexes a subset of the fields. The main user query would be an edismax over the common fields in that main collection. You can then display summary results from the common collection. 
You can also then support drill down into the type-specific collection based on a type field for each document in the main collection. Or, sure, you actually CAN index multiple document types in the same collection - add all the fields to one schema - there is no time or space penalty if most of the field are empty for most documents. -- Jack Krupansky -Original Message- From: Chris Toomey Sent: Tuesday, June 25, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: Querying multiple collections in SolrCloud Hi, I'm investigating using SolrCloud for querying documents of different but similar/related types, and have read through docs. on the wiki and done many searches in these archives, but still have some questions. Thanks in advance for your help
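The client-side combination described above — query each collection separately, then merge the per-collection sorted result lists and keep only the global top N — can be sketched as a straightforward k-way merge. A toy illustration in Python (the documents, the `date` sort key, and the list sizes are made up; each input list must already be sorted by the chosen criterion, since raw scores are not comparable across collections):

```python
import heapq

def merge_top_n(result_lists, n, key):
    # k-way merge of per-collection result lists, each already sorted
    # descending by `key`; keeps only the global top n, mirroring how a
    # SolrCloud node merges per-shard lists within one collection.
    merged = heapq.merge(*result_lists, key=key, reverse=True)
    return [doc for _, doc in zip(range(n), merged)]

books = [{"id": "b1", "date": 9}, {"id": "b2", "date": 5}]
songs = [{"id": "s1", "date": 8}, {"id": "s2", "date": 7}, {"id": "s3", "date": 1}]
top3 = merge_top_n([books, songs], 3, key=lambda d: d["date"])
print([d["id"] for d in top3])   # ['b1', 's1', 's2']
```

This only works cleanly with an externally meaningful sort key (a date, a price, a popularity count); as Erick stresses, merging on relevance score across collections is exactly the case that does not work.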
Re: Querying multiple collections in SolrCloud
bq: Would the above setup qualify as multiple compatible collections

No. While there may be enough fields in common to form a single query, the TF/IDF calculations will not be compatible and the scores from the various collections will NOT be comparable. So simply getting the list of top N docs will probably be dominated by the docs from a single type.

bq: How does SolrCloud combine the query results from multiple collections?

It doesn't. SolrCloud sorts the results from multiple nodes in the _same_ collection according to whatever sort criteria are specified, defaulting to score. Say you ask for the top 20 docs. A node from each shard returns the top 20 docs for that shard. The node processing them just merges all the returned lists and only keeps the top 20. I don't think your last two questions are really relevant; SolrCloud isn't built to query multiple collections and return the results coherently.

The root problem here is that you're trying to compare docs from different collections for "goodness" to return the top N. This isn't actually hard _except_ when goodness is the score, and then it just doesn't work. You can't even compare scores from different queries on the _same_ collection, much less different ones. Consider two collections, books and songs. One consists of lots and lots of text, so the term frequency and inverse doc freq (TF/IDF) will be hugely different than for songs. Not to mention field length normalization.

Now, all that aside, there's an option. Index all the docs in a single collection and use grouping (aka field collapsing) to get a single response that has the top N docs from each type (they'll be in different sections of the original response) and present them to the user however makes sense. You'll get hands-on experience in why this isn't something that's easy to do automatically if you try to sort these into a single list by relevance <G>...
Best
Erick
Re: Querying multiple collections in SolrCloud
Thanks Erick, that's a very helpful answer. Regarding the grouping option, does that require all the docs to be put into a single collection, or could it be done across N collections (assuming each collection had a common type field for grouping on)?

Chris
Querying multiple collections in SolrCloud
Hi, I'm investigating using SolrCloud for querying documents of different but similar/related types, and have read through docs on the wiki and done many searches in these archives, but still have some questions. Thanks in advance for your help.

Setup:
* Say that I have N distinct types of documents and I want to do queries that return the best matches regardless of document type. I.e., something akin to a Google search where I'd like to get the best matches from the web, news, images, and maps.
* Our main use case is supporting simple user-entered searches, which would just contain terms / phrases and wouldn't specify fields.
* The document types will not all have the same fields, though there may be some overlap in the fields.
* We plan to use a separate collection for each document type, and to use the eDisMax query parser. Each collection would have a document-specific schema configuration with appropriate defaults for query fields and boosts, etc.

Questions:
* Would the above setup qualify as "multiple compatible collections", such that we could search all N collections with a single SolrCloud query, as in the example query http://localhost:8983/solr/collection1/select?q=apple%20pie&collection=c1,c2,...,cN? Again, we're not querying against specific fields.
* How does SolrCloud combine the query results from multiple collections? Does it re-sort the combined result set, or does it just return the concatenation of the (unmerged) results from each of the collections?
* Does SolrCloud impose any restrictions on querying multiple, sharded collections? I know it supports querying say all 3 shards of a single collection, so want to make sure it would also support say all Nx3 shards of N collections.
* When SolrCloud queries multiple shards/collections, it queries them concurrently vs. serially, correct?

thanks much,
Chris
Re: Querying multiple collections in SolrCloud
One simple scenario to consider: N+1 collections - one collection per document type with detailed fields for that document type, and one common collection that indexes a subset of the fields. The main user query would be an edismax over the common fields in that main collection. You can then display summary results from the common collection. You can also then support drill down into the type-specific collection based on a type field for each document in the main collection.

Or, sure, you actually CAN index multiple document types in the same collection - add all the fields to one schema - there is no time or space penalty if most of the fields are empty for most documents.

-- Jack Krupansky
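To make Jack's N+1 layout concrete, here is a minimal, hypothetical sketch: a first edismax query against an assumed common collection, then a drill-down query against a per-type collection chosen from a hit's type field. The base URL, collection names, and `qf` fields are all invented for the example:

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed base URL

def main_search_url(q):
    # edismax search over the common collection's shared fields
    params = {"q": q, "defType": "edismax", "qf": "title text", "wt": "json"}
    return f"{SOLR}/common/select?" + urlencode(params)

def drill_down_url(doc_type, doc_id):
    # follow-up query against the type-specific collection for full detail
    params = {"q": f"id:{doc_id}", "wt": "json"}
    return f"{SOLR}/{doc_type}/select?" + urlencode(params)

print(main_search_url("apple pie"))
print(drill_down_url("books", "1571"))
```

This is the source of the "requires multiple queries" downside Chris notes in his reply: every detail view costs a second round trip.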
Re: Querying multiple collections in SolrCloud
Thanks Jack for the alternatives. The first is interesting but has the downside of requiring multiple queries to get the full matching docs. The second is interesting and very simple, but has the downside of not being modular, and of making it difficult to configure field boosting when the collections have overlapping field names with different boosts needed for the same field in different document types.

I'd still like to know about the viability of my original approach though too.

Chris
Re: Search across multiple collections
You pretty much need to issue separate queries against each collection and creatively combine them. All of Solr's distributed search stuff pre-supposes two things:
1 the schemas are very similar
2 the types of docs in each collection are also very similar.

2 is a bit subtle. If you store different kinds of docs in different cores, then the statistics for term frequency etc. will be different. There's some work being done (I think) to support distributed tf/idf. But anyway, in this case the scores of the docs from one collection will tend to dominate the result set.

Or, if you're talking about joining, see Anria's comments.

Best
Erick
Search across multiple collections
I am not sure of the best way to search across multiple collections using Solr 4.3. Suppose each collection has its own config files and I perform various operations on collections individually, but when I search I want the search to happen across all collections. Can someone let me know how to perform a search on multiple collections? Do I need to use sharding again?

--
View this message in context: http://lucene.472066.n3.nabble.com/Search-across-multiple-collections-tp4068469.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Search across multiple collections
hi

I've successfully searched over several separate collections (cores with unique schemas) using this kind of syntax. This demonstrates a 2-core search:

http://localhost:8983/solr/collection1/select?
  q=my phrase to search on
  &start=0
  &rows=25
  &fl=*,score
  &fq={!join+fromIndex=collection2+from=sku+to=sku}id:1571

I've split up the parameters so you can see them easily:

fq={!join+fromIndex=collection2+from=sku+to=sku}id:1571

-- collection1/select = use the select requestHandler out of collection1 as a base
-- collection2 is the 2nd core: the equivalent of a table join in SQL
-- sku is the field shared by both collection1 and collection2
-- id is the field I want to find the id=1571 in.

Hope this helps
Anria
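A small sketch of how the cross-core join filter in Anria's message might be assembled programmatically. The core names, fields, and id value are taken from her example; the helper functions themselves are hypothetical:

```python
from urllib.parse import urlencode

def join_fq(from_index, from_field, to_field, query):
    # Solr's join query parser: {!join fromIndex=... from=... to=...}<query>
    return f"{{!join fromIndex={from_index} from={from_field} to={to_field}}}{query}"

def search_with_join(base="http://localhost:8983/solr/collection1/select"):
    # Main query runs on collection1; the fq restricts results to docs
    # whose sku matches a collection2 doc with id:1571.
    params = {
        "q": "my phrase to search on",
        "start": 0,
        "rows": 25,
        "fl": "*,score",
        "fq": join_fq("collection2", "sku", "sku", "id:1571"),
    }
    return base + "?" + urlencode(params)

print(search_with_join())
```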
Searching across multiple collections (cores)
I've been looking all over for a clear answer to this question and can't seem to find one. It seems like a very basic concept to me though so maybe I'm using the wrong terminology. I want to be able to search across multiple collections (as it is now called in SolrCloud world, previously called Cores). I want the scoring, sorting, faceting etc. to be blended, that is to be relevant to data from all the collections, not just a set of independent results per collection. Is that possible? A real-world example would be a merchandise site that has books, movies and music. The index for each of those is quite different and they would have their own schema.xml (and therefore be their own Collection). When in the 'books' area of a website the users could search on fields specific to books (ISBN for example). However on a 'home' page a search would span across all 3 product lines, and the results should be scored relative to each other, not just relative to other items in their specific collection. Is this possible in v4.0? I'm pretty sure it wasn't in v1.4.1. But it seems to be a fundamentally useful concept, I was wondering if it had been addressed yet. Thanks, Ken

--
View this message in context: http://lucene.472066.n3.nabble.com/Searching-across-multiple-collections-cores-tp4047457.html
Re: Searching across multiple collections (cores)
Yes, with SolrCloud, it's just the collection param (as long as the schemas are compatible for this): http://wiki.apache.org/solr/SolrCloud#Distributed_Requests

- Mark
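A hedged sketch of the multi-collection request Mark is referring to; the collection names are placeholders, and the request can be sent to any one collection, with the collection param fanning it out:

```python
from urllib.parse import urlencode

def multi_collection_url(collections, q,
                         base="http://localhost:8983/solr/collection1/select"):
    # The collection param lists every collection the query should span;
    # scores are only meaningful if the schemas (and stats) are compatible.
    params = {"q": q, "collection": ",".join(collections)}
    return base + "?" + urlencode(params)

print(multi_collection_url(["c1", "c2", "c3"], "apple pie"))
```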
Re: Multiple Collections in one Zookeeper
Ok, I'm a little confused. I had originally bootstrapped zookeeper using a solr.xml file which specified the following cores: cats, dogs, birds.

In my /solr/#/cloud?view=tree view I see that I have:

/collections
  /cats
  /dogs
  /birds
/configs
  /cats
  /dogs
  /birds

When I launch a new server and connect it to zookeeper, it creates all three collections. What I'd like to do is move cats to its own set of boxes. When I run:

java -DzkHost=zookeeper:9893/cats -jar start.jar

or

java -DzkHost=zookeeper:9893,zookeeper:9893/cats -jar start.jar

I get this error:

SEVERE: Could not create Overseer node

For simplicity, I'd like to only have one zookeeper ensemble.

--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Collections-in-one-Zookeeper-tp4045936p4045981.html
Re: Multiple Collections in one Zookeeper
You want to create both under different root nodes in zk, so that you would have /cluster1 and /cluster2. Then you start up with addresses of:

zookeeper:{port1},zookeeper:{port2}/cluster1
zookeeper:{port1},zookeeper:{port2}/cluster2

If you are using one of the bootstrap calls on startup, it should create those for you with Solr 4.1; otherwise you have to create the root nodes ahead of time (you can use the zkcli tool we provide).

- mark
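A tiny sketch of composing the chroot-suffixed connect strings Mark describes; hostnames and ports are placeholders. The key detail is that the chroot path goes once at the end of the whole ensemble list, not on each host:

```python
def zk_connect_string(hosts, chroot=None):
    """Join ZooKeeper host:port pairs, optionally appending a chroot path."""
    s = ",".join(hosts)
    if chroot:
        s += "/" + chroot.lstrip("/")  # single chroot suffix for the ensemble
    return s

hosts = ["zookeeper:2181", "zookeeper:2182"]
print(zk_connect_string(hosts, "/cluster1"))
print(zk_connect_string(hosts, "/cluster2"))
```

This would produce the `-DzkHost=` values for each cluster's startup command.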
Multiple Collections in one Zookeeper
Hi, I have a solrcloud cluster running several cores and pointing at one zookeeper. For performance reasons, I'd like to move one of the cores onto its own dedicated cluster of servers. Can I use the same zookeeper to keep track of both clusters?

Thanks! Jim

--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Collections-in-one-Zookeeper-tp4045936.html
Re: Multiple Collections in one Zookeeper
Yes, but you'll need to append a sub path onto the zookeeper path for your second cluster. For ex:

zookeeper1.example.com,zookeeper2.example.com,zookeeper3.example.com/subpath