Re: Using of Streaming to join between shards

2017-06-25 Thread mganeshs
Hi Erick,

My scenario involves two kinds of Solr documents:

Document #1 - Real document
#D_uniqueId #D_documentId(unique), #D_documentname, #D_documentdesc,
#D_documentinfo1, #D_documentInfo2, #D_documentInfo3, ... 

Document #2 - holds the document's ACL
#P_uniqueId #P_acl_perm (a multi-valued field containing user IDs such as
U1, U2, U3, U4, etc.)

Currently (we have only one shard as of now), with a simple join my query
looks like {!join from=P_uniqueId to=D_uniqueId}P_acl_perm:U1

The number of ACL values per document can grow to up to 1M entries.

Now, as the number of documents is increasing, we are planning to add one
more shard by splitting the existing shard in two.

Since join doesn't work across multiple shards, we are planning to use
streaming expressions.

So what should the streaming query be to replace this normal join query
({!join from=P_uniqueId to=D_uniqueId}P_acl_perm:U1)?
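
For illustration, would it be something like an innerJoin of two /export
streams? This is just a sketch of what I have in mind, assuming both
document types live in one collection named "mycollection" (field lists
abbreviated; both sides sorted on the join keys, as the docs require):

innerJoin(
  search(mycollection, q="*:*", fl="D_uniqueId,D_documentname",
         sort="D_uniqueId asc", qt="/export"),
  search(mycollection, q="P_acl_perm:U1", fl="P_uniqueId",
         sort="P_uniqueId asc", qt="/export"),
  on="D_uniqueId=P_uniqueId"
)

Or would hashJoin be preferable when the ACL side is large?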

Early responses would be really appreciated!

Regards,





Re: Boosting Documents using the field Value

2017-06-25 Thread govind nitk
Hi Erick,

This is exactly what I was looking for.
Thanks a lot.


Regards,
Govind

On Mon, Jun 26, 2017 at 12:03 AM, Erick Erickson 
wrote:

> Take a look at function queries. You're probably looking for "field",
> "termfreq" and "if" functions or some other combination like that.
>
> On Sun, Jun 25, 2017 at 9:01 AM, govind nitk 
> wrote:
> > Hi Erik, Thanks for the reply.
> >
> > My intention in using domain_ct in qf was to apply the weight
> > stored in that document.
> >
> > e.g.
> > qf=category^domain_ct
> >
> > If the current query matches in category, the boost applied would be
> > domain_ct, taken from the matched document.
> >
> >
> > So if I have category_1ct, category_2ct, category_3ct, category_4ct as 4
> > indexed categories (text_general fields) and the same document has
> > domain_1ct, domain_2ct, domain_3ct, domain_4ct as 4 different count
> > fields (int), is there any way to achieve:
> >
> > qf=category_1ct^domain_1ct&qf=category_2ct^domain_2ct&qf=category_3ct^domain_3ct&qf=category_4ct^domain_4ct
> > ?
> >
> >
> >
> >
> > Regards
> >
> >
> >
> >
> > On Sat, Jun 24, 2017 at 3:42 PM, Erik Hatcher 
> > wrote:
> >
> >> With dismax, use bf=domain_ct. You can also use boost=domain_ct with
> >> edismax.
> >>
> >> > On Jun 23, 2017, at 23:01, govind nitk  wrote:
> >> >
> >> > Hi Solr,
> >> >
> >> > My Index Data:
> >> >
> >> > id name category domain domain_ct
> >> > 1 Banana Fruits Home > Fruits > Banana 2
> >> > 2 Orange Fruits Home > Fruits > Orange 4
> >> > 3 Samsung Mobile Electronics > Mobile > Samsung 3
> >> >
> >> >
> >> > I am able to retrieve the documents with the dismax parser, with the
> >> > weights given as below:
> >> >
> >> > http://localhost:8983/solr/my_index/select?defType=dismax&indent=on&q=fruits&qf=category^0.9&qf=name^0.7&wt=json
> >> >
> >> >
> >> > Is it possible to retrieve the documents with the weight taken from an
> >> > indexed field, like:
> >> >
> >> > http://localhost:8983/solr/my_index/select?defType=dismax&indent=on&q=fruits&qf=category^domain_ct&qf=name^domain_ct&wt=json
> >> >
> >> > Is it possible to give the weight from an indexed field? Am I doing
> >> > something wrong? Is there any other way of doing this?
> >> >
> >> >
> >> > Regards
> >>
>


async backup

2017-06-25 Thread Damien Kamerman
I've noticed an issue with the Solr 6.5.1 Collections API BACKUP async
command returning early: the state is reported as finished well before one
of the shards has finished.

The collection I'm backing up has 12 shards across 6 nodes and I suspect
the issue is that it is not waiting for all backups on the node to finish.
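
For reference, the sequence is roughly this (collection, location and
request id are placeholders):

/admin/collections?action=BACKUP&name=nightly&collection=mycoll&location=/backups&async=backup-001
/admin/collections?action=REQUESTSTATUS&requestid=backup-001

The REQUESTSTATUS call reports completed while shard backups are still
running.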

Alternatively, if I change the request to not be async it works OK, but
sometimes I get the exception "backup the collection time out:180s".

Has anyone seen this, or knows a workaround?

Cheers,
Damien.


Re: SOLR Suggester returns either the full field value or single terms only

2017-06-25 Thread govind nitk
Hi Angel,

Please look at these documents:
1. https://home.apache.org/~ctargett/RefGuidePOC/jekyll-full/suggester.html
2.
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
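
A minimal sketch of the shingle-based approach from the second link (the
type name and parameter values here are only an example): index a shingled
copy of the field and build the suggester dictionary from it, so the
suggestions become 1-3 word phrases rather than single terms or the whole
field value.

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits "brown", "brown fox", "brown fox jumped", etc. -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>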


Regards,
Govind



On Mon, Jun 26, 2017 at 3:12 AM, Angel Todorov  wrote:

> Hi guys,
>
> I am trying to configure the Suggester in a way that I get Google-style
> auto-suggestions:
>
> - I don't want the suggestions to be the _whole_ field value
> - I don't want the suggestions to be single terms
>
> For example, if I have a field that has the value "The brown fox jumped
> over the fence"
>
> and I type "br", I would like to get things like "Brown" and
> "brown fox", but not the whole sentence. Also, I don't want my results to
> be just single terms, but to also include phrases.
>
> I have tried a lot of configurations and have found that if I use the
> DocumentDictionaryFactory, I get the whole field value as a result, and if
> I use the fuzzy dictionary, I only get single terms as results. Nothing
> close to my requirements. My config is properly set up and my field is
> stored, because I am getting results; it's just that the results are not
> what I'd expect.
>
> Would greatly appreciate it if you could guide me to the right config.
>
> Thanks,
> Angel
>


admin/metrics API or read JMX by jolokia?

2017-06-25 Thread S G
Hi,

The admin/metrics API in the 6.x versions of Solr seems to be very good.
Is it performance-friendly as well?

We want to use this API to query the metrics every minute or so from all
Solr nodes and push them to Grafana.
How does this compare with the performance overhead of reading JMX metrics
via Jolokia?
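
The kind of call we have in mind looks like this (group/prefix are the
filtering params from the 6.x metrics API docs; the prefix here is just an
example, chosen to keep the payload small):

http://localhost:8983/solr/admin/metrics?wt=json&group=core&prefix=QUERY./select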

The REST API is surely easier to understand and parse.
However, it involves making a REST call that passes through Jetty and
probably takes up a thread per request, etc.
Is Jolokia lighter weight in this respect?

Any recommendation on this would be great.

Thanks
SG


SOLR Suggester returns either the full field value or single terms only

2017-06-25 Thread Angel Todorov
Hi guys,

I am trying to configure the Suggester in a way that I get Google-style
auto-suggestions:

- I don't want the suggestions to be the _whole_ field value
- I don't want the suggestions to be single terms

For example, if I have a field that has the value "The brown fox jumped
over the fence"

and I type "br", I would like to get things like "Brown" and
"brown fox", but not the whole sentence. Also, I don't want my results to
be just single terms, but to also include phrases.

I have tried a lot of configurations and have found that if I use the
DocumentDictionaryFactory, I get the whole field value as a result, and if
I use the fuzzy dictionary, I only get single terms as results. Nothing
close to my requirements. My config is properly set up and my field is
stored, because I am getting results; it's just that the results are not
what I'd expect.

Would greatly appreciate it if you could guide me to the right config.

Thanks,
Angel


Re: Boosting Documents using the field Value

2017-06-25 Thread Erick Erickson
Take a look at function queries. You're probably looking for "field",
"termfreq" and "if" functions or some other combination like that.

On Sun, Jun 25, 2017 at 9:01 AM, govind nitk  wrote:
> Hi Erik, Thanks for the reply.
>
> My intention in using domain_ct in qf was to apply the weight
> stored in that document.
>
> e.g.
> qf=category^domain_ct
>
> If the current query matches in category, the boost applied would be
> domain_ct, taken from the matched document.
>
>
> So if I have category_1ct, category_2ct, category_3ct, category_4ct as 4
> indexed categories (text_general fields) and the same document has
> domain_1ct, domain_2ct, domain_3ct, domain_4ct as 4 different count
> fields (int), is there any way to achieve:
>
> qf=category_1ct^domain_1ct&qf=category_2ct^domain_2ct&qf=category_3ct^domain_3ct&qf=category_4ct^domain_4ct
> ?
>
>
>
>
> Regards
>
>
>
>
> On Sat, Jun 24, 2017 at 3:42 PM, Erik Hatcher 
> wrote:
>
>> With dismax, use bf=domain_ct. You can also use boost=domain_ct with
>> edismax.
>>
>> > On Jun 23, 2017, at 23:01, govind nitk  wrote:
>> >
>> > Hi Solr,
>> >
>> > My Index Data:
>> >
>> > id name category domain domain_ct
>> > 1 Banana Fruits Home > Fruits > Banana 2
>> > 2 Orange Fruits Home > Fruits > Orange 4
>> > 3 Samsung Mobile Electronics > Mobile > Samsung 3
>> >
>> >
>> > I am able to retrieve the documents with the dismax parser, with the
>> > weights given as below:
>> >
>> > http://localhost:8983/solr/my_index/select?defType=dismax&indent=on&q=fruits&qf=category^0.9&qf=name^0.7&wt=json
>> >
>> >
>> > Is it possible to retrieve the documents with the weight taken from an
>> > indexed field, like:
>> >
>> > http://localhost:8983/solr/my_index/select?defType=dismax&indent=on&q=fruits&qf=category^domain_ct&qf=name^domain_ct&wt=json
>> >
>> > Is it possible to give the weight from an indexed field? Am I doing
>> > something wrong? Is there any other way of doing this?
>> >
>> >
>> > Regards
>>


Re: Boosting Documents using the field Value

2017-06-25 Thread govind nitk
Hi Erik, Thanks for the reply.

My intention in using domain_ct in qf was to apply the weight
stored in that document.

e.g.
qf=category^domain_ct

If the current query matches in category, the boost applied would be
domain_ct, taken from the matched document.


So if I have category_1ct, category_2ct, category_3ct, category_4ct as 4
indexed categories (text_general fields) and the same document has
domain_1ct, domain_2ct, domain_3ct, domain_4ct as 4 different count
fields (int), is there any way to achieve:

qf=category_1ct^domain_1ct&qf=category_2ct^domain_2ct&qf=category_3ct^domain_3ct&qf=category_4ct^domain_4ct ?




Regards




On Sat, Jun 24, 2017 at 3:42 PM, Erik Hatcher 
wrote:

> With dismax, use bf=domain_ct. You can also use boost=domain_ct with
> edismax.
>
> > On Jun 23, 2017, at 23:01, govind nitk  wrote:
> >
> > Hi Solr,
> >
> > My Index Data:
> >
> > id name category domain domain_ct
> > 1 Banana Fruits Home > Fruits > Banana 2
> > 2 Orange Fruits Home > Fruits > Orange 4
> > 3 Samsung Mobile Electronics > Mobile > Samsung 3
> >
> >
> > I am able to retrieve the documents with the dismax parser, with the
> > weights given as below:
> >
> > http://localhost:8983/solr/my_index/select?defType=dismax&indent=on&q=fruits&qf=category^0.9&qf=name^0.7&wt=json
> >
> >
> > Is it possible to retrieve the documents with the weight taken from an
> > indexed field, like:
> >
> > http://localhost:8983/solr/my_index/select?defType=dismax&indent=on&q=fruits&qf=category^domain_ct&qf=name^domain_ct&wt=json
> >
> > Is it possible to give the weight from an indexed field? Am I doing
> > something wrong? Is there any other way of doing this?
> >
> >
> > Regards
>


Re: Swapping indexes on disk

2017-06-25 Thread Mike Lissner
Weirdly, this happened again today, deleting a brand new 300GB index that
we created after last time and which had been working for several days.

This time the index was deleted by our log rotation script, which
restarted Solr, so it's very easy to see what happened before and after
the problem. Before the restart, queries were working just fine; then,
during shutdown, the only problematic lines are the last ones, which show:
590125499 [Thread-0] WARN  org.eclipse.jetty.util.thread.QueuedThreadPool
– 1 threads could not be stopped
590125500 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
– Couldn't stop Thread[qtp597653135-10081,5,main]
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.lucene.search.FieldValueHitQueue$OneComparatorFieldValueHitQueue.lessThan(FieldValueHitQueue.java:84)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.lucene.search.FieldValueHitQueue$OneComparatorFieldValueHitQueue.lessThan(FieldValueHitQueue.java:58)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:246)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at org.apache.lucene.util.PriorityQueue.pop(PriorityQueue.java:177)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.lucene.search.TopFieldCollector.populateResults(TopFieldCollector.java:1207)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.lucene.search.TopDocsCollector.topDocs(TopDocsCollector.java:156)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1622)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:514)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:484)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
590125501 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.handler.ContextHandlerCollecti

Re: solr /export handler - behavior during close()

2017-06-25 Thread Susmit Shukla
Hi Joel,

I looked at the fix for SOLR-10698; there could be two potential issues:

- ParallelStream does not set the stream context on the newly created
SolrStreams in its open() method.

- This results in the creation of a new, uncached HttpSolrClient in the
open() method of SolrStream. That client is created using deprecated
methods of the HTTP client library (HttpClientUtil.createClient) and
behaves differently on close() than one created via the HttpClientBuilder
API. SolrClientCache uses the same deprecated API too.
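
In code terms, I'd expect open() to do roughly this (a sketch against the
6.x streaming API; variable names are mine):

SolrStream solrStream = new SolrStream(shardUrl, paramsLoc);
// the line below is what appears to be missing, so the cached client
// in the StreamContext's SolrClientCache is never used:
solrStream.setStreamContext(streamContext);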

This test case shows the problem:

ParallelStream ps = new ParallelStream(zkHost, collection, tupleStream,
    numWorkers, comparator);
ps.open();

for (int i = 0; i < 2; i++) {  // stop after reading only 2 tuples
  ps.read();
}

ps.close();  // close() reads through to the end of tupleStream

I tried with an HttpClient created by
org.apache.http.impl.client.HttpClientBuilder.create(), and close() works
correctly with that.
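
For reference, the construction that closed cleanly in my test looks
roughly like this (a sketch; the URL is a placeholder):

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

CloseableHttpClient httpClient = HttpClientBuilder.create().build();
HttpSolrClient solrClient =
    new HttpSolrClient.Builder("http://host:8983/solr/collection1")
        .withHttpClient(httpClient)
        .build();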


Thanks,

Susmit

On Wed, May 17, 2017 at 7:33 AM, Susmit Shukla 
wrote:

> Thanks Joel, will try that.
> Binary response would be more performant.
> I observed that the server sends responses in 32 KB chunks and the client
> reads them with an 8 KB buffer on the input stream. I don't know if
> changing that would impact performance. Even if the buffer size is
> increased on the HttpClient, it can't override the hardcoded 8 KB buffer
> in sun.nio.cs.StreamDecoder.
>
> Thanks,
> Susmit
>
> On Wed, May 17, 2017 at 5:49 AM, Joel Bernstein 
> wrote:
>
>> Susmit,
>>
>> You could wrap a LimitStream around the outside of all the relational
>> algebra. For example:
>>
>> parallel(limit(intersect(intersect(search, search), union(search, search))))
>>
>> In this scenario the limit would happen on the workers.
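>>
>> Something like this is the shape I have in mind for that custom stream
>> (just a sketch, untested; this stream doesn't exist in Solr yet):
>>
>> import java.io.IOException;
>> import java.util.*;
>> import org.apache.solr.client.solrj.io.Tuple;
>> import org.apache.solr.client.solrj.io.comp.StreamComparator;
>> import org.apache.solr.client.solrj.io.stream.*;
>> import org.apache.solr.client.solrj.io.stream.expr.*;
>>
>> public class LimitStream extends TupleStream {
>>
>>   private final TupleStream stream;
>>   private final int limit;
>>   private int count;
>>
>>   public LimitStream(TupleStream stream, int limit) {
>>     this.stream = stream;
>>     this.limit = limit;
>>   }
>>
>>   public void open() throws IOException { stream.open(); }
>>   public void close() throws IOException { stream.close(); }
>>   public void setStreamContext(StreamContext context) {
>>     stream.setStreamContext(context);
>>   }
>>   public List<TupleStream> children() {
>>     return Collections.singletonList(stream);
>>   }
>>   public StreamComparator getStreamSort() { return stream.getStreamSort(); }
>>
>>   public Explanation toExplanation(StreamFactory factory) throws IOException {
>>     return stream.toExplanation(factory); // good enough for a sketch
>>   }
>>
>>   public Tuple read() throws IOException {
>>     if (++count > limit) {
>>       // return the EOF tuple so the workers close the underlying streams
>>       Map<String, Object> m = new HashMap<>();
>>       m.put("EOF", true);
>>       return new Tuple(m);
>>     }
>>     return stream.read();
>>   }
>> }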
>>
>> As far as the worker/replica ratio goes, this will depend on how heavy the
>> export is. If it's a light export (a small number of fields, mostly
>> numeric, simple sort params), then I've seen a ratio of 5 workers to 1
>> replica work well. This will basically saturate the CPU on the replica.
>> But heavier exports will saturate the replicas with fewer workers.
>>
>> Also I tend to use Direct DocValues to get the best performance. I'm not
>> sure how much difference this makes, but it should eliminate the
>> compression overhead fetching the data from the DocValues.
>>
>> Varun's suggestion of using the binary transport will provide a nice
>> performance increase as well. But you'll need to upgrade. You may need to
>> do that anyway as the fix on the early stream close will be on a later
>> version that was refactored to support the binary transport.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, May 16, 2017 at 8:03 PM, Joel Bernstein 
>> wrote:
>>
>> > Yep, saw it. I'll comment on the ticket for what I believe needs to be
>> > done.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Tue, May 16, 2017 at 8:00 PM, Varun Thacker 
>> wrote:
>> >
>> >> Hi Joel, Susmit,
>> >>
>> >> I created https://issues.apache.org/jira/browse/SOLR-10698 to track the
>> >> issue.
>> >>
>> >> @Susmit, looking at the stack trace I see the expression is using
>> >> JSONTupleStream. I wonder, if you tried using JavabinTupleStreamParser,
>> >> could it help improve performance?
>> >>
>> >> On Tue, May 16, 2017 at 9:39 AM, Susmit Shukla <shukla.sus...@gmail.com>
>> >> wrote:
>> >>
>> >> > Hi Joel,
>> >> >
>> >> > queries can be arbitrarily nested with AND/OR/NOT joins, e.g.
>> >> >
>> >> > (intersect(intersect(search, search), union(search, search))). If I
>> >> > cut off the innermost stream with a limit, the complete intersection
>> >> > would not happen at upper levels. Also, would the limit stream have
>> >> > the same effect as using the /select handler with the rows parameter?
>> >> > I am trying to force the input stream close through reflection, just
>> >> > to see if it gives performance gains.
>> >> >
>> >> > 2) I would experiment with null streams. Is workers = number of
>> >> > replicas in the data collection a good rule of thumb? Is
>> >> > ParallelStream performance upper-bounded by the number of replicas?
>> >> >
>> >> > Thanks,
>> >> > Susmit
>> >> >
>> >> > On Tue, May 16, 2017 at 5:59 AM, Joel Bernstein 
>> >> > wrote:
>> >> >
>> >> > > Your approach looks OK. The single-sharded worker collection is
>> >> > > only needed if you were using CloudSolrStream to send the initial
>> >> > > Streaming Expression to the /stream handler. You are not doing
>> >> > > this, so your approach is fine.
>> >> > > Here are some thoughts on what you described:
>> >> > >
>> >> > > 1) If you are closing the parallel stream after the top 1000
>> >> > > results, then try wrapping the intersect in a LimitStream. This
>> >> > > stream doesn't exist yet, so it will be a custom stream. The
>> >> > > LimitStream can return the EOF tuple after it reads N tuples. This
>> >> > > will cause the worker nodes to close the underlying stream and
>> >> > > cause the Broken Pipe exception to occur at the
>> >> > > /ex