Re: tipping point for using solrcloud—or not?

2017-10-02 Thread John Blythe
Nope, NRT is within seconds at most in several cases. Sounds like cloud
needs to be what we plan for.

Thanks!

On Mon, Oct 2, 2017 at 5:39 PM Erick Erickson 
wrote:

> Short form: Use SolrCloud from what you've described.
>
> NRT and M/S are simply oil and water. The _very_ best you can do when
> searching slaves is
> master's commit interval + slave polling interval + time to transmit
> the index to the slave + autowarming time on the slave.
>
> Now, that said, if by NRT you really mean "10 minutes is OK", then
> M/S will work for you.
>
> But otherwise I'd be using SolrCloud.
>
> Best,
> Erick
>
> On Mon, Oct 2, 2017 at 1:48 PM, John Blythe  wrote:
> > thanks for the responses, guys.
> >
> > erick: we do need NRT in several cases. also in need of HA pending where
> > the line is drawn. we do need it relatively speaking, i.e. w/i our user
> > base. if the largest of our cores falters then our business is completely
> > stopped till we can get everything reindexed.
> >
> > is there a general rule when it comes to query rate and efficiency
> between
> > Cloud and M/S? in either case we need to add complexity to the system so,
> > if it's a jump ball, that will be the thing that likely tips in favor.
> >
> > emir: the logs aren't write intensive. what are the core benefits to
> > splitting up the machine if there isn't a jvm load issue we're currently
> > experiencing?
> >
> > i can def provide more info that could help in the discussion. help me
> know
> > the best way / stuff to send if you can please.
> >
> > thanks again for the help guys-
> >
> > --
> > John Blythe
> >
> > On Fri, Sep 29, 2017 at 10:27 AM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> SolrCloud. SolrCloud. SolrCloud.
> >>
> >> Well, it actually depends. I recommend people go to SolrCloud when any
> >> of the following apply:
> >>
> >> > The instant you need to break any collection up into shards because
> >> you're running into the constraints of your hardware (you can't just
> keep
> >> adding memory to the JVM forever).
> >>
> >> > You need NRT searching and need multiple replicas for either your
> >> traffic rate or HA purposes.
> >>
> >> > You find yourself dealing with lots of administrative complexity for
> >> various indexes. You have what sounds like 6-10 cores laying around. You
> >> can move them to different machines without going to SolrCloud, but then
> >> something has to keep track of where they all are and route requests
> >> appropriately. If that gets onerous, SolrCloud will simplify it.
> >>
> >> If none of the above apply, master/slave is just fine. Since you can
> >> rebuild in a couple of hours, most of the difficulties with M/S when the
> >> master goes down are manageable. With a master and several slaves, you
> >> have HA, and a load balancer will see to it that some are used.
> >> There's no real need to exclusively search on the slaves, I've seen
> >> situations where the master is used for queries as well as indexing.
> >>
> >> To increase your query rate, you can just add more slaves to the hot
> >> index, assuming you're content with the latency between indexing and
> >> being able to search newly indexed documents.
> >>
> >> SolrCloud, of course, comes with the added complexity of ZooKeeper.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >>
> >> On Fri, Sep 29, 2017 at 5:34 AM, John Blythe 
> wrote:
> >> > hi all.
> >> >
> >> > complete noob as to solrcloud here. almost-non-noob on solr in
> general.
> >> >
> >> > we're experiencing growing pains in our data and am thinking through
> >> moving
> >> > to solrcloud as a result. i'm hoping to find out if it seems like a
> good
> >> > strategy or if we need to get other areas of interest handled first
> >> before
> >> > introducing new complexities.
> >> >
> >> > here's a rundown of things:
> >> > - we are on a 30g ram aws instance
> >> > - we have ~30g tucked away in the ../solr/server/ dir
> >> > - our largest core is 6.8g w/ ~25 segments at any given time. this is
> >> also
> >> > the core that our business directly runs off of, users interact with,
> >> etc.
> >> > - 5g is for a logs type of dataset that analytics can be built off of
> to
> >> > help inform the primary core above
> >> > - 3g are taken up by 3 different third party sources that we use solr
> to
> >> > warehouse and have available for query for the sake of linking items
> in
> >> our
> >> > primary core to these cores for data enrichment
> >> > - several others take up < 1g each
> >> > - and then we have dev- and demo- flavors for some of these
> >> >
> >> > we had been operating on a 16gb machine till a few weeks ago (actually
> >> > bumped while at lucene revolution bc i hadn't noticed how much we'd
> >> > outgrown the cache size's needs till the week before!). the load when
> >> doing
> >> > an import or running our heavier operations is much better and doesn't
> >> fall
> >> > under the weight of the operations like it had been doing.
> >> >
> > we have no master/slave replica.

Re: Is there a parsing issue with "OR NOT" or is something else going on? (Solr 6)

2017-10-02 Thread Erick Erickson
Solr does not implement (and never has implemented) pure boolean logic. See:
https://lucidworks.com/2011/12/28/why-not-and-or-and-not/

I think your second query is evaluated as though it were:

("batman" AND "indiana jones") OR (*:* -"cancer")

which is much closer to what you want.

Best,
Erick

On Mon, Oct 2, 2017 at 10:41 AM, Michael Joyner  wrote:
> Hello all,
>
> What is the difference between the following two queries that causes them to
> give different results? Is there a parsing issue with "OR NOT" or is
> something else going on?
>
> a) ("batman" AND "indiana jones") OR NOT ("cancer") /*only seems to match
> the AND clause*/
>
> parsedquery=BoostedQuery(boost(+(+((+((_text_ws:batman)^2.0 |
> (_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1)
> +((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 |
> (_text_txt_en_split:"indiana jone")^0.1)) -(+((_text_ws:cancer)^2.0 |
> (_text_txt:cancer)^0.5 | (_text_txt_en_split:cancer)^0.1
>
> b) ("batman" AND "indiana jones") OR (NOT ("cancer")) /*gives the results we
> expected*/
>
> parsedquery=BoostedQuery(boost(+(+((+((_text_ws:batman)^2.0 |
> (_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1)
> +((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 |
> (_text_txt_en_split:"indiana jone")^0.1)) (-(+((_text_ws:cancer)^2.0 |
> (_text_txt:cancer)^0.5 | (_text_txt_en_split:cancer)^0.1)) +*:*)^1.0))
>
> The first thing I notice is the '+*:*)^1.0' component in the 2nd query's
> parsedquery which is not in the 1st query's parsedquery response. The first
> query does not seem to be matching any of the "NOT" articles to include in
> the union of sets and is not giving us the expected results. Is wrapping
> "NOT" a general requirement when preceded by an operator?
>
> We are using SolrCloud 6.6 and are using q.op=AND with edismax.
>
> Thanks!
>
> -Michael/NewsRx
>
> Full debug outputs:
>
> {rawquerystring={!boost
> b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}{!edismax}(("batman" AND
> "indiana jones") OR NOT ("cancer")), querystring={!boost
> b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}{!edismax}(("batman" AND
> "indiana jones") OR NOT ("cancer")),
> parsedquery=BoostedQuery(boost(+(+((+((_text_ws:batman)^2.0 |
> (_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1)
> +((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 |
> (_text_txt_en_split:"indiana jone")^0.1)) -(+((_text_ws:cancer)^2.0 |
> (_text_txt:cancer)^0.5 |
> (_text_txt_en_split:cancer)^0.1,1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0))),
> parsedquery_toString=boost(+(+((+((_text_ws:batman)^2.0 |
> (_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1)
> +((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 |
> (_text_txt_en_split:"indiana jone")^0.1)) -(+((_text_ws:cancer)^2.0 |
> (_text_txt:cancer)^0.5 |
> (_text_txt_en_split:cancer)^0.1,1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0)),
> QParser=ExtendedDismaxQParser, altquerystring=null, boost_queries=null,
> parsed_boost_queries=[], boostfuncs=null,
> boost_str=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1),
> boost_parsed=org.apache.lucene.queries.function.valuesource.ReciprocalFloatFunction:1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0),
> filter_queries=[issuedate_tdt:[2000\-09\-18T04\:00\:00Z/DAY TO
> 2017\-10\-02T04\:00\:00Z/DAY+1DAY}, types_ss:(TrademarkApp OR Stockmarket OR
> AllClinicalTrials OR PressRelease OR Patent OR SEC OR Scholarly OR
> ClinicalTrial)], parsed_filter_queries=[+issuedate_tdt:[96924960 TO
> 150700320}, +(types_ss:TrademarkApp types_ss:Stockmarket
> types_ss:AllClinicalTrials types_ss:PressRelease types_ss:Patent
> types_ss:SEC types_ss:Scholarly types_ss:ClinicalTrial)]}
>
> {rawquerystring={!boost
> b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}{!edismax}(("batman" AND
> "indiana jones") OR (NOT ("cancer"))), querystring={!boost
> b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}{!edismax}(("batman" AND
> "indiana jones") OR (NOT ("cancer"))),
> parsedquery=BoostedQuery(boost(+(+((+((_text_ws:batman)^2.0 |
> (_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1)
> +((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 |
> (_text_txt_en_split:"indiana jone")^0.1)) (-(+((_text_ws:cancer)^2.0 |
> (_text_txt:cancer)^0.5 | (_text_txt_en_split:cancer)^0.1))
> +*:*)^1.0)),1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0))),
> parsedquery_toString=boost(+(+((+((_text_ws:batman)^2.0 |
> (_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1)
> +((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 |
> (_text_txt_en_split:"indiana jone")^0.1)) (-(+((_text_ws:cancer)^2.0 |
> (_text_txt:cancer)^0.5 | (_text_txt_en_split:cancer)^0.1))
> +*:*)^1.0)),1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0)),
> QParser=ExtendedDismaxQParser, altquerystring=null, boost_queries=null,
> parsed_boost_queries=[], boostfuncs=null, ...

Re: tipping point for using solrcloud—or not?

2017-10-02 Thread Erick Erickson
Short form: Use SolrCloud from what you've described.

NRT and M/S are simply oil and water. The _very_ best you can do when
searching slaves is
master's commit interval + slave polling interval + time to transmit
the index to the slave + autowarming time on the slave.
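
With illustrative numbers: a 60s commit interval + a 60s polling interval
+ 30s to transmit + 30s of autowarming = 180 seconds, i.e. roughly three
minutes behind the master, at best.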

Now, that said, if by NRT you really mean "10 minutes is OK", then
M/S will work for you.

But otherwise I'd be using SolrCloud.

Best,
Erick

On Mon, Oct 2, 2017 at 1:48 PM, John Blythe  wrote:
> thanks for the responses, guys.
>
> erick: we do need NRT in several cases. also in need of HA pending where
> the line is drawn. we do need it relatively speaking, i.e. w/i our user
> base. if the largest of our cores falters then our business is completely
> stopped till we can get everything reindexed.
>
> is there a general rule when it comes to query rate and efficiency between
> Cloud and M/S? in either case we need to add complexity to the system so,
> if it's a jump ball, that will be the thing that likely tips in favor.
>
> emir: the logs aren't write intensive. what are the core benefits to
> splitting up the machine if there isn't a jvm load issue we're currently
> experiencing?
>
> i can def provide more info that could help in the discussion. help me know
> the best way / stuff to send if you can please.
>
> thanks again for the help guys-
>
> --
> John Blythe
>
> On Fri, Sep 29, 2017 at 10:27 AM, Erick Erickson 
> wrote:
>
>> SolrCloud. SolrCloud. SolrCloud.
>>
>> Well, it actually depends. I recommend people go to SolrCloud when any
>> of the following apply:
>>
>> > The instant you need to break any collection up into shards because
>> you're running into the constraints of your hardware (you can't just keep
>> adding memory to the JVM forever).
>>
>> > You need NRT searching and need multiple replicas for either your
>> traffic rate or HA purposes.
>>
>> > You find yourself dealing with lots of administrative complexity for
>> various indexes. You have what sounds like 6-10 cores laying around. You
>> can move them to different machines without going to SolrCloud, but then
>> something has to keep track of where they all are and route requests
>> appropriately. If that gets onerous, SolrCloud will simplify it.
>>
>> If none of the above apply, master/slave is just fine. Since you can
>> rebuild in a couple of hours, most of the difficulties with M/S when the
>> master goes down are manageable. With a master and several slaves, you
>> have HA, and a load balancer will see to it that some are used.
>> There's no real need to exclusively search on the slaves, I've seen
>> situations where the master is used for queries as well as indexing.
>>
>> To increase your query rate, you can just add more slaves to the hot
>> index, assuming you're content with the latency between indexing and
>> being able to search newly indexed documents.
>>
>> SolrCloud, of course, comes with the added complexity of ZooKeeper.
>>
>> Best,
>> Erick
>>
>>
>>
>> On Fri, Sep 29, 2017 at 5:34 AM, John Blythe  wrote:
>> > hi all.
>> >
>> > complete noob as to solrcloud here. almost-non-noob on solr in general.
>> >
>> > we're experiencing growing pains in our data and am thinking through
>> moving
>> > to solrcloud as a result. i'm hoping to find out if it seems like a good
>> > strategy or if we need to get other areas of interest handled first
>> before
>> > introducing new complexities.
>> >
>> > here's a rundown of things:
>> > - we are on a 30g ram aws instance
>> > - we have ~30g tucked away in the ../solr/server/ dir
>> > - our largest core is 6.8g w/ ~25 segments at any given time. this is
>> also
>> > the core that our business directly runs off of, users interact with,
>> etc.
>> > - 5g is for a logs type of dataset that analytics can be built off of to
>> > help inform the primary core above
>> > - 3g are taken up by 3 different third party sources that we use solr to
>> > warehouse and have available for query for the sake of linking items in
>> our
>> > primary core to these cores for data enrichment
>> > - several others take up < 1g each
>> > - and then we have dev- and demo- flavors for some of these
>> >
>> > we had been operating on a 16gb machine till a few weeks ago (actually
>> > bumped while at lucene revolution bc i hadn't noticed how much we'd
>> > outgrown the cache size's needs till the week before!). the load when
>> doing
>> > an import or running our heavier operations is much better and doesn't
>> fall
>> > under the weight of the operations like it had been doing.
>> >
>> > we have no master/slave replica. all of our data is 'replicated' by the
>> > fact that it exists in mysql. if solr were to go down it'd be a nice big
>> > fire but one we could recover from within a couple hours by simply
>> > reimporting.
>> >
>> > i'd like to have a more sophisticated set up in place for fault tolerance
>> > than that, of course. i'd also like to see our heavy, many-query based
> > operations be speedier and better capable of handling multi-threaded runs
> > at once w/ ease.

Re: Keeping the index naturally ordered by some field

2017-10-02 Thread Erick Erickson
Have you looked at Streaming and Streaming Expressions? This is pretty
much what they were built for. Since you're talking a billion
documents, you're probably sharding anyway, in which case I'd guess
you're using SolrCloud.

That's what I'd be using first if at all possible.
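
For example, a minimal streaming expression sketch (collection and field
names are made up) that streams the whole result set back sorted via the
/export handler, which needs docValues on the fields involved:

search(myCollection,
       q="*:*",
       fl="id,timestamp_dt",
       sort="timestamp_dt asc",
       qt="/export")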

Best,
Erick

On Mon, Oct 2, 2017 at 3:15 PM, alexpusch  wrote:
> The reason I'm interested in this is kind of unique. I'm writing a custom
> query parser and search component. These components go over the search
> results and perform some calculation over it. This calculation depends on
> input sorted by a certain value. In this scenario, regular solr sorting is
> insufficient as it's performed in post-search, and only collects needed rows
> to satisfy the query. The alternative to a naturally sorted index is to sort
> all the docs myself, and I wish to avoid this. I use docValues extensively,
> it really is a great help.
>
> Erick, I've tried using SortingMergePolicyFactory. It brings me close to my
> goal, but it's not quite there. The problem with this approach is that while
> each segment is sorted by itself there might be overlapping in ranges
> between the segments. For example, let's say that some query results lie in
> segments A, B, and C. Each one of the segments is sorted, so the docs coming
> from segment A will be sorted in the range 0-50, docs coming from segment B
> will be sorted in the range 20-70, and segment C will hold values in the
> 50-90 range. The query result will be 0-50, 20-70, 50-90. Almost sorted, but
> not quite there.
>
> A helpful detail about my data is that the field I'm interested in sorting
> the index by is a timestamp. Docs are indexed more or less in the correct
> order. As a result, if the merge policy I'm using will merge only
> consecutive segments, it should satisfy my need. TieredMergePolicy does
> merge non-consecutive segments so it's clearly a bad fit. I'm hoping to get
> some insight about some additional steps I may take so that
> SortingMergePolicyFactory could achieve perfection.
>
> Thanks!
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Authentication error : request has come without principal. failed permission

2017-10-02 Thread Shamik Bandopadhyay
Hi,

  I'm seeing this random authentication failure in our SolrCloud cluster,
which is eventually rendering the nodes in a "down" state. This doesn't seem
to have a pattern; it just starts to happen out of the blue. I've 2 shards,
each having two replicas. They are using the Solr basic authentication plugin.

Here's the error log:

org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147)
- request has come without principal. failed permission {
  "name":"select",
  "collection":"knowledge",
  "path":"/select",
  "role":[
"admin",
"dev",
"read"]}

org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
USER_REQUIRED auth header null context : userPrincipal: [null] type:
[READ], collections: [knowledge,], Path: [/select] path : /select params
:q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2

It eventually throws a ZooKeeper session timeout and disappears from the
cluster.

org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1156) - *Client
session timed out, have not heard from server in 229984ms for sessionid
0x35ec984bea00016, closing socket connection and attempting reconnect*

If I restart the node, it goes into recovery mode, but at the same time,
the other healthy replica starts throwing the authentication error and
eventually spirals into the downed state. This happens across all the nodes
till everyone has gone through one restart cycle.

Here are a couple of other exceptions I've seen in the log:

org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1539)
- Exception while writing response for params:
generation=14327&qt=/replication&file=_1cww.fdt&offset=127926272&checksum=true&wt=filestream&command=filecontent
java.io.IOException: java.util.concurrent.TimeoutException: *Idle timeout
expired: 50001/5 ms*
at
org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:219)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:220)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491)
at
org.apache.commons.io.output.ProxyOutputStream.write(ProxyOutputStream.java:90)
at
org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:213)
at
org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:83)
at
org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1520)
at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2601)
at
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49)
at
org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:809)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:538)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)


org.apache.solr.security.PKIAuthenticationPlugin.parseCipher(PKIAuthenticationPlugin.java:175)
- *Decryption failed , key must be wrong*
java.security.InvalidKeyException: No installed provider supports this key:
(null)
at javax.crypto.Cipher.chooseProvider(Cipher.java:893)
at javax.crypto.Cipher.init(Cipher.java:1249)
at javax.crypto.Cipher.init(Cipher.java:1186)
at org.apache.solr.util.CryptoKeys.decryptRSA(CryptoKeys.java:277)
at
org.apache.solr.security.PKIAuthenticationPlugin.parseCipher(PKIAuthenticationPlugin.java:173)
at
org.apache.solr.security.PKIAuthenticationPlugin.decipherHeader(PKIAuthenticationPlugin.java:160)
at
org.apache.solr.security.PKIAuthenticationPlugin.doAuthenticate(PKIAuthenticationPlugin.java:118)
at
org.apache.solr.servlet.SolrDispatchFilter.authenticateRequest(SolrDispatchFilter.java:430)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)


Re: Keeping the index naturally ordered by some field

2017-10-02 Thread alexpusch
The reason I'm interested in this is kind of unique. I'm writing a custom
query parser and search component. These components go over the search
results and perform some calculation over it. This calculation depends on
input sorted by a certain value. In this scenario, regular solr sorting is
insufficient as it's performed in post-search, and only collects needed rows
to satisfy the query. The alternative to a naturally sorted index is to sort
all the docs myself, and I wish to avoid this. I use docValues extensively,
it really is a great help.

Erick, I've tried using SortingMergePolicyFactory. It brings me close to my
goal, but it's not quite there. The problem with this approach is that while
each segment is sorted by itself there might be overlapping in ranges
between the segments. For example, let's say that some query results lie in
segments A, B, and C. Each one of the segments is sorted, so the docs coming
from segment A will be sorted in the range 0-50, docs coming from segment B
will be sorted in the range 20-70, and segment C will hold values in the
50-90 range. The query result will be 0-50, 20-70, 50-90. Almost sorted, but
not quite there. 

A helpful detail about my data is that the field I'm interested in sorting
the index by is a timestamp. Docs are indexed more or less in the correct
order. As a result, if the merge policy I'm using will merge only
consecutive segments, it should satisfy my need. TieredMergePolicy does
merge non-consecutive segments so it's clearly a bad fit. I'm hoping to get
some insight about some additional steps I may take so that 
SortingMergePolicyFactory could achieve perfection. 
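
For reference, the relevant piece of solrconfig.xml looks roughly like this
(simplified; the timestamp field name is made up):

<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
  <str name="sort">timestamp_dt asc</str>
  <str name="wrapped.prefix">inner</str>
  <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
</mergePolicyFactory>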

Thanks!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: tipping point for using solrcloud—or not?

2017-10-02 Thread John Blythe
thanks for the responses, guys.

erick: we do need NRT in several cases. also in need of HA pending where
the line is drawn. we do need it relatively speaking, i.e. w/i our user
base. if the largest of our cores falters then our business is completely
stopped till we can get everything reindexed.

is there a general rule when it comes to query rate and efficiency between
Cloud and M/S? in either case we need to add complexity to the system so,
if it's a jump ball, that will be the thing that likely tips in favor.

emir: the logs aren't write intensive. what are the core benefits to
splitting up the machine if there isn't a jvm load issue we're currently
experiencing?

i can def provide more info that could help in the discussion. help me know
the best way / stuff to send if you can please.

thanks again for the help guys-

--
John Blythe

On Fri, Sep 29, 2017 at 10:27 AM, Erick Erickson 
wrote:

> SolrCloud. SolrCloud. SolrCloud.
>
> Well, it actually depends. I recommend people go to SolrCloud when any
> of the following apply:
>
> > The instant you need to break any collection up into shards because
> you're running into the constraints of your hardware (you can't just keep
> adding memory to the JVM forever).
>
> > You need NRT searching and need multiple replicas for either your
> traffic rate or HA purposes.
>
> > You find yourself dealing with lots of administrative complexity for
> various indexes. You have what sounds like 6-10 cores laying around. You
> can move them to different machines without going to SolrCloud, but then
> something has to keep track of where they all are and route requests
> appropriately. If that gets onerous, SolrCloud will simplify it.
>
> If none of the above apply, master/slave is just fine. Since you can
> rebuild in a couple of hours, most of the difficulties with M/S when the
> master goes down are manageable. With a master and several slaves, you
> have HA, and a load balancer will see to it that some are used.
> There's no real need to exclusively search on the slaves, I've seen
> situations where the master is used for queries as well as indexing.
>
> To increase your query rate, you can just add more slaves to the hot
> index, assuming you're content with the latency between indexing and
> being able to search newly indexed documents.
>
> SolrCloud, of course, comes with the added complexity of ZooKeeper.
>
> Best,
> Erick
>
>
>
> On Fri, Sep 29, 2017 at 5:34 AM, John Blythe  wrote:
> > hi all.
> >
> > complete noob as to solrcloud here. almost-non-noob on solr in general.
> >
> > we're experiencing growing pains in our data and am thinking through
> moving
> > to solrcloud as a result. i'm hoping to find out if it seems like a good
> > strategy or if we need to get other areas of interest handled first
> before
> > introducing new complexities.
> >
> > here's a rundown of things:
> > - we are on a 30g ram aws instance
> > - we have ~30g tucked away in the ../solr/server/ dir
> > - our largest core is 6.8g w/ ~25 segments at any given time. this is
> also
> > the core that our business directly runs off of, users interact with,
> etc.
> > - 5g is for a logs type of dataset that analytics can be built off of to
> > help inform the primary core above
> > - 3g are taken up by 3 different third party sources that we use solr to
> > warehouse and have available for query for the sake of linking items in
> our
> > primary core to these cores for data enrichment
> > - several others take up < 1g each
> > - and then we have dev- and demo- flavors for some of these
> >
> > we had been operating on a 16gb machine till a few weeks ago (actually
> > bumped while at lucene revolution bc i hadn't noticed how much we'd
> > outgrown the cache size's needs till the week before!). the load when
> doing
> > an import or running our heavier operations is much better and doesn't
> fall
> > under the weight of the operations like it had been doing.
> >
> > we have no master/slave replica. all of our data is 'replicated' by the
> > fact that it exists in mysql. if solr were to go down it'd be a nice big
> > fire but one we could recover from within a couple hours by simply
> > reimporting.
> >
> > i'd like to have a more sophisticated set up in place for fault tolerance
> > than that, of course. i'd also like to see our heavy, many-query based
> > operations be speedier and better capable of handling multi-threaded runs
> > at once w/ ease.
> >
> > is this a matter of getting still more ram on the machine? cpus for
> faster
> > processing? splitting up the read/write operations between master/slave?
> > going full steam into a solrcloud configuration?
> >
> > one more note. per discussion at the conference i'm combing through our
> > configs to make sure we trim any fat we can. also wanting to get
> > optimization scheduled more regularly to help out w segmentation and
> > garbage heap. not sure how far those two alone will get us, though.
> >
> > thanks for any thoughts!
> >
> > --
> > John Blythe

Re: solr.log rotation

2017-10-02 Thread Shawn Heisey
On 10/2/2017 8:39 AM, Noriyuki TAKEI wrote:
> Hi, all
>
> When I restart Solr Service, solr.log is rotated as below.
>
> solr.log.1
> solr.log.2
> solr.log.3
> ...
>
> I would like to stop this rotation.

To keep Solr startup from rotating the logfile, you'll need to edit the
bin/solr or bin\solr.cmd script and remove the logfile renaming.

That is not the only thing that rotates the logfile, though.  The
default log4j.properties file that comes with Solr rotates when solr.log
reaches 4 megabytes.  See the log4j documentation for details about how
to change that configuration.
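
For example, the relevant defaults in server/resources/log4j.properties look
roughly like this; adjust MaxFileSize and MaxBackupIndex to change or
effectively disable the size-based rotation:

log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.MaxFileSize=4MB
log4j.appender.file.MaxBackupIndex=9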

Thanks,
Shawn



Is there a parsing issue with "OR NOT" or is something else going on? (Solr 6)

2017-10-02 Thread Michael Joyner

Hello all,

What is the difference between the following two queries that causes 
them to give different results? Is there a parsing issue with "OR NOT" 
or is something else going on?


a) ("batman" AND "indiana jones") OR NOT ("cancer") /*only seems to 
match the and clause*/


parsedquery=BoostedQuery(boost(+(+((+((_text_ws:batman)^2.0 | 
(_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1) 
+((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 | 
(_text_txt_en_split:"indiana jone")^0.1)) -(+((_text_ws:cancer)^2.0 | 
(_text_txt:cancer)^0.5 | (_text_txt_en_split:cancer)^0.1


b) ("batman" AND "indiana jones") OR (NOT ("cancer")) /*gives the 
results we expected*/


parsedquery=BoostedQuery(boost(+(+((+((_text_ws:batman)^2.0 | 
(_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1) 
+((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 | 
(_text_txt_en_split:"indiana jone")^0.1)) (-(+((_text_ws:cancer)^2.0 | 
(_text_txt:cancer)^0.5 | (_text_txt_en_split:cancer)^0.1)) +*:*)^1.0))


The first thing I notice is the '+*:*)^1.0' component in the 2nd query's
parsedquery which is not in the 1st query's parsedquery response. The 
first query does not seem to be matching any of the "NOT" articles to 
include in the union of sets and is not giving us the expected results. 
Is wrapping "NOT" a general requirement when preceded by an operator?


We are using SolrCloud 6.6 and are using q.op=AND with edismax.

Thanks!

-Michael/NewsRx

Full debug outputs:

{rawquerystring={!boost 
b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}{!edismax}(("batman" AND 
"indiana jones") OR NOT ("cancer")), querystring={!boost 
b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}{!edismax}(("batman" AND 
"indiana jones") OR NOT ("cancer")), 
parsedquery=BoostedQuery(boost(+(+((+((_text_ws:batman)^2.0 | 
(_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1) 
+((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 | 
(_text_txt_en_split:"indiana jone")^0.1)) -(+((_text_ws:cancer)^2.0 | 
(_text_txt:cancer)^0.5 | 
(_text_txt_en_split:cancer)^0.1,1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0))), 
parsedquery_toString=boost(+(+((+((_text_ws:batman)^2.0 | 
(_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1) 
+((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 | 
(_text_txt_en_split:"indiana jone")^0.1)) -(+((_text_ws:cancer)^2.0 | 
(_text_txt:cancer)^0.5 | 
(_text_txt_en_split:cancer)^0.1,1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0)), 
QParser=ExtendedDismaxQParser, altquerystring=null, boost_queries=null, 
parsed_boost_queries=[], boostfuncs=null, 
boost_str=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1), 
boost_parsed=org.apache.lucene.queries.function.valuesource.ReciprocalFloatFunction:1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0), 
filter_queries=[issuedate_tdt:[2000\-09\-18T04\:00\:00Z/DAY TO 
2017\-10\-02T04\:00\:00Z/DAY+1DAY}, types_ss:(TrademarkApp OR 
Stockmarket OR AllClinicalTrials OR PressRelease OR Patent OR SEC OR 
Scholarly OR ClinicalTrial)], 
parsed_filter_queries=[+issuedate_tdt:[96924960 TO 150700320}, 
+(types_ss:TrademarkApp types_ss:Stockmarket types_ss:AllClinicalTrials 
types_ss:PressRelease types_ss:Patent types_ss:SEC types_ss:Scholarly 
types_ss:ClinicalTrial)]}


{rawquerystring={!boost 
b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}{!edismax}(("batman" AND 
"indiana jones") OR (NOT ("cancer"))), querystring={!boost 
b=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1)}{!edismax}(("batman" AND 
"indiana jones") OR (NOT ("cancer"))), 
parsedquery=BoostedQuery(boost(+(+((+((_text_ws:batman)^2.0 | 
(_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1) 
+((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 | 
(_text_txt_en_split:"indiana jone")^0.1)) (-(+((_text_ws:cancer)^2.0 | 
(_text_txt:cancer)^0.5 | (_text_txt_en_split:cancer)^0.1)) 
+*:*)^1.0)),1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0))), 
parsedquery_toString=boost(+(+((+((_text_ws:batman)^2.0 | 
(_text_txt:batman)^0.5 | (_text_txt_en_split:batman)^0.1) 
+((_text_ws:"indiana jones")^2.0 | (_text_txt:"indiana jones")^0.5 | 
(_text_txt_en_split:"indiana jone")^0.1)) (-(+((_text_ws:cancer)^2.0 | 
(_text_txt:cancer)^0.5 | (_text_txt_en_split:cancer)^0.1)) 
+*:*)^1.0)),1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0)), 
QParser=ExtendedDismaxQParser, altquerystring=null, boost_queries=null, 
parsed_boost_queries=[], boostfuncs=null, 
boost_str=recip(ms(NOW/DAY,issuedate_tdt),3.16e-11,1,1), 
boost_parsed=org.apache.lucene.queries.function.valuesource.ReciprocalFloatFunction:1.0/(3.16E-11*float(ms(const(150691680),date(issuedate_tdt)))+1.0), 
filter_queries=[issuedate_tdt:[2000\-09\-18T04\:00\:00Z/DAY TO 
2017\-10\-02T04\:00\:00Z/DAY+1DAY}, types_ss:(TrademarkApp OR 
Stockmarket OR AllClinicalTrials OR PressRelease OR Patent OR SEC OR 
Scholarly OR ClinicalTrial)],
parsed_filter_queries=[+issuedate_tdt:[96924960 TO 150700320},
+(types_ss:TrademarkApp types_ss:Stockmarket types_ss:AllClinicalTrials
types_ss:PressRelease types_ss:Patent types_ss:SEC types_ss:Scholarly
types_ss:ClinicalTrial)]}

RE: solr.log rotation

2017-10-02 Thread Noriyuki TAKEI
Thanks for your quick reply!!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: solr cloud without hard commit?

2017-10-02 Thread Erick Erickson
ramBufferSizeMB limits the amount of memory used for indexing and when
it's exceeded the buffer is flushed to disk into a new segment. This
is independent of hard/soft commits.

Soft commits do not _force_ the in-memory structures to be written to
a segment and do not update the segments file; hard commits do both of
these things. For the purposes of that blog I decided it just leads to
confusion to mention segments being created wrt soft commits; they're
not "real" until a hard commit.

bq: If the indexing rate is high, there's not really much difference
between soft commits and hard commits

Not necessarily true. Soft commits open a new searcher and do all the
autowarming, whereas hard commits with openSearcher=false do not. This
can be a _very_ expensive operation.

Second difference: the segments file is not updated unless you hard
commit. So even though I have a bunch of segments written to disk, if
I crash my server they won't be found on restart. Graceful shutdown is
a different story.

At least that's my understanding.
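
A typical solrconfig.xml arrangement reflecting that split is sketched below
(the intervals are illustrative): frequent soft commits for visibility, less
frequent hard commits with openSearcher=false for durability.

<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit: flush + update segments file -->
  <openSearcher>false</openSearcher>  <!-- ...but don't open a new searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>             <!-- soft commit: new searcher, docs become visible -->
</autoSoftCommit>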

Best,
Erick



On Mon, Oct 2, 2017 at 2:49 AM, alessandro.benedetti
 wrote:
> Hi Erick,
> you said :
> ""mentions that for soft commit, "new segments are created that will
> be merged""
>
> Wait, how did that get in there? Ignore it, I'm taking it out. "
>
> but I think you were not wrong, based on another mailing list thread message
> by Shawn, I read :
> [1]
>
> "If you are using the correct DirectoryFactory type, a soft commit has
> the *possibility* of not writing to disk, but the amount of memory
> reserved is fairly small.
>
> Looking into the source code for NRTCachingDirectoryFactory, I see that
> maxMergeSizeMB defaults to 4, and maxCachedMB defaults to 48.  This is a
> little bit different than what the javadoc states for
> NRTCachingDirectory (5 and 60):
>
> http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/store/NRTCachingDirectory.html
>
> The way I read this, assuming the amount of segment data created is
> small, only the first few soft commits will be entirely handled in
> memory.  After that, older segments must be flushed to disk to make room
> for new ones.
>
> If the indexing rate is high, there's not really much difference between
> soft commits and hard commits.  This also assumes that you have left the
> directory at the default of NRTCachingDirectoryFactory.  If this has
> been changed, then there is no caching in RAM, and soft commit probably
> behaves *exactly* the same as hard commit.
> "
>
> [1]
> http://lucene.472066.n3.nabble.com/High-disk-write-usage-td4344356.html#a4344551
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


search request audit logging

2017-10-02 Thread Michal Hlavac
Hi,

I would like to ask how to implement search audit logging. I've implemented
one idea, but I would like to ask if there is a better approach.

The requirement is to log the username, search time, all request parameters
(q, fq, etc.), response data (count, etc.) and, importantly, all errors.

As I need it only for search requests, I implemented a custom SearchHandler
with something like:

import org.apache.solr.handler.component.SearchHandler;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

public class AuditSearchHandler extends SearchHandler {

    @Override
    public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
        try {
            super.handleRequest(req, rsp);
        } finally {
            // runs for both successful and failed requests
            doAuditLog(req, rsp);
        }
    }

    private void doAuditLog(SolrQueryRequest req, SolrQueryResponse rsp) {
        // sketch: req.getUserPrincipal(), req.getParamString() and rsp.getException()
        // carry the user, the parameters and any error; write them to your audit sink
    }
}

A custom SearchComponent is not an option, because it can't handle all errors.

I also read [1], where a custom servlet filter was mentioned, but I didn't
find an example of how to add a servlet filter to Solr in a proper way, or
whether it is OK to edit web.xml.

thanks for suggestions, m.



[1] 
http://lucene.472066.n3.nabble.com/Solr-request-response-lifecycle-and-logging-full-response-time-td4006044.html


RE: solr.log rotation

2017-10-02 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
My guess would be to edit server/resources/log4j.properties to have

log4j.appender.file.MaxBackupIndex=0

-Original Message-
From: Noriyuki TAKEI [mailto:nta...@sios.com] 
Sent: Monday, October 02, 2017 10:39 AM
To: solr-user@lucene.apache.org
Subject: solr.log rotation

Hi, all

When I restart Solr Service, solr.log is rotated as below.

solr.log.1
solr.log.2
solr.log.3
...

I would like to stop this rotation.

Do you have any idea?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


solr.log rotation

2017-10-02 Thread Noriyuki TAKEI
Hi, all

When I restart Solr Service, solr.log is rotated as below.

solr.log.1
solr.log.2
solr.log.3
...

I would like to stop this rotation.

Do you have any idea?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


ANNOUNCE: Apache Solr Reference Guide for 7.0 released

2017-10-02 Thread Cassandra Targett
The Lucene PMC is pleased to announce that the Solr Reference Guide
for 7.0 is now available.

This 1,035-page PDF is the definitive guide to using Apache Solr, the
search server built on Apache Lucene.

The Guide can be downloaded from:
https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-7.0.pdf.

It is also available online at https://lucene.apache.org/solr/guide/7_0.

Included in this release is documentation for the major new features
released in Solr 7.0, with an extensive list of configuration changes
and deprecations you should be aware of while upgrading.

Cassandra


Re: Solr 7 default Response now JSON instead of XML causing issues

2017-10-02 Thread Emir Arnautović
Hi Roland,
I guess you can use defaults in solr config to set wt to XML. Something like:

<lst name="defaults">
  <str name="wt">xml</str>
</lst>

You can also use useParams=“xml_out” and, in your params.json, define a param
group “xml_out” with wt: “xml”.
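
A single request can also ask for XML explicitly, e.g.:

http://localhost:8983/solr/admin/cores?action=STATUS&wt=xml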

HTH,
Emir

> On 2 Oct 2017, at 13:58, Roland Villemoes  wrote:
> 
> Hi
> 
> Default response in Solr 7 is now JSON instead of XML 
> (https://issues.apache.org/jira/browse/SOLR-10494)
> 
> We are using a system that uses the Solr admin/cores API for core status etc.
> and we can't really change that system. That system expects the XML response. 
> And as far as I can see default also changed to JSON there.
> 
> So:
> 
> Is there any way I can change the admin/cores API back to responding with
> XML instead of JSON?
> 
> 
> /Roland Villemoes



Solr 7 default Response now JSON instead of XML causing issues

2017-10-02 Thread Roland Villemoes
Hi

Default response in Solr 7 is now JSON instead of XML 
(https://issues.apache.org/jira/browse/SOLR-10494)

We are using a system that uses the Solr admin/cores API for core status etc.
and we can't really change that system. That system expects the XML response. 
And as far as I can see default also changed to JSON there.

So:

Is there any way I can change the admin/cores API back to responding with XML
instead of JSON?


/Roland Villemoes


Re: How to Index JSON field Solr 5.3.2

2017-10-02 Thread Emir Arnautović
Hi Sharma,
I guess you are looking for nested documents:
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#UploadingDatawithIndexHandlers-NestedChildDocuments

It seems DIH supports it since version 5.1:
https://issues.apache.org/jira/browse/SOLR-5147 
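
A minimal sketch of a nested (parent/child) document in the XML update format,
with child field names taken from the sample data:

<add>
  <doc>
    <field name="md5">376463475574058bba96395bfb87</field>
    <doc>
      <field name="file_id">1321241</field>
      <field name="license">Apache 2.0</field>
    </doc>
  </doc>
</add>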


Regards,
Emir

> On 2 Oct 2017, at 10:50, Deeksha Sharma  wrote:
> 
> Hi everyone,
> 
> 
> I have created a core and index data in Solr using dataImportHandler.
> 
> 
> The schema for the core looks like this:
> 
> <field name="md5" ... required="true"/>
>
> <field name="rules" ... required="true"/>
> 
> 
> 
> This is my data in mysql database:
> 
> 
> md5:"376463475574058bba96395bfb87"
> 
> 
> rules: 
> {"fileRules":[{"file_id":1321241,"md5":"376463475574058bba96395bfb87","group_id":69253,"filecdata1":{"versionId":3382858,"version":"1.2.1","detectionNotes":"Generated
>  from Ibiblio Maven2, see URL 
> (http://maven.ibiblio.org/maven2/sk/seges/acris/acris-security-hibernate).","texts":[{"shortText":null,"header":"Sample
>  from URL 
> (http://maven.ibiblio.org/maven2/sk/seges/acris/acris-os-parent/1.2.1/acris-os-parent-1.2.1.pom)","text":"
>The Apache Software License, Version 2.0
>http://www.apache.org/licenses/LICENSE-2.0.txt
>repo
> "}],"notes":[],"forge":"Ibiblio 
> Maven2"}}],"groupRules":[{"group_id":69253,"parent":-1,"component":"sk.seges.acris/acris-security-hibernate
>  - AcrIS Security with Hibernate metadata","license":"Apache 
> 2.0","groupcdata1":{"componentId":583560,"title":"sk.seges.acris/acris-security-hibernate
>  - Ibiblio 
> Maven2","licenseIds":[20],"priority":3,"url":"http://maven.ibiblio.org/maven2/sk/seges/acris/acris-security-hibernate","displayName":"AcrIS
>  Security with Hibernate 
> metadata","description":null,"texts":[],"notes":[],"forge":"Ibiblio 
> Maven2"}}]}
> 
> Query results from Solr:
> 
> { "responseHeader":{ "status":0, "QTime":0, "params":{ 
> "q":"md5:03bb576a6b6e001cd94e91ad4c29", "indent":"on", "wt":"json", 
> "_":"1506933082656"}}, "response":{"numFound":1,"start":0,"docs":[ { 
> "rules":"{\"fileRules\":[{\"file_id\":7328190,\"md5\":\"03bb576a6b6e001cd94e91ad4c29\",\"group_id\":241307,\"filecdata1\":{\"versionId\":15761972,\"version\":\"1.0.2\",\"detectionNotes\":null,\"texts\":[{\"shortText\":null,\"header\":\"The
>  following text is found at URL 
> (https://www.nuget.org/packages/HangFire.Redis/1.0.2)\",\"text\":\"License 
> details:\nLGPL-3.0\"}],\"notes\":[],\"forge\":\"NuGet 
> Gallery\"}}],\"groupRules\":[{\"group_id\":241307,\"parent\":-1,\"component\":\"HangFire.Redis\",\"license\":\"LGPL
>  
> 3.0\",\"groupcdata1\":{\"componentId\":3524318,\"title\":null,\"licenseIds\":[216],\"priority\":1,\"url\":\"https://www.nuget.org/packages/HangFire.Redis\",\"displayName\":\"Hangfire
>  Redis Storage [DEPRECATED]\",\"description\":\"DEPRECATED -- DO NOT INSTALL 
> OR UPDATE. Now shipped with Hangfire Pro, please read the \"Project site\" 
> (http://odinserj.net/2014/11/15/hangfire-pro/) for more 
> information.\",\"texts\":[{\"shortText\":null,\"header\":\"License details 
> history:\n(Refer to https://www.nuget.org/packages/HangFire.Redis and select 
> the desired version for more information)\",\"text\":\"LGPL-3.0 - (for 
> HangFire.Redis versions 0.7.0, 0.7.1, 0.7.3, 0.7.4, 0.7.5, 0.8.0, 0.8.1, 
> 0.8.2, 0.8.3, 0.9.0, 0.9.1, 1.0.1, 1.0.0, 1.0.2)\nNo information - (for 
> HangFire.Redis versions 1.1.1, 2.0.1, 
> 2.0.0)\"}],\"notes\":[{\"header\":null,\"text\":\"Project Site: 
> http://odinserj.net/2014/11/15/hangfire-pro\"},{\"header\":\"Previous Project 
> Sites\",\"text\":\"https://github.com/odinserj/HangFire - (for Hangfire Redis 
> Storage [DEPRECATED] version 0.7.0)\nhttp://hangfire.io - (for Hangfire Redis 
> Storage [DEPRECATED] versions 0.7.1, 0.7.3, 0.7.4, 0.7.5, 0.8.0, 0.8.1, 
> 0.8.2, 0.8.3, 0.9.0, 0.9.1, 1.0.1, 1.0.0, 1.0.2, 1.1.1)\nNo information - 
> (for Hangfire Redis Storage [DEPRECATED] versions 2.0.1, 
> 2.0.0)\"},{\"header\":\"License 
> links\",\"text\":\"https://raw.github.com/odinserj/HangFire/master/COPYING.LESSER
>  - (for HangFire.Redis version 
> 0.7.0)\nhttps://raw.github.com/odinserj/HangFire/master/LICENSE.md - (for 
> HangFire.Redis versions 0.7.1, 0.7.3, 0.7.4, 0.7.5, 0.8.0, 0.8.1, 0.8.2, 
> 0.8.3, 0.9.0, 
> 0.9.1)\nhttps://raw.github.com/odinserj/Hangfire/master/LICENSE.md - (for 
> HangFire.Redis versions 1.0.1, 1.0.0, 
> 1.0.2)\nhttps://raw.github.com/HangfireIO/Hangfire/master/LICENSE.md - (for 
> HangFire.Redis version 1.1.1)\nNo information - (for HangFire.Redis versions 
> 2.0.1, 2.0.0)\"}],\"forge\":\"NuGet Gallery\"}}]}", 
> "md5":"03bb576a6b6e001cd94e91ad4c29", "_version_":1579807444777828352}] }}
> 
> 
> 
> Now when I receive the results from a Solr query, it returns the String for
> rules. How can I tell Solr to index rules as JSON and return valid JSON
> instead of an escaped String?

Re: solr cloud without hard commit?

2017-10-02 Thread alessandro.benedetti
Hi Erick,
you said :
""mentions that for soft commit, "new segments are created that will 
be merged"" 

Wait, how did that get in there? Ignore it, I'm taking it out. "

but I think you were not wrong. In another mailing list thread message
by Shawn, I read [1]:

"If you are using the correct DirectoryFactory type, a soft commit has 
the *possibility* of not writing to disk, but the amount of memory 
reserved is fairly small. 

Looking into the source code for NRTCachingDirectoryFactory, I see that 
maxMergeSizeMB defaults to 4, and maxCachedMB defaults to 48.  This is a 
little bit different than what the javadoc states for 
NRTCachingDirectory (5 and 60): 

http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/store/NRTCachingDirectory.html

The way I read this, assuming the amount of segment data created is 
small, only the first few soft commits will be entirely handled in 
memory.  After that, older segments must be flushed to disk to make room 
for new ones. 

If the indexing rate is high, there's not really much difference between 
soft commits and hard commits.  This also assumes that you have left the 
directory at the default of NRTCachingDirectoryFactory.  If this has 
been changed, then there is no caching in RAM, and soft commit probably 
behaves *exactly* the same as hard commit. 
"

[1]
http://lucene.472066.n3.nabble.com/High-disk-write-usage-td4344356.html#a4344551
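
For reference, a sketch of where those two knobs live in solrconfig.xml, with
the default values mentioned above:

<directoryFactory name="DirectoryFactory" class="solr.NRTCachingDirectoryFactory">
  <double name="maxMergeSizeMB">4</double>
  <double name="maxCachedMB">48</double>
</directoryFactory>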



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: tipping point for using solrcloud—or not?

2017-10-02 Thread Emir Arnautović
Hi John,
Your data volume does not require SolrCloud, especially if you isolate the core
that is related to your business from the other cores. You mentioned that the
second largest is a logs core used for analytics; not sure what sort of logs,
but if it is write-intensive logging, you might want to isolate it. It is
probably better to have two 15GB instances than one 30GB and dedicate one
instance to your main core. If you do not see the size going up in the near
future, you can go with an even smaller one. It may also be better to invest
some money into instances with SSDs. You may consider sending logs to a
centralised logging solution (one such is our Logsene:
http://sematext.com/logsene).
When it comes to FT, you can still have it with the M/S model by introducing
slaves. That can also be one way to isolate the cores your users are facing:
they will query only slaves, and the only replicated core will be the main core.
It is hard to tell more without knowing your ingestion/query rate, query types, 
NRT requirements…

HTH,
Emir

> On 29 Sep 2017, at 17:27, Erick Erickson  wrote:
> 
> SolrCloud. SolrCloud. SolrCloud.
> 
> Well, it actually depends. I recommend people go to SolrCloud when any
> of the following apply:
> 
>> The instant you need to break any collection up into shards because you're 
>> running into the constraints of your hardware (you can't just keep adding 
>> memory to the JVM forever).
> 
>> You need NRT searching and need multiple replicas for either your traffic 
>> rate or HA purposes.
> 
>> You find yourself dealing with lots of administrative complexity for various 
>> indexes. You have what sounds like 6-10 cores laying around. You can move 
>> them to different machines without going to SolrCloud, but then something 
>> has to keep track of where they all are and route requests appropriately. If 
>> that gets onerous, SolrCloud will simplify it.
> 
> If none of the above apply, master/slave is just fine. Since you can
> rebuild in a couple of hours, most of the difficulties with M/S when the
> master goes down are manageable. With a master and several slaves, you
> have HA, and a load balancer will see to it that some are used.
> There's no real need to exclusively search on the slaves, I've seen
> situations where the master is used for queries as well as indexing.
> 
> To increase your query rate, you can just add more slaves to the hot
> index, assuming you're content with the latency between indexing and
> being able to search newly indexed documents.
> 
> SolrCloud, of course, comes with the added complexity of ZooKeeper.
> 
> Best,
> Erick
> 
> 
> 
> On Fri, Sep 29, 2017 at 5:34 AM, John Blythe  wrote:
>> hi all.
>> 
>> complete noob as to solrcloud here. almost-non-noob on solr in general.
>> 
>> we're experiencing growing pains in our data and am thinking through moving
>> to solrcloud as a result. i'm hoping to find out if it seems like a good
>> strategy or if we need to get other areas of interest handled first before
>> introducing new complexities.
>> 
>> here's a rundown of things:
>> - we are on a 30g ram aws instance
>> - we have ~30g tucked away in the ../solr/server/ dir
>> - our largest core is 6.8g w/ ~25 segments at any given time. this is also
>> the core that our business directly runs off of, users interact with, etc.
>> - 5g is for a logs type of dataset that analytics can be built off of to
>> help inform the primary core above
>> - 3g are taken up by 3 different third party sources that we use solr to
>> warehouse and have available for query for the sake of linking items in our
>> primary core to these cores for data enrichment
>> - several others take up < 1g each
>> - and then we have dev- and demo- flavors for some of these
>> 
>> we had been operating on a 16gb machine till a few weeks ago (actually
>> bumped while at lucene revolution bc i hadn't noticed how much we'd
>> outgrown the cache size's needs till the week before!). the load when doing
>> an import or running our heavier operations is much better and doesn't fall
>> under the weight of the operations like it had been doing.
>> 
>> we have no master/slave replica. all of our data is 'replicated' by the
>> fact that it exists in mysql. if solr were to go down it'd be a nice big
>> fire but one we could recover from within a couple hours by simply
>> reimporting.
>> 
>> i'd like to have a more sophisticated set up in place for fault tolerance
>> than that, of course. i'd also like to see our heavy, many-query based
>> operations be speedier and better capable of handling multi-threaded runs
>> at once w/ ease.
>> 
>> is this a matter of getting still more ram on the machine? cpus for faster
>> processing? splitting up the read/write operations between master/slave?
>> going full steam into a solrcloud configuration?
>> 
>> one more note. per discussion at the conference i'm combing through our
>> configs to make sure we trim any fat we can. also wanting to get
> > optimization scheduled more regularly to help out w segmentation and
> > garbage heap. not sure how far those two alone will get us, though.

How to Index JSON field Solr 5.3.2

2017-10-02 Thread Deeksha Sharma
Hi everyone,


I have created a core and index data in Solr using dataImportHandler.


The schema for the core looks like this:

<field name="md5" ... required="true"/>

<field name="rules" ... required="true"/>

This is my data in mysql database:


md5:"376463475574058bba96395bfb87"


rules: 
{"fileRules":[{"file_id":1321241,"md5":"376463475574058bba96395bfb87","group_id":69253,"filecdata1":{"versionId":3382858,"version":"1.2.1","detectionNotes":"Generated
 from Ibiblio Maven2, see URL 
(http://maven.ibiblio.org/maven2/sk/seges/acris/acris-security-hibernate).","texts":[{"shortText":null,"header":"Sample
 from URL 
(http://maven.ibiblio.org/maven2/sk/seges/acris/acris-os-parent/1.2.1/acris-os-parent-1.2.1.pom)","text":"
The Apache Software License, Version 2.0
http://www.apache.org/licenses/LICENSE-2.0.txt
repo
"}],"notes":[],"forge":"Ibiblio 
Maven2"}}],"groupRules":[{"group_id":69253,"parent":-1,"component":"sk.seges.acris/acris-security-hibernate
 - AcrIS Security with Hibernate metadata","license":"Apache 
2.0","groupcdata1":{"componentId":583560,"title":"sk.seges.acris/acris-security-hibernate
 - Ibiblio 
Maven2","licenseIds":[20],"priority":3,"url":"http://maven.ibiblio.org/maven2/sk/seges/acris/acris-security-hibernate","displayName":"AcrIS
 Security with Hibernate 
metadata","description":null,"texts":[],"notes":[],"forge":"Ibiblio Maven2"}}]}

Query results from Solr:

{ "responseHeader":{ "status":0, "QTime":0, "params":{ 
"q":"md5:03bb576a6b6e001cd94e91ad4c29", "indent":"on", "wt":"json", 
"_":"1506933082656"}}, "response":{"numFound":1,"start":0,"docs":[ { 
"rules":"{\"fileRules\":[{\"file_id\":7328190,\"md5\":\"03bb576a6b6e001cd94e91ad4c29\",\"group_id\":241307,\"filecdata1\":{\"versionId\":15761972,\"version\":\"1.0.2\",\"detectionNotes\":null,\"texts\":[{\"shortText\":null,\"header\":\"The
 following text is found at URL 
(https://www.nuget.org/packages/HangFire.Redis/1.0.2)\",\"text\":\"License 
details:\nLGPL-3.0\"}],\"notes\":[],\"forge\":\"NuGet 
Gallery\"}}],\"groupRules\":[{\"group_id\":241307,\"parent\":-1,\"component\":\"HangFire.Redis\",\"license\":\"LGPL
 
3.0\",\"groupcdata1\":{\"componentId\":3524318,\"title\":null,\"licenseIds\":[216],\"priority\":1,\"url\":\"https://www.nuget.org/packages/HangFire.Redis\",\"displayName\":\"Hangfire
 Redis Storage [DEPRECATED]\",\"description\":\"DEPRECATED -- DO NOT INSTALL OR 
UPDATE. Now shipped with Hangfire Pro, please read the \"Project site\" 
(http://odinserj.net/2014/11/15/hangfire-pro/) for more 
information.\",\"texts\":[{\"shortText\":null,\"header\":\"License details 
history:\n(Refer to https://www.nuget.org/packages/HangFire.Redis and select 
the desired version for more information)\",\"text\":\"LGPL-3.0 - (for 
HangFire.Redis versions 0.7.0, 0.7.1, 0.7.3, 0.7.4, 0.7.5, 0.8.0, 0.8.1, 0.8.2, 
0.8.3, 0.9.0, 0.9.1, 1.0.1, 1.0.0, 1.0.2)\nNo information - (for HangFire.Redis 
versions 1.1.1, 2.0.1, 
2.0.0)\"}],\"notes\":[{\"header\":null,\"text\":\"Project Site: 
http://odinserj.net/2014/11/15/hangfire-pro\"},{\"header\":\"Previous Project 
Sites\",\"text\":\"https://github.com/odinserj/HangFire - (for Hangfire Redis 
Storage [DEPRECATED] version 0.7.0)\nhttp://hangfire.io - (for Hangfire Redis 
Storage [DEPRECATED] versions 0.7.1, 0.7.3, 0.7.4, 0.7.5, 0.8.0, 0.8.1, 0.8.2, 
0.8.3, 0.9.0, 0.9.1, 1.0.1, 1.0.0, 1.0.2, 1.1.1)\nNo information - (for 
Hangfire Redis Storage [DEPRECATED] versions 2.0.1, 
2.0.0)\"},{\"header\":\"License 
links\",\"text\":\"https://raw.github.com/odinserj/HangFire/master/COPYING.LESSER
 - (for HangFire.Redis version 
0.7.0)\nhttps://raw.github.com/odinserj/HangFire/master/LICENSE.md - (for 
HangFire.Redis versions 0.7.1, 0.7.3, 0.7.4, 0.7.5, 0.8.0, 0.8.1, 0.8.2, 0.8.3, 
0.9.0, 0.9.1)\nhttps://raw.github.com/odinserj/Hangfire/master/LICENSE.md - 
(for HangFire.Redis versions 1.0.1, 1.0.0, 
1.0.2)\nhttps://raw.github.com/HangfireIO/Hangfire/master/LICENSE.md - (for 
HangFire.Redis version 1.1.1)\nNo information - (for HangFire.Redis versions 
2.0.1, 2.0.0)\"}],\"forge\":\"NuGet Gallery\"}}]}", 
"md5":"03bb576a6b6e001cd94e91ad4c29", "_version_":1579807444777828352}] }}



Now when I receive the results from a Solr query, it returns the String for
rules. How can I tell Solr to index rules as JSON and return valid JSON
instead of an escaped String?

Any help is greatly appreciated.
Thanks!



Re: Distributed IDF configuration query

2017-10-02 Thread alessandro.benedetti
Hi Reth,
there are some problems with the debug output for distributed IDF [1].

Your case seems different, though.
It has been a while since I experimented with that feature, but your config
seems OK to me.
What helped me a lot at the time was debugging my Solr instance.
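
For reference, distributed IDF is switched on by declaring a non-default
statsCache in solrconfig.xml, e.g.:

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>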



[1] https://issues.apache.org/jira/browse/SOLR-7759



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Keeping the index naturally ordered by some field

2017-10-02 Thread alessandro.benedetti
Hi Alex,
just to explore your question a bit, why do you need that?
Do you need to reduce query time?
Have you tried enabling docValues for the fields of interest?
Doc Values seem to me a pretty useful data structure when sorting is a
requirement.
I am curious to understand why that was not an option.
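
In the schema that is just an attribute on the field definition; a sketch
with a made-up field name:

<field name="timestamp_dt" type="date" indexed="true" stored="true" docValues="true"/>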

Regards



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html