SolrDeletionPolicy & Core Reload

2021-01-02 Thread John Davis
Hi,

Does a core reload pick up changes to SolrDeletionPolicy in solrconfig.xml,
or does the Solr server need to be restarted?

And what would be the best way to check the current values of
SolrDeletionPolicy (e.g. maxCommitsToKeep, maxCommitAge) from the Solr
admin console?

Thank you.
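
A minimal sketch of one way to inspect the current values from the command line
rather than the admin UI (assuming a core named "mycore" on the default port):
the Config API returns the effective configuration for the core, and the deletion
policy settings (maxCommitsToKeep, maxCommitAge) should appear under its
indexConfig section.

curl 'http://localhost:8983/solr/mycore/config'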


Re: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread John Gallagher
While on the topic of renaming roles, I'd like to propose finding a better
term than "overseer" which has historical slavery connotations as well.
Director, perhaps?


John Gallagher

On Thu, Jun 18, 2020 at 8:48 AM Jason Gerlowski 
wrote:

> +1 to rename master/slave, and +1 to choosing terminology distinct
> from what's used for SolrCloud.  I could be happy with several of the
> proposed options.  Since a good few have been proposed though, maybe
> an eventual vote thread is the most organized way to aggregate the
> opinions here.
>
> I'm less positive about the prospect of changing the name of our
> primary git branch.  Most projects that contributors might come from,
> most tutorials out there to learn git, most tools built on top of git
> - the majority are going to assume "master" as the main branch.  I
> appreciate the change that Github is trying to effect in changing the
> default for new projects, but it'll be a long time before that
> competes with the huge bulk of projects, documentation, etc. out there
> using "master".  Our contributors are smart and I'm sure they'd figure
> it out if we used "main" or something else instead, but having a
> non-standard git setup would be one more "papercut" in understanding
> how to contribute to a project that already makes that harder than it
> should.
>
> Jason
>
>
> On Thu, Jun 18, 2020 at 7:33 AM Demian Katz 
> wrote:
> >
> > Regarding people having a problem with the word "master" -- GitHub is
> changing the default branch name away from "master," even in isolation from
> a "slave" pairing... so the terminology seems to be falling out of favor in
> all contexts. See:
> >
> >
> https://www.cnet.com/news/microsofts-github-is-removing-coding-terms-like-master-and-slave/
> >
> > I'm not here to start a debate about the semantics of that, just to
> provide evidence that in some communities, the term "master" is causing
> concern all by itself. If we're going to make the change anyway, it might
> be best to get it over with and pick the most appropriate terminology we
> can agree upon, rather than trying to minimize the amount of change. It's
> going to be backward breaking anyway, so we might as well do it all now
> rather than risk having to go through two separate breaking changes at
> different points in time.
> >
> > - Demian
> >
> > -Original Message-
> > From: Noble Paul 
> > Sent: Thursday, June 18, 2020 1:51 AM
> > To: solr-user@lucene.apache.org
> > Subject: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in Solr
> >
> > Looking at the code I see 692 occurrences of the word "slave".
> > Mostly variable names and ref guide docs.
> >
> > The word "slave" is present in the responses as well. Any change in the
> request param/response payload is backward incompatible.
> >
> > I have no objection to changing the names in the ref guide and other
> internal variables. Going ahead with backward-incompatible changes is
> painful. If somebody has the appetite to take it up, it's OK.
> >
> > If we must change, master/follower can be a good enough option.
> >
> > master (noun): A man in charge of an organization or group.
> > master(adj) : having or showing very great skill or proficiency.
> > master(verb): acquire complete knowledge or skill in (a subject,
> technique, or art).
> > master (verb): gain control of; overcome.
> >
> > I hope nobody has a problem with the term "master"
> >
> > On Thu, Jun 18, 2020 at 3:19 PM Ilan Ginzburg 
> wrote:
> > >
> > > Would master/follower work?
> > >
> > > Half the rename work while still getting rid of the slavery
> connotation...
> > >
> > >
> > > On Thu 18 Jun 2020 at 07:13, Walter Underwood 
> wrote:
> > >
> > > > > On Jun 17, 2020, at 4:00 PM, Shawn Heisey 
> wrote:
> > > > >
> > > > > It has been interesting watching this discussion play out on
> > > > > multiple
> > > > open source mailing lists.  On other projects, I have seen a VERY
> > > > high level of resistance to these changes, which I find disturbing
> > > > and surprising.
> > > >
> > > > Yes, it is nice to see everyone just pitch in and do it on this list.
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > > > http://observer.wunderwood.org/  (my blog)
> > > >
> > > >
> >
> >
> >
> > --
> > -
> > Noble Paul
>


Require java 8 upgrade

2020-05-21 Thread Akhila John
Hi Team,

We use Solr 5.3.1 with Sitecore 8.2.
We need to upgrade Java to 'Java 8 Update 251' and remove or upgrade
Wireshark to 3.2.3 on our application servers.
Could you please advise whether this would have any impact on Solr? Does
Solr 5.3.1 support Java 8?

Thanks and regards,

Akhila


Re: Solrcloud Garbage Collection Suspension linked across nodes?

2020-05-13 Thread John Blythe
can we get this person blocked?
--
John Blythe


On Wed, May 13, 2020 at 1:05 PM ART GALLERY  wrote:

> check out the videos on this website TROO.TUBE don't be such a
> sheep/zombie/loser/NPC. Much love!
> https://troo.tube/videos/watch/aaa64864-52ee-4201-922f-41300032f219
>
> On Mon, May 4, 2020 at 5:43 PM Webster Homer
>  wrote:
> >
> > My company has several SolrCloud environments. In our most active cloud
> we are seeing outages that are related to GC pauses. We have about 10
> collections, of which 4 get a lot of traffic. The SolrCloud consists of 4
> nodes with 6 processors and an 11GB heap (25GB of physical memory).
> >
> > I notice that the 4 nodes seem to do their garbage collection at almost
> the same time. That seems strange to me. I would expect them to be more
> staggered.
> >
> > This morning we had a GC pause that caused problems. During that time
> our application service was reporting "No live SolrServers available to
> handle this request"
> >
> > Between 3:55 and 3:56 AM all 4 nodes were having some amount of garbage
> collection pauses, for 2 of the nodes it was minor, for one it was 50%. For
> 3 nodes it lasted until 3:57. However, the node with the worst impact
> didn't recover until 4am.
> >
> > How is it that all 4 nodes were in lock step doing GC? If they all are
> doing GC at the same time it defeats the purpose of having redundant cloud
> servers.
> > We switched from CMS to G1GC just this weekend.
> >
> > At this point in time we also saw that traffic to solr was not well
> distributed. The application calls solr using CloudSolrClient which I
> thought did its own load balancing. We saw 10X more traffic going to one
> Solr node than all the others, then we saw it start hitting another node.
> All solr queries come from our application.
> >
> > During this period of time I saw only 1 error message in the solr log:
> > ERROR (zkConnectionManagerCallback-8-thread-1) [   ]
> o.a.s.c.ZkController There was a problem finding the leader in
> zk:org.apache.solr.common.SolrException: Could not get leader props
> >
> > We are currently using Solr 7.7.2
> > GC Tuning
> > GC_TUNE="-XX:NewRatio=3 \
> > -XX:SurvivorRatio=4 \
> > -XX:TargetSurvivorRatio=90 \
> > -XX:MaxTenuringThreshold=8 \
> > -XX:+UseG1GC \
> > -XX:MaxGCPauseMillis=250 \
> > -XX:+ParallelRefProcEnabled"
> >
> >
>
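
One way to compare GC behavior across the four nodes without attaching a profiler
is the node-level Metrics API; a hedged sketch, assuming hostnames solr1..solr4
and the default port. The group=jvm output includes heap occupancy (memory.heap.*)
and per-collector GC counters and timings, so polling it on a schedule makes it
easier to see whether the nodes really do pause in lockstep.

for host in solr1 solr2 solr3 solr4; do
  curl -s "http://$host:8983/solr/admin/metrics?group=jvm"
done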


Blocking certain queries

2020-02-03 Thread John Davis
Hello,

Is there a way to block certain queries in Solr? For example, if there is a
delete for *:* or a known query that causes problems, can these be blocked at
the Solr server layer?


Re: Haystack CFP is open, come and tell us how you tune relevance for Lucene/Solr

2020-01-28 Thread John Blythe
Perfect, thank you!

On Tue, Jan 28, 2020 at 02:18 Charlie Hull  wrote:

> We're expecting prices to be very similar to last year - early bird will
> be $300 ish for conference only and $2250 ish for conference plus a
> training (we're running no less than 5 different classes that week
> including Think Like a Relevance Engineer, Hello LTR and NLP) -
> hopefully this will give you enough information for budgeting.
>
> Speakers get a small discount too!
>
> Cheers
>
> Charlie
>
> On 27/01/2020 22:21, John Blythe wrote:
> > Hey Doug. Do you know the pricing yet? Trying to get something submitted
> to
> > VP so I can take my team to the conference. Thanks!
> >
> > On Mon, Jan 27, 2020 at 14:54 Doug Turnbull <
> > dturnb...@opensourceconnections.com> wrote:
> >
> >> Just an update the CFP was extended to Feb 7th, less than 2 weeks
> away.  ->
> >> http://haystackconf.com
> >>
> >> It's your ethical imperative to share! ;)
> >>
> >>
> https://opensourceconnections.com/blog/2020/01/23/opening-up-search-is-an-ethical-imperative/
> >>
> >> And no talk is too small, people often underestimate what they're doing,
> >> and very much underestimate how interesting others will find your story!
> >> The best talks often come from the least expected people & orgs.
> >>
> >> On Thu, Jan 9, 2020 at 4:13 AM Charlie Hull  wrote:
> >>
> >>> Hi all,
> >>>
> >>> Haystack, the search relevance conference, is confirmed for 29th & 30th
> >>> April 2020 in Charlottesville, Virginia - the CFP is open and we need
> >>> your contributions! More information at www.haystackconf.com
> >>> <http://www.haystackconf.com>including links to previous talks,
> deadline
> >>> is 31st January. We'd love to hear your Lucene/Solr relevance stories.
> >>>
> >>> Cheers
> >>>
> >>> Charlie
> >>> --
> >>>
> >>> Charlie Hull
> >>> Flax - Open Source Enterprise Search
> >>>
> >>> tel/fax: +44 (0)8700 118334
> >>> mobile:  +44 (0)7767 825828
> >>> web: www.flax.co.uk
> >>>
> >>>
> >> --
> >> *Doug Turnbull **| CTO* | OpenSource Connections
> >> <http://opensourceconnections.com>, LLC | 240.476.9983
> >> Author: Relevant Search <http://manning.com/turnbull>
> >> This e-mail and all contents, including attachments, is considered to be
> >> Company Confidential unless explicitly stated otherwise, regardless
> >> of whether attachments are marked as such.
> >>
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
> --
John Blythe


Re: Haystack CFP is open, come and tell us how you tune relevance for Lucene/Solr

2020-01-27 Thread John Blythe
Hey Doug. Do you know the pricing yet? Trying to get something submitted to
VP so I can take my team to the conference. Thanks!

On Mon, Jan 27, 2020 at 14:54 Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Just an update the CFP was extended to Feb 7th, less than 2 weeks away.  ->
> http://haystackconf.com
>
> It's your ethical imperative to share! ;)
>
> https://opensourceconnections.com/blog/2020/01/23/opening-up-search-is-an-ethical-imperative/
>
> And no talk is too small, people often underestimate what they're doing,
> and very much underestimate how interesting others will find your story!
> The best talks often come from the least expected people & orgs.
>
> On Thu, Jan 9, 2020 at 4:13 AM Charlie Hull  wrote:
>
> > Hi all,
> >
> > Haystack, the search relevance conference, is confirmed for 29th & 30th
> > April 2020 in Charlottesville, Virginia - the CFP is open and we need
> > your contributions! More information at www.haystackconf.com
> > <http://www.haystackconf.com>including links to previous talks, deadline
> > is 31st January. We'd love to hear your Lucene/Solr relevance stories.
> >
> > Cheers
> >
> > Charlie
> > --
> >
> > Charlie Hull
> > Flax - Open Source Enterprise Search
> >
> > tel/fax: +44 (0)8700 118334
> > mobile:  +44 (0)7767 825828
> > web: www.flax.co.uk
> >
> >
>
> --
> *Doug Turnbull **| CTO* | OpenSource Connections
> <http://opensourceconnections.com>, LLC | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>
-- 
John Blythe


Solr Payloads

2019-09-20 Thread John Davis
We are using a Solr payload field and noticed that the values extracted using
payload() sometimes don't match the value stored in the field. Is there a
lossy encoding for the payload value?

fq=payload_field:*, fl=payload_field,payload(payload_field, 573131)

"payload_field": "573131|*1568263581*",
"payload(payload_field, 573131)": *1568263550*
  ...
"payload_field": "573131|1568263582",
"payload(payload_field, 573131)": 1568263550


Field definition:

   
  
  
  
  
  
  

John


Re: Solr with encrypted HDFS

2019-09-12 Thread John Thorhauer
Great.  Thanks so much, Hendrik, for sharing your experience!  We might have
high volume levels to deal with but probably not high commit rates.

On Thu, Sep 12, 2019 at 1:45 AM Hendrik Haddorp 
wrote:

> Hi,
>
> we have some setups that use an encryption zone in HDFS. Once you have
> the hdfs config setup the rest is transparent to the client and thus
> Solr works just fine like that. That said, we have some general issues
> with Solr and HDFS. The main problem seems to be around the transaction
> log files. We have a quite high commit rate and these short lived files
> don't seem to play well with HDFS and triple replication of the blocks
> in HDFS. But encryption did not add any issues for us.
>
> regards,
> Hendrik
>
> On 11.09.19 22:53, John Thorhauer wrote:
> > Hi,
> >
> > I am interested in encrypting/protecting my solr indices.  I am wondering
> > if Solr can work with an encrypted HDFS.  I see that these instructions (
> >
> https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/configuring-hdfs-encryption/content/configuring_and_using_hdfs_data_at_rest_encryption.html
> )
> > explain that:
> >
> > "After permissions are set, Java API clients and HDFS applications with
> > sufficient HDFS and Ranger KMS access privileges can write and read
> to/from
> > files in the encryption zone"
> >
> >
> > So I am wondering if the solr/java api that uses HDFS would work with
> this
> > as well and also, has anyone had experience running this?  Either good
> > or bad?
> >
> > Thanks,
> > John
> >
>
>
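
For anyone setting this up, a rough sketch of how a SolrCloud node gets pointed at
HDFS (property names as used in the "Running Solr on HDFS" ref guide section; the
namenode address and paths below are placeholders). The relevant piece for
encryption zones is presumably solr.hdfs.confdir, which points Solr at the Hadoop
client configuration carrying the encryption zone / KMS settings, so that, as
Hendrik describes, the rest is transparent to Solr.

bin/solr start -c \
  -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=hdfs://namenode:8020/solr \
  -Dsolr.hdfs.confdir=/etc/hadoop/conf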


Solr with encrypted HDFS

2019-09-11 Thread John Thorhauer
Hi,

I am interested in encrypting/protecting my solr indices.  I am wondering
if Solr can work with an encrypted HDFS.  I see that these instructions (
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/configuring-hdfs-encryption/content/configuring_and_using_hdfs_data_at_rest_encryption.html)
explain that:

"After permissions are set, Java API clients and HDFS applications with
sufficient HDFS and Ranger KMS access privileges can write and read to/from
files in the encryption zone"


So I am wondering if the solr/java api that uses HDFS would work with this
as well and also, has anyone had experience running this?  Either good
or bad?

Thanks,
John


Re: Enabling/disabling docValues

2019-06-11 Thread John Davis
There is no way to match case-insensitively without TextFields + no
tokenization. It's a long-standing limitation that you cannot apply any
analyzers to str fields.

Thanks for pointing out the re-index page; I've seen it. However, sometimes
it is hard to re-index within a reasonable amount of time and resources, and if
we empower power users to understand the system better it will help them make
more informed tradeoffs.

On Tue, Jun 11, 2019 at 6:52 AM Gus Heck  wrote:

> On Mon, Jun 10, 2019 at 10:53 PM John Davis 
> wrote:
>
> > You have made many assumptions which might not always be realistic a)
> > TextField is always tokenized
>
>
> Well, you could of course change configuration or code to do something else
> but this would be a very odd and misleading thing to do and we would expect
> you to have mentioned it.
>
>
> > b) Users care about precise counts and
>
>
> This is indeed use case dependent if you are talking about approximately
> correct (150 vs 152 etc), but it's pretty reasonable to say that gross
> errors (75 vs 153 or 0 vs 5 etc) more or less make faceting pointless.
>
>
> > c) Users have the luxury or ability to do a full re-index anytime.
>
>
> This is a state of affairs we consistently advise against. The reason we
> give the advice is precisely because one cannot change the schema out from
> under an existing index safely without rewriting the index. Without
> extremely careful design on your side (not using certain features and high
> storage requirements), your index will not retain enough information to
> re-remake itself. Therefore, it is a long standing bad practice to not have
> a separate canonical copy of the data and a means to re-index it (or a
> design where only the very most recent data is important, and a copy of
> that). There is a whole page dedicated to reindexing in the ref guide:
> https://lucene.apache.org/solr/guide/8_0/reindexing.html Here's a relevant
> bit from the current version:
>
> `There is no process in Solr for programmatically reindexing data. When we
> say "reindex", we mean, literally, "index it again". However you got the
> data into the index the first time, you will run that process again. It is
> strongly recommended that Solr users index their data in a repeatable,
> consistent way, so that the process can be easily repeated when the need
> for reindexing arises.`
>
>
> The ref guide has lots of nice info, maybe you should read it rather than
> snubbing one of the nicest and most knowledgeable committers on the project
> (who is helping you for free) by haughtily saying you'll go ask someone
> else... And if you've been left with this situation (no ability to reindex)
> by your predecessor you have our deepest sympathies, but it still doesn't
> change the fact that you need break it to management the your predecessor
> has lost the data required to maintain the system and you still need
> re-index whatever you can salvage somehow, or start fresh.
>
> When Erick is saying you shouldn't be asking that question... >90% of the
> time you really shouldn't be, and if you do pursue it, you'll just waste a
> lot of your own time.
>
>
> > On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson  >
> > wrote:
> >
> > > bq. Does lucene look at %docs in each state, or the first doc or
> > something
> > > else?
> > >
> > > Frankly I don’t care since no matter what, the results of faceting
> mixed
> > > definitions is not useful.
> > >
> > > tl;dr;
> > >
> > > “When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> > > means just what I choose it to mean — neither more nor less.’
> > >
> > > So “undefined" in this case means “I don’t see any value at all in
> > chasing
> > > that info down” ;).
> > >
> > > Changing from regular text to SortableText means that the results will
> be
> > > inaccurate no matter what. For example, I have a doc with the value “my
> > dog
> > > has fleas”. When NOT using SortableText, there are multiple tokens so
> > facet
> > > counts would be:
> > >
> > > my (1)
> > > dog (1)
> > > has (1)
> > > fleas (1)
> > >
> > > But for SortableText will be:
> > >
> > > my dog has fleas (1)
> > >
> > > Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> > > doc1 was  indexed before switching to SortableText and doc2 after.
> > > Presumably  the output you want is:
> > >
> > > my dog has fleas (1)
> > > my ca

Re: Enabling/disabling docValues

2019-06-10 Thread John Davis
You have made many assumptions which might not always be realistic a)
TextField is always tokenized b) Users care about precise counts and c)
Users have the luxury or ability to do a full re-index anytime. These are
real issues and there is no black/white solution. I will ask Lucene folks
on the actual implementation.

On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson 
wrote:

> bq. Does lucene look at %docs in each state, or the first doc or something
> else?
>
> Frankly I don’t care since no matter what, the results of faceting mixed
> definitions is not useful.
>
> tl;dr;
>
> “When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> means just what I choose it to mean — neither more nor less.’
>
> So “undefined" in this case means “I don’t see any value at all in chasing
> that info down” ;).
>
> Changing from regular text to SortableText means that the results will be
> inaccurate no matter what. For example, I have a doc with the value “my dog
> has fleas”. When NOT using SortableText, there are multiple tokens so facet
> counts would be:
>
> my (1)
> dog (1)
> has (1)
> fleas (1)
>
> But for SortableText will be:
>
> my dog has fleas (1)
>
> Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> doc1 was  indexed before switching to SortableText and doc2 after.
> Presumably  the output you want is:
>
> my dog has fleas (1)
> my cat has fleas (1)
>
> But you can’t get that output.  There are three cases:
>
> 1> Lucene treats all documents as SortableText, faceting on the docValues
> parts. No facets on doc1
>
> my  cat has fleas (1)
>
> 2> Lucene treats all documents as tokenized, faceting on each individual
> token. Faceting is performed on the tokenized content of both,  docValues
> in doc2  ignored
>
> my  (2)
> dog (1)
> has (2)
> fleas (2)
> cat (1)
>
>
> 3> Lucene does the best it can, faceting on the tokens for docs without
> SortableText and docValues if the doc was indexed with Sortable text. doc1
> faceted on tokenized, doc2 on docValues
>
> my  (1)
> dog (1)
> has (1)
> fleas (1)
> my cat has fleas (1)
>
> Since none of those cases is what I want, there’s no point I can see in
> chasing down what actually happens….
>
> Best,
> Erick
>
> P.S. I _think_ Lucene tries to use the definition from the first segment,
> but since whether the lists of segments to be  merged don’t look at the
> field definitions at all. Whether the first segment in the list has
> SortableText or not will not be predictable in a general way even within a
> single run.
>
>
> > On Jun 9, 2019, at 6:53 PM, John Davis 
> wrote:
> >
> > Understood, however code is rarely random/undefined. Does lucene look at
> %
> > docs in each state, or the first doc or something else?
> >
> > On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson 
> > wrote:
> >
> >> It’s basically undefined. When segments are merged that have dissimilar
> >> definitions like this what can Lucene do? Consider:
> >>
> >> Faceting on a text (not sortable) means that each individual token in
> the
> >> index is uninverted on the Java heap and the facets are computed for
> each
> >> individual term.
> >>
> >> Faceting on a SortableText field just has a single term per document,
> and
> >> that in the docValues structures as opposed to the inverted index.
> >>
> >> Now you change the value and start indexing. At some point a segment
> >> containing no docValues is merged with a segment containing docValues
> for
> >> the field. The resulting mixed segment is in this state. If you facet on
> >> the field, should the docs without docValues have each individual term
> >> counted? Or just the SortableText values in the docValues structure?
> >> Neither one is right.
> >>
> >> Also remember that Lucene has no notion of schema. That’s entirely
> imposed
> >> on Lucene by Solr carefully constructing low-level analysis chains.
> >>
> >> So I’d _strongly_ recommend you re-index your corpus to a new collection
> >> with the current definition, then perhaps use CREATEALIAS to seamlessly
> >> switch.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 9, 2019, at 12:50 PM, John Davis 
> >> wrote:
> >>>
> >>> Hi there,
> >>> We recently changed a field from TextField + no docValues to
> >>> SortableTextField which has docValues enabled by default. Once I did
> >> this I
> >>> do not see any facet values for the field. I know that once all the
> docs
> >>> are re-indexed facets should work again, however can someone clarify
> the
> >>> current logic of lucene/solr how facets will be computed when schema is
> >>> changed from no docValues to docValues and vice-versa?
> >>>
> >>> 1. Until ALL the docs are re-indexed, no facets will be returned?
> >>> 2. Once certain fraction of docs are re-indexed, those facets will be
> >>> returned?
> >>> 3. Something else?
> >>>
> >>>
> >>> Varun
> >>
> >>
>
>


Re: Enabling/disabling docValues

2019-06-09 Thread John Davis
Understood, however code is rarely random/undefined. Does lucene look at %
docs in each state, or the first doc or something else?

On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson 
wrote:

> It’s basically undefined. When segments are merged that have dissimilar
> definitions like this what can Lucene do? Consider:
>
> Faceting on a text (not sortable) means that each individual token in the
> index is uninverted on the Java heap and the facets are computed for each
> individual term.
>
> Faceting on a SortableText field just has a single term per document, and
> that in the docValues structures as opposed to the inverted index.
>
> Now you change the value and start indexing. At some point a segment
> containing no docValues is merged with a segment containing docValues for
> the field. The resulting mixed segment is in this state. If you facet on
> the field, should the docs without docValues have each individual term
> counted? Or just the SortableText values in the docValues structure?
> Neither one is right.
>
> Also remember that Lucene has no notion of schema. That’s entirely imposed
> on Lucene by Solr carefully constructing low-level analysis chains.
>
> So I’d _strongly_ recommend you re-index your corpus to a new collection
> with the current definition, then perhaps use CREATEALIAS to seamlessly
> switch.
>
> Best,
> Erick
>
> > On Jun 9, 2019, at 12:50 PM, John Davis 
> wrote:
> >
> > Hi there,
> > We recently changed a field from TextField + no docValues to
> > SortableTextField which has docValues enabled by default. Once I did
> this I
> > do not see any facet values for the field. I know that once all the docs
> > are re-indexed facets should work again, however can someone clarify the
> > current logic of lucene/solr how facets will be computed when schema is
> > changed from no docValues to docValues and vice-versa?
> >
> > 1. Until ALL the docs are re-indexed, no facets will be returned?
> > 2. Once certain fraction of docs are re-indexed, those facets will be
> > returned?
> > 3. Something else?
> >
> >
> > Varun
>
>


Enabling/disabling docValues

2019-06-09 Thread John Davis
Hi there,
We recently changed a field from TextField + no docValues to
SortableTextField which has docValues enabled by default. Once I did this I
do not see any facet values for the field. I know that once all the docs
are re-indexed facets should work again, however can someone clarify the
current logic of lucene/solr how facets will be computed when schema is
changed from no docValues to docValues and vice-versa?

1. Until ALL the docs are re-indexed, no facets will be returned?
2. Once certain fraction of docs are re-indexed, those facets will be
returned?
3. Something else?


Varun
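
A quick way to watch what actually happens as documents are re-indexed is to run
the same facet request repeatedly and compare the counts; a minimal sketch, with
placeholder collection and field names:

curl -G 'http://localhost:8983/solr/mycollection/select' \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'rows=0' \
  --data-urlencode 'facet=true' \
  --data-urlencode 'facet.field=my_changed_field'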


Re: Solr Heap Usage

2019-06-07 Thread John Davis
What would be the best way to understand where heap is being used?
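
A first-cut sketch, assuming shell access to the node and the default port: watch
GC and heap occupancy over time, take a live-object histogram (note that
-histo:live forces a full GC, so uncollected garbage is excluded), and pull
Solr's own JVM metrics.

jstat -gcutil <solr-pid> 5000                                # heap/GC summary every 5s
jmap -histo:live <solr-pid> | head -40                       # largest classes on the heap
curl 'http://localhost:8983/solr/admin/metrics?group=jvm'    # memory.heap.* and gc.* metrics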

On Tue, Jun 4, 2019 at 9:31 PM Greg Harris  wrote:

> Just a couple of points I’d make here. I did some testing a while back in
> which if no commit is made, (hard or soft) there are internal memory
> structures holding tlogs and it will continue to get worse the more docs
> that come in. I don’t know if that’s changed in further versions. I’d
> recommend doing commits with some amount of frequency in indexing heavy
> apps, otherwise you are likely to have heap issues. I personally would
> advocate for some of the points already made. There are too many variables
> going on here and ways to modify stuff to make sizing decisions and think
> you’re doing anything other than a pure guess if you don’t test and
> monitor. I’d advocate for a process in which testing is done regularly to
> figure out questions like number of shards/replicas, heap size, memory etc.
> Hard data, good process and regular testing will trump guesswork every time
>
> Greg
>
> On Tue, Jun 4, 2019 at 9:22 AM John Davis 
> wrote:
>
> > You might want to test with softcommit of hours vs 5m for heavy indexing
> +
> > light query -- even though there is internal memory structure overhead
> for
> > no soft commits, in our testing a 5m soft commit (via commitWithin) has
> > resulted in a very very large heap usage which I suspect is because of
> > other overhead associated with it.
> >
> > On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson 
> > wrote:
> >
> > > I need to update that, didn’t understand the bits about retaining
> > internal
> > > memory structures at the time.
> > >
> > > > On Jun 4, 2019, at 2:10 AM, John Davis 
> > > wrote:
> > > >
> > > > Erick - These conflict, what's changed?
> > > >
> > > > So if I were going to recommend settings, they’d be something like
> > this:
> > > > Do a hard commit with openSearcher=false every 60 seconds.
> > > > Do a soft commit every 5 minutes.
> > > >
> > > > vs
> > > >
> > > > Index-heavy, Query-light
> > > > Set your soft commit interval quite long, up to the maximum latency
> you
> > > can
> > > > stand for documents to be visible. This could be just a couple of
> > minutes
> > > > or much longer. Maybe even hours with the capability of issuing a
> hard
> > > > commit (openSearcher=true) or soft commit on demand.
> > > >
> > >
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <
> erickerick...@gmail.com
> > >
> > > > wrote:
> > > >
> > > >>> I've looked through SolrJ, DIH and others -- is the bottomline
> > > >>> across all of them to "batch updates" and not commit as long as
> > > possible?
> > > >>
> > > >> Of course it’s more complicated than that ;)….
> > > >>
> > > >> But to start, yes, I urge you to batch. Here’s some stats:
> > > >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> > > >>
> > > >> Note that at about 100 docs/batch you hit diminishing returns.
> > > _However_,
> > > >> that test was run on a single shard collection, so if you have 10
> > shards
> > > >> you’d
> > > >> have to send 1,000 docs/batch. I wouldn’t sweat that number much,
> just
> > > >> don’t
> > > >> send one at a time. And there are the usual gotchas if your
> documents
> > > are
> > > >> 1M .vs. 1K.
> > > >>
> > > >> About committing. No, don’t hold off as long as possible. When you
> > > commit,
> > > >> segments are merged. _However_, the default 100M internal buffer
> size
> > > means
> > > >> that segments are written anyway even if you don’t hit a commit
> point
> > > when
> > > >> you have 100M of index data, and merges happen anyway. So you won’t
> > save
> > > >> anything on merging by holding off commits.
> > > >> And you’ll incur penalties. Here’s more than you want to know about
> > > >> commits:
> > > >>
> > > >>
> > >
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-

Re: Solr Heap Usage

2019-06-04 Thread John Davis
You might want to test with softcommit of hours vs 5m for heavy indexing +
light query -- even though there is internal memory structure overhead for
no soft commits, in our testing a 5m soft commit (via commitWithin) has
resulted in a very very large heap usage which I suspect is because of
other overhead associated with it.
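
For concreteness, the setup being compared looks roughly like this on the update
request (placeholder collection name): commitWithin asks Solr to make the docs
visible within the given number of milliseconds, versus leaving visibility
entirely to the autoCommit/autoSoftCommit settings in solrconfig.xml.

# ~5 minute visibility via commitWithin on each update request
curl 'http://localhost:8983/solr/mycollection/update?commitWithin=300000' \
  -H 'Content-Type: application/json' \
  --data-binary '[{"id": "doc1", "title": "example"}]'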

On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson 
wrote:

> I need to update that, didn’t understand the bits about retaining internal
> memory structures at the time.
>
> > On Jun 4, 2019, at 2:10 AM, John Davis 
> wrote:
> >
> > Erick - These conflict, what's changed?
> >
> > So if I were going to recommend settings, they’d be something like this:
> > Do a hard commit with openSearcher=false every 60 seconds.
> > Do a soft commit every 5 minutes.
> >
> > vs
> >
> > Index-heavy, Query-light
> > Set your soft commit interval quite long, up to the maximum latency you
> can
> > stand for documents to be visible. This could be just a couple of minutes
> > or much longer. Maybe even hours with the capability of issuing a hard
> > commit (openSearcher=true) or soft commit on demand.
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> >
> >
> >
> > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson 
> > wrote:
> >
> >>> I've looked through SolrJ, DIH and others -- is the bottomline
> >>> across all of them to "batch updates" and not commit as long as
> possible?
> >>
> >> Of course it’s more complicated than that ;)….
> >>
> >> But to start, yes, I urge you to batch. Here’s some stats:
> >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> >>
> >> Note that at about 100 docs/batch you hit diminishing returns.
> _However_,
> >> that test was run on a single shard collection, so if you have 10 shards
> >> you’d
> >> have to send 1,000 docs/batch. I wouldn’t sweat that number much, just
> >> don’t
> >> send one at a time. And there are the usual gotchas if your documents
> are
> >> 1M .vs. 1K.
> >>
> >> About committing. No, don’t hold off as long as possible. When you
> commit,
> >> segments are merged. _However_, the default 100M internal buffer size
> means
> >> that segments are written anyway even if you don’t hit a commit point
> when
> >> you have 100M of index data, and merges happen anyway. So you won’t save
> >> anything on merging by holding off commits.
> >> And you’ll incur penalties. Here’s more than you want to know about
> >> commits:
> >>
> >>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >>
> >> But some key take-aways… If for some reason Solr abnormally
> >> terminates, the accumulated documents since the last hard
> >> commit are replayed. So say you don’t commit for an hour of
> >> furious indexing and someone does a “kill -9”. When you restart
> >> Solr it’ll try to re-index all the docs for the last hour. Hard commits
> >> with openSearcher=false aren’t all that expensive. I usually set mine
> >> for a minute and forget about it.
> >>
> >> Transaction logs hold a window, _not_ the entire set of operations
> >> since time began. When you do a hard commit, the current tlog is
> >> closed and a new one opened and ones that are “too old” are deleted. If
> >> you never commit you have a huge transaction log to no good purpose.
> >>
> >> Also, while indexing, in order to accommodate “Real Time Get”, all
> >> the docs indexed since the last searcher was opened have a pointer
> >> kept in memory. So if you _never_ open a new searcher, that internal
> >> structure can get quite large. So in bulk-indexing operations, I
> >> suggest you open a searcher every so often.
> >>
> >> Opening a new searcher isn’t terribly expensive if you have no
> autowarming
> >> going on. Autowarming as defined in solrconfig.xml in filterCache,
> >> queryResultCache
> >> etc.
> >>
> >> So if I were going to recommend settings, they’d be something like this:
> >> Do a hard commit with openSearcher=false every 60 seconds.
> >> Do a soft commit every 5 minutes.
> >>
> >> I’d actually be surprised if you were able to measure differences
> between
> >> those settings and just hard commit with openSearcher=true every 60
> >> seconds and soft commit at -1 (never)…
>

Re: Solr Heap Usage

2019-06-04 Thread John Davis
Erick - These conflict, what's changed?

So if I were going to recommend settings, they’d be something like this:
Do a hard commit with openSearcher=false every 60 seconds.
Do a soft commit every 5 minutes.

vs

Index-heavy, Query-light
Set your soft commit interval quite long, up to the maximum latency you can
stand for documents to be visible. This could be just a couple of minutes
or much longer. Maybe even hours with the capability of issuing a hard
commit (openSearcher=true) or soft commit on demand.
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/




On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson 
wrote:

> > I've looked through SolrJ, DIH and others -- is the bottomline
> > across all of them to "batch updates" and not commit as long as possible?
>
> Of course it’s more complicated than that ;)….
>
> But to start, yes, I urge you to batch. Here’s some stats:
> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
>
> Note that at about 100 docs/batch you hit diminishing returns. _However_,
> that test was run on a single shard collection, so if you have 10 shards
> you’d
> have to send 1,000 docs/batch. I wouldn’t sweat that number much, just
> don’t
> send one at a time. And there are the usual gotchas if your documents are
> 1M .vs. 1K.
>
> About committing. No, don’t hold off as long as possible. When you commit,
> segments are merged. _However_, the default 100M internal buffer size means
> that segments are written anyway even if you don’t hit a commit point when
> you have 100M of index data, and merges happen anyway. So you won’t save
> anything on merging by holding off commits.
> And you’ll incur penalties. Here’s more than you want to know about
> commits:
>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> But some key take-aways… If for some reason Solr abnormally
> terminates, the accumulated documents since the last hard
> commit are replayed. So say you don’t commit for an hour of
> furious indexing and someone does a “kill -9”. When you restart
> Solr it’ll try to re-index all the docs for the last hour. Hard commits
> with openSearcher=false aren’t all that expensive. I usually set mine
> for a minute and forget about it.
>
> Transaction logs hold a window, _not_ the entire set of operations
> since time began. When you do a hard commit, the current tlog is
> closed and a new one opened and ones that are “too old” are deleted. If
> you never commit you have a huge transaction log to no good purpose.
>
> Also, while indexing, in order to accommodate “Real Time Get”, all
> the docs indexed since the last searcher was opened have a pointer
> kept in memory. So if you _never_ open a new searcher, that internal
> structure can get quite large. So in bulk-indexing operations, I
> suggest you open a searcher every so often.
>
> Opening a new searcher isn’t terribly expensive if you have no autowarming
> going on. Autowarming as defined in solrconfig.xml in filterCache,
> queryResultCache
> etc.
>
> So if I were going to recommend settings, they’d be something like this:
> Do a hard commit with openSearcher=false every 60 seconds.
> Do a soft commit every 5 minutes.
>
> I’d actually be surprised if you were able to measure differences between
> those settings and just hard commit with openSearcher=true every 60
> seconds and soft commit at -1 (never)…
>
> Best,
> Erick
>
> > On Jun 2, 2019, at 3:35 PM, John Davis 
> wrote:
> >
> > If we assume there is no query load then effectively this boils down to
> > most effective way for adding a large number of documents to the solr
> > index. I've looked through SolrJ, DIH and others -- is the bottomline
> > across all of them to "batch updates" and not commit as long as possible?
> >
> > On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson 
> > wrote:
> >
> >> Oh, there are about a zillion reasons ;).
> >>
> >> First of all, most tools that show heap usage also count uncollected
> >> garbage. So your 10G could actually be much less “live” data. Quick way
> to
> >> test is to attach jconsole to the running Solr and hit the button that
> >> forces a full GC.
> >>
> >> Another way is to reduce your heap when you start Solr (on a test system
> >> of course) until bad stuff happens, if you reduce it to very close to
> what
> >> Solr needs, you’ll get slower as more and more cycles are spent on GC,
> if
> >> you reduce it a little more you’ll get OOMs.
> >>
> >> You can take heap dumps of course to see where all the memory is being
&

Adding Multiple JSON Documents

2019-06-02 Thread John Davis
Hi there,

I was looking at the Solr documentation for indexing multiple documents via
JSON and noticed an inconsistency in the docs.

Should the POST URL be /update/json/docs instead of just /update? It does
look like the former works, unless both work just fine?

https://lucene.apache.org/solr/guide/7_3/uploading-data-with-index-handlers.html#adding-multiple-json-documents
Adding Multiple JSON Documents


Adding multiple documents at one time via JSON can be done via a JSON Array
of JSON Objects, where each object represents a document:

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/my_collection/update' --data-binary '[
  {"id": "1", "title": "Doc 1"},
  {"id": "2", "title": "Doc 2"}
]'


Re: Solr Heap Usage

2019-06-02 Thread John Davis
If we assume there is no query load then effectively this boils down to
most effective way for adding a large number of documents to the solr
index. I've looked through SolrJ, DIH and others -- is the bottomline
across all of them to "batch updates" and not commit as long as possible?

On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson 
wrote:

> Oh, there are about a zillion reasons ;).
>
> First of all, most tools that show heap usage also count uncollected
> garbage. So your 10G could actually be much less “live” data. Quick way to
> test is to attach jconsole to the running Solr and hit the button that
> forces a full GC.
>
> Another way is to reduce your heap when you start Solr (on a test system
> of course) until bad stuff happens, if you reduce it to very close to what
> Solr needs, you’ll get slower as more and more cycles are spent on GC, if
> you reduce it a little more you’ll get OOMs.
>
> You can take heap dumps of course to see where all the memory is being
> used, but that’s tricky as it also includes garbage.
>
> I’ve seen cache sizes (filterCache in particular) be something that uses
> lots of memory, but that requires queries to be fired. Each filterCache
> entry can take up to roughly maxDoc/8 bytes + overhead….
>
> A classic error is to sort, group or facet on a docValues=false field.
> Starting with Solr 7.6, you can add an option to fields to throw an error
> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.
>
> In short, there’s not enough information until you dive in and test
> bunches of stuff to tell.
>
> Best,
> Erick
>
>
> > On Jun 2, 2019, at 2:22 AM, John Davis 
> wrote:
> >
> > This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
> > index. My hypothesis was merging segments was trying to read it all but if
> > that's not the case I am out of ideas. The one caveat is we are trying to
> > add the documents quickly (~1g an hour) but if lucene does write 100m
> > segments and does streaming merge it shouldn't matter?
> >
> > On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood 
> > wrote:
> >
> >>> On May 31, 2019, at 11:27 PM, John Davis 
> >> wrote:
> >>>
> >>> 2. Merging segments - does solr load the entire segment in memory or
> >> chunks
> >>> of it? if later how large are these chunks
> >>
> >> No, it does not read the entire segment into memory.
> >>
> >> A fundamental part of the Lucene design is streaming posting lists into
> >> memory and processing them sequentially. The same amount of memory is
> >> needed for small or large segments. Each posting list is in document-id
> >> order. The merge is a merge of sorted lists, writing a new posting list
> in
> >> document-id order.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
>
>


Re: Solr Heap Usage

2019-06-02 Thread John Davis
This makes sense. Any ideas why Lucene/Solr will use a 10G heap for a 20G
index? My hypothesis was that merging segments was trying to read it all, but if
that's not the case I am out of ideas. The one caveat is we are trying to
add the documents quickly (~1G an hour), but if Lucene does write 100M
segments and does a streaming merge it shouldn't matter?

On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood 
wrote:

> > On May 31, 2019, at 11:27 PM, John Davis 
> wrote:
> >
> > 2. Merging segments - does solr load the entire segment in memory or
> chunks
> > of it? if later how large are these chunks
>
> No, it does not read the entire segment into memory.
>
> A fundamental part of the Lucene design is streaming posting lists into
> memory and processing them sequentially. The same amount of memory is
> needed for small or large segments. Each posting list is in document-id
> order. The merge is a merge of sorted lists, writing a new posting list in
> document-id order.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Solr Heap Usage

2019-05-31 Thread John Davis
I've read a bunch of the wikis on Solr heap usage and wanted to confirm my
understanding of what Solr uses the heap for:

1. Indexing new documents - until committed? If not, how long are the new
documents kept in the heap?

2. Merging segments - does Solr load the entire segment into memory or chunks
of it? If the latter, how large are these chunks?

3. Queries, facets, caches - anything else major?

John


Re: Facet count incorrect

2019-05-23 Thread John Davis
Reindexing to an alias is not always easy if it requires 2x the resources. Just to
be clear, the issues you mentioned are mostly around faceting, because we
haven't seen any other search/retrieval issues. Or is that not accurate?

On Wed, May 22, 2019 at 5:12 PM Erick Erickson 
wrote:

> 1> I strongly recommend you re-index into a new collection and switch to
> it with a collection alias rather than try to re-index all the docs.
> Segment merging with the same field with dissimilar definitions is not
> guaranteed to do the right thing.
>
> 2> No. There a few (very few) things that don’t require starting fresh.
> You can do some things like add a lowercasefilter, add or remove a field
> totally and the like. Even then you’ll go through a period of mixed-up
> results until the reindex is complete. But changing the type, changing from
> multiValued to singleValued or vice versa (particularly with docValues)
> etc. are all “fraught”.
>
> My usual reply is “if you’re going to reindex everything anyway, why not
> just do it to a new collection and alias when you’re done?” It’s much safer.
>
> Best,
> Erick
>
> > On May 22, 2019, at 3:06 PM, John Davis 
> wrote:
> >
> > Hi there -
> > Our facet counts are incorrect for a particular field and I suspect it is
> > because we changed the type of the field from StrField to TextField. Two
> > questions:
> >
> > 1. If we do re-index all the documents in the index, would these counts
> get
> > fixed?
> > 2. Is there a "safe" way of changing field types that generally works?
> >
> > *Old type:*
> >   > docValues="true" multiValued="true"/>
> >
> > *New type:*
> >   > omitNorms="true" omitTermFreqAndPositions="true" indexed="true"
> > stored="true" positionIncrementGap="100" sortMissingLast="true"
> > multiValued="true">
> > 
> >  
> >  
> >
> >  
>
>
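
A sketch of the collection-alias switch Erick describes, with placeholder names:
reindex into a fresh collection (products_v2 here) built with the corrected
schema, then repoint the alias that the application queries.

curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2'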


Facet count incorrect

2019-05-22 Thread John Davis
Hi there -
Our facet counts are incorrect for a particular field and I suspect it is
because we changed the type of the field from StrField to TextField. Two
questions:

1. If we do re-index all the documents in the index, would these counts get
fixed?
2. Is there a "safe" way of changing field types that generally works?

*Old type:*
  

*New type:*
  

  
  

  


Re: Optimizing fq query performance

2019-04-18 Thread John Davis
FYI
https://issues.apache.org/jira/browse/SOLR-11437
https://issues.apache.org/jira/browse/SOLR-12488

On Thu, Apr 18, 2019 at 7:24 AM Shawn Heisey  wrote:

> On 4/17/2019 11:49 PM, John Davis wrote:
> > I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO
> > *] doesn't seem materially different compared to has_field:1. If no one
> > knows why Lucene optimizes one but not another, it's not clear whether it
> > even optimizes one to be sure.
>
> Queries using a boolean field will be even faster than the all-inclusive
> range query ... but they require work at index time to function
> properly.  If you can do it this way, that's definitely preferred.  I
> was providing you with something that would work even without the
> separate boolean field.
>
> If the cardinality of the field you're searching is very low (only a few
> possible values for that field across the whole index) then a wildcard
> query can be fast.  It is only when the cardinality is high that the
> wildcard query is slow.  Still, it is better to use the range query for
> determining whether the field exists, unless you have a separate boolean
> field for that purpose, in which case the boolean query will be a little
> bit faster.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO
*] doesn't seem materially different compared to has_field:1. If no one
knows why Lucene optimizes one but not another, it's not clear whether it
even optimizes one to be sure.

On Wed, Apr 17, 2019 at 4:27 PM Shawn Heisey  wrote:

> On 4/17/2019 1:21 PM, John Davis wrote:
> > If what you describe is the case for range query [* TO *], why would
> lucene
> > not optimize field:* similar way?
>
> I don't know.  Low level lucene operation is a mystery to me.
>
> I have seen first-hand that the range query is MUCH faster than the
> wildcard query.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
If what you describe is the case for the range query [* TO *], why would Lucene
not optimize field:* in a similar way?

On Wed, Apr 17, 2019 at 10:36 AM Shawn Heisey  wrote:

> On 4/17/2019 10:51 AM, John Davis wrote:
> > Can you clarify why field:[* TO *] is lot more efficient than field:*
>
> It's a range query.  For every document, Lucene just has to answer two
> questions -- is the value more than any possible value and is the value
> less than any possible value.  The answer will be yes if the field
> exists, and no if it doesn't.  With one million documents, there are two
> million questions that Lucene has to answer.  Which probably seems like
> a lot ... but keep reading.  (Side note:  It wouldn't surprise me if
> Lucene has an optimization specifically for the all inclusive range such
> that it actually only asks one question, not two)
>
> With a wildcard query, there are as many questions as there are values
> in the field.  Every question is asked for every single document.  So if
> you have a million documents and there are three hundred thousand
> different values contained in the field across the whole index, that's
> 300 billion questions.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
Can you clarify why field:[* TO *] is lot more efficient than field:*

On Sun, Apr 14, 2019 at 12:14 PM Shawn Heisey  wrote:

> On 4/13/2019 12:58 PM, John Davis wrote:
> > We noticed a sizable performance degradation when we add certain fq
> filters
> > to the query even though the result set does not change between the two
> > queries. I would've expected solr to optimize internally by picking the
> > most constrained fq filter first, but maybe my understanding is wrong.
>
> All filters cover the entire index, unless the query parser that you're
> using implements the PostFilter interface, the filter cost is set high
> enough, and caching is disabled.  All three of those conditions must be
> met in order for a filter to only run on results instead of the entire
> index.
>
> http://yonik.com/advanced-filter-caching-in-solr/
> https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
>
> Most query parsers don't implement the PostFilter interface.  The lucene
> and edismax parsers do not implement PostFilter.  Unless you've
> specified the query parser in the fq parameter, it will use the lucene
> query parser, and it cannot be a PostFilter.
>
> > Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
>
> If the point of the "field1:*" query clause is "make sure field1 exists
> in the document" then you would be a lot better off with this query clause:
>
> field1:[* TO *]
>
> This is an all-inclusive range query.  It works with all field types
> where I have tried it, and that includes TextField types.   It will be a
> lot more efficient than the wildcard query.
>
> Here's what happens with "field1:*".  If the cardinality of field1 is
> ten million different values, then the query that gets constructed for
> Lucene will literally contain ten million values.  And every single one
> of them will need to be compared to every document.  That's a LOT of
> comparisons.  Wildcard queries are normally very slow.
>
> Thanks,
> Shawn
>
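
Putting Shawn's suggestions side by side as filter queries (placeholder field
names; -G with --data-urlencode just keeps the brackets and spaces readable):

# all-inclusive range query: answers "does field1 exist" without enumerating terms
curl -G 'http://localhost:8983/solr/mycollection/select' \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq=field1:[* TO *]' \
  --data-urlencode 'fq=field2:value'

# slower for a high-cardinality field:              fq=field1:*
# fastest, if a boolean flag is indexed alongside:  fq=has_field1:true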


Re: Optimizing fq query performance

2019-04-13 Thread John Davis
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)

This feels like something that could be optimized internally by tracking the
existence of the field in a doc instead of making users index yet another
field to track existence.

BTW, does this same behavior apply to tlong fields too, where the values
might be more continuous vs. discrete strings?

On Sat, Apr 13, 2019 at 12:30 PM Yonik Seeley  wrote:

> More constrained but matching the same set of documents just guarantees
> that there is more information to evaluate per document matched.
> For your specific case, you can optimize fq = 'field1:* AND field2:value'
> to &fq=field1:*&fq=field2:value
> This will at least cause field1:* to be cached and reused if it's a common
> pattern.
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)
> One can optimize this by indexing a term in a different field to turn it
> into a single term query (i.e. exists:field1)
>
> -Yonik
>
> On Sat, Apr 13, 2019 at 2:58 PM John Davis 
> wrote:
>
> > Hi there,
> >
> > We noticed a sizable performance degradation when we add certain fq
> filters
> > to the query even though the result set does not change between the two
> > queries. I would've expected solr to optimize internally by picking the
> > most constrained fq filter first, but maybe my understanding is wrong.
> > Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
> >
> > If we assume that the result set is identical between the two queries and
> > field1 is in general more frequent in the index, we noticed query1 takes
> > 100x longer than query2. In case it matters field1 is of type tlongs
> while
> > field2 is a string.
> >
> > Any tips for optimizing this?
> >
> > John
> >
>
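
A sketch of the exists-flag idea discussed in this thread, with placeholder
names: set a simple boolean flag at index time, then filter on it with its own
fq clause so it becomes a single-term query with its own filterCache entry.

# index time: carry an explicit flag alongside the optional field
curl 'http://localhost:8983/solr/mycollection/update?commitWithin=60000' \
  -H 'Content-Type: application/json' \
  --data-binary '[{"id": "doc1", "field1": 12345, "has_field1": true}]'

# query time: a single-term filter instead of field1:*
curl -G 'http://localhost:8983/solr/mycollection/select' \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq=has_field1:true' \
  --data-urlencode 'fq=field2:value'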


Optimizing fq query performance

2019-04-13 Thread John Davis
Hi there,

We noticed a sizable performance degradation when we add certain fq filters
to the query even though the result set does not change between the two
queries. I would've expected solr to optimize internally by picking the
most constrained fq filter first, but maybe my understanding is wrong.
Here's an example:

query1: fq = 'field1:* AND field2:value'
query2: fq = 'field2:value'

If we assume that the result set is identical between the two queries and
field1 is in general more frequent in the index, we noticed query1 takes
100x longer than query2. In case it matters field1 is of type tlongs while
field2 is a string.

Any tips for optimizing this?

John


Re: What causes new searcher to be created?

2019-03-10 Thread John Davis
We do add commitWithin=XX when indexing updates, I take it that triggers
new searcher when the commit is made? I was under the wrong impression that
autoCommit openSearcher=false would control those too.

On Sat, Mar 9, 2019 at 9:00 PM Erick Erickson 
wrote:

> Nothing should be opening new searchers in that case unless
> the commit is happening from outside. “Outside” here is a SorlJ
> program that either commits or specifies a commitWithin for an
> add. By default, post.jar also issues a commit at the end.
>
> I’d look at whatever is adding new documents to the system. Does
> your Solr log show any updates and what are the parameters if so?
>
> BTW, the setting for hard commit openSearcher=false _only_ applies
> to autocommits. The default behavior of an explicit commit from
> elsewhere will open a new searcher.
>
> > My assumption is that until a new searcher is created all the
> > newly indexed docs will not be visible
>
> This should be the case. So regardless of what the admin says, _can_
> you see newly indexed documents?
>
> Best,
> Erick
>
> > On Mar 9, 2019, at 7:24 PM, John Davis 
> wrote:
> >
> > Hi there,
> > I couldn't find an answer to this in the docs: if openSearcher is set to
> > false in the autocommit with no softcommits, what triggers a new one to
> be
> > created? My assumption is that until a new searcher is created all the
> > newly indexed docs will not be visible. Based on the solr admin console I
> > do see a new one being created every few minutes but I could not find the
> > parameter that controls it.
> >
> > John
>
>
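
For reference, the knobs being discussed all live in the updateHandler section of
solrconfig.xml; a sketch with illustrative values (commitWithin issues a soft
commit by default, which is what opens the new searcher):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>  <!-- only governs these autocommits -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>-1</maxTime>               <!-- -1 disables soft autocommits -->
  </autoSoftCommit>
  <commitWithin>
    <softCommit>true</softCommit>       <!-- the default behaviour for commitWithin -->
  </commitWithin>
</updateHandler>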


What causes new searcher to be created?

2019-03-09 Thread John Davis
Hi there,
I couldn't find an answer to this in the docs: if openSearcher is set to
false in the autocommit with no softcommits, what triggers a new one to be
created? My assumption is that until a new searcher is created all the
newly indexed docs will not be visible. Based on the solr admin console I
do see a new one being created every few minutes but I could not find the
parameter that controls it.

John


Re: child docs

2019-03-07 Thread John Blythe
thanks for the quick response! that was my inkling from what i've read thus
far, but was curious if any benefits could make it potentially worthwhile.
interested in other "gotchas" the nesting may cause us to incur.

thanks again!

--
John Blythe


On Thu, Mar 7, 2019 at 10:23 AM Erick Erickson 
wrote:

> First of all, if your problem space allows it you're usually better off
> denormalizing for many reasons, not the least of which is that a change to
> any record in a parent/child relationship requires that the entire block be
> re-indexed anyway. Plus, nested docs have quite a number of “gotchas”.
>
> If you really need nested docs, your first option would require the entire
> “platonic” parent and all child docs to be re-indexed every time, so I’d go
> with 2 for that reason alone.
>
> Best,
> Erick
>
> > On Mar 7, 2019, at 10:08 AM, John Blythe  wrote:
> >
> > hi all!
> >
> > curious about how child docs and performance interact.
> >
> > i'll have a bunch of transactions coming in from various entities. i'm
> > debating nesting them all under a single, 'master' parent entity or to
> have
> > the parent and children be entity specific.
> >
> > so either:
> >
> > [platonic ideal parent item]
> >child1: {entity1, tranx1}
> >child2: {entity1, tranx2}
> >child3: {entity2, tranx3}
> >child4: {entity3, tranx4}
> >
> > VS.
> >
> > [entity1's parent item]
> >child1: {tranx1}
> >child2: {tranx2}
> > [entity2's parent item]
> >child1: {tranx3}
> > [entity3's parent item]
> >    child1: {tranx4}
> >
> > could be up to several hundred child docs per entity, though usually will
> > be double digits only (per entity), sometimes as low as < 10.
> >
> > hope this makes sense. thanks for any insight!
> >
> > best,
> > --
> > John Blythe
>
>


child docs

2019-03-07 Thread John Blythe
hi all!

curious about how child docs and performance interact.

i'll have a bunch of transactions coming in from various entities. i'm
debating nesting them all under a single, 'master' parent entity or to have
the parent and children be entity specific.

so either:

[platonic ideal parent item]
child1: {entity1, tranx1}
child2: {entity1, tranx2}
child3: {entity2, tranx3}
child4: {entity3, tranx4}

VS.

[entity1's parent item]
child1: {tranx1}
child2: {tranx2}
[entity2's parent item]
child1: {tranx3}
[entity3's parent item]
child1: {tranx4}

could be up to several hundred child docs per entity, though usually will
be double digits only (per entity), sometimes as low as < 10.

hope this makes sense. thanks for any insight!

best,
--
John Blythe


Improve indexing speed?

2019-01-01 Thread John Milton
Hi to all,

My documents contain 65 fields, and all of them need to be indexed, but
indexing 100 documents takes 10 seconds.
I am using Solr 7.5 (2 cloud instances), with 50 shards.
It's running on Windows and has 32 GB RAM, with a 15 GB Java heap.
How can I improve indexing speed?
Note:
All the fields contain at most 20 characters. The field type is text_general,
case-insensitive.

Thanks,
John Milton


PC hang while running Solr cloud instance?

2018-12-30 Thread John Milton
Wish you happy new year to you all.

Hi,

I am running my SolrCloud 7.5 instance on Windows. It has 100 shards
with 4 replicas each.

My PC is hanging, and CPU and memory usage are at 95%.
Each PC has 16 GB of RAM.
The PCs are idle; at the moment no indexing or searching is happening,
but Task Manager still shows 95% CPU and memory usage.

How can I solve this problem?

Thanks,
John Milton


Config change needs reindex?

2018-12-21 Thread John Milton
Hi Solr Team,

We are using SolrCloud and storing all of our application logs in Solr.
If I add a new field or a copy field for some feature, change my schema, and
upload it to ZooKeeper, do I need to reindex all the data, or is a restart
enough?

Thanks,
John Milton


Re: unsubscribe

2018-12-07 Thread John Santosuosso
Unsubscribe 


Sent from Yahoo Mail for iPhone


On Friday, December 7, 2018, 9:57 AM, samuel kim  
wrote:



Sent from Outlook



From: samuel kim
Sent: Monday, July 31, 2017 3:48 PM
To: solr-user@lucene.apache.org
Subject: unsubscribe

unsubscribe




Re: Solr on Java 11?

2018-11-30 Thread John Gallagher
We're interested in this as well.  It is tracked here -
https://issues.apache.org/jira/browse/SOLR-12809

And you can see the test status for different JDKs here:
https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/

8 and 9 pass completely; 10, 11, 12ea don't

On Fri, Nov 30, 2018 at 10:23 AM Webster Homer <
webster.ho...@milliporesigma.com> wrote:

> My company is planning on upgrading our stack to use Java 11. What version
> of Solr is planned to be supported on Java 11?
> We won't be doing this immediately as several of our key components are
> not yet been ported to 11, but we want to plan for it.
>
> Thanks,
> Webster
>


Re: Time-Routed Alias Not Distributing Wrongly Placed Docs

2018-11-30 Thread John Nashorn
Hi Gus, thanks  for writing a detailed answer. I've written some bits between 
quotings from your post.

On 2018/11/30 05:15:10, Gus Heck  wrote: 
> Hi John,
> 
> TRA's really do require that you index via the alias. Internally the code
> is wrapping the Distributed Update Processor with an additional processor
> to handle the time routing when (and only when) the TRA alias is detected.
> If the alias is not used, none of the TRA code runs (by design, for
> performance). TRA's have no capability at all to re-assign docs once they
> are implemented since the process is data driven during update only, with
> no internal maintenance threads (again by design).  It is not even
> supported at this time to update the date on which the document was routed
> via atomic updates for example. One would have to delete and re-index the
> document (in that order, waiting for one to complete!) Adding some sort of
> "fixer thread" is not something that would make much sense, since we don't
> want to ever have the TRA's storing documents in the wrong place to
> begin with.
> 
> TRA's are targeted at systems where new data items arrive regularly, can be
> placed in the right place correctly up front and the timestamp is immutable
> (typical for IOT readings, log or event based types of data for example).
> 
> I think you will probably need to follow up with Lucidworks to get them to
> add a feature to allow TRA's as targets if TRA's still sound like they fit
> your use case. (or pursue another solution without limitations on the
> indexing target)
> 

I know that I'm using TRA out of its designed way, though my scenario would 
perfectly fit for TRA if I were able to use alias name with "hive-solr". I have 
reported the issue to hive-solr devs: 
https://github.com/lucidworks/hive-solr/issues/63

> 
> Frankly, it's a mystery to me how you even got any docs in the October
> collection you list in your question. For anything to have been
> distributed, it would have had to go through the alias. Also, how you have
> more than one collection is a mystery unless you manually inserted a doc at
> some point to cause collection creation perhaps?
> 

Maybe it's the example that got you confused; I might have over-summarized it
while trying to trim. Let me clarify things a little bit: My data ranges from
2013-01-01 to NOW and continues to grow. I've created a TRA beginning at
2013-01-01, adding a new collection on a monthly basis. I began indexing data
from newest to oldest. Since hive-solr threw an NPE when used against the TRA
name, I was sending data to an external table created for the 2013-01-01
collection. When the first document was indexed, I saw that all the collections
between 2013-01-01 and 2018-10-01 were created, and the docs were indexed into
2018-10-01, then 2018-09-01, then 2018-08-01... But after some point, say
2017-02-01, it stopped this routing and all documents went into the 2013-01-01
collection.
I didn't manually insert any documents to cause creation of collections.

> 
> It's also worth noting that without the routing and maintenance features
> tied to the alias TRA's give very little benefit, and there are other ways
> of solving this problem with external solutions. Dave, my co-presenter at
> Activate 2018 talks about a couple of other options in the middle section
> of our talk
> https://www.youtube.com/watch?v=RB1-7Y5NQeI&index=59&list=PLU6n9Voqu_1HW8-VavVMa9lP8-oF8Oh5t&t=0s
> 
> 
> The part describing TRA's in detail starts at 14 min and 17 to 23 min
> discusses predecessors and alternatives
> 
> -Gus
> 
> On Tue, Nov 27, 2018 at 12:42 PM John Nashorn  wrote:
> 
> > Hello Everyone,
> > I'm using "hive-solr" from Lucidworks to index my data into Solr (v:7.5,
> > cloud mode). As written in the Solr Manual, TRA expects documents to be
> > indexed using its alias name, and not directly into the collections under
> > it. Unfortunately, hive-solr doesn't allow using TRA names as indexing
> > targets. So what I do is: I index data using the first collection created
> > by TRA and expect Solr to distribute my data into its respective collection
> > under the hood. This works to some extent, but a big portion of data stays
> > in where they were indexed, ie. the first collection of the TRA. For
> > example (approximate numbers):
> >
> > * coll_2018-07-01 => 800.000.000 docs
> > * coll_2018-08-01 => 0 docs
> > * coll_2018-09-01 => 0 docs
> > * coll_2018-10-01 => 150.000.000 docs
> > * coll_2018-11-01 => 0 docs
> >
> > Here, coll_2018-07-01 contains data that should normally be in the other
> > four collections.
> >
> > Is there a way to make TRA scan (somehow intentionally) misplaced data and
> > send them to their correct places?
> >
> 
> 
> -- 
> http://www.the111shift.com
> 


Re: is SearchComponent the correct way?

2018-11-29 Thread John Thorhauer
So my understanding is that the DelegatingCollector.collect() method has
access to a single doc.  At that point I must choose to either call
super.collect() or not.  So this is the point at which I have to check
redis for security data for a single doc and determine if this doc should
be allowed as part of the result set or not.  So it seems that I have to
check my redis cache one doc at a time since I am only provided one doc in
the collect() method and I must determine at this point if I should call
the super.collect() or not.

I would like to find an option where I can get all the docs in the
postfilter and run a single query to redis with all of the docs at once to
get a single answer back from redis and then determine, based on the redis
response, which of the docs should be allowed to pass thru my postfilter.
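
For what it's worth, here is a rough sketch of the per-segment buffering idea:
method names are written from memory against the 7.x DelegatingCollector and
PostFilter APIs, so verify them against your Solr version; allowedByRedis()
below is a hypothetical stub standing in for the single batched redis call, and
scoring is ignored (see CollapsingQParserPlugin in the Solr source for a
production-grade buffering collector).

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

// Sketch only: buffer one segment's worth of docs, resolve them against redis
// in a single batched call, then delegate just the allowed ones.
public class RedisAclPostFilter extends ExtendedQueryBase implements PostFilter {

  @Override
  public boolean getCache() {
    return false;                           // post filters must not be cached
  }

  @Override
  public int getCost() {
    return Math.max(super.getCost(), 100);  // cost >= 100 => run as post filter
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      private final List<Integer> buffered = new ArrayList<>();

      @Override
      public void collect(int doc) throws IOException {
        buffered.add(doc);                  // just remember the doc for now
      }

      @Override
      protected void doSetNextReader(LeafReaderContext context) throws IOException {
        flush();                            // resolve the previous segment first
        super.doSetNextReader(context);     // then let the delegate switch segments
      }

      @Override
      public void finish() throws IOException {
        flush();                            // resolve the last segment
        super.finish();
      }

      private void flush() throws IOException {
        if (buffered.isEmpty()) {
          return;
        }
        Set<Integer> allowed = allowedByRedis(buffered);  // ONE batched lookup
        for (int doc : buffered) {
          if (allowed.contains(doc)) {
            super.collect(doc);             // pass it down the collector chain
          }
        }
        buffered.clear();
      }

      // Hypothetical helper: map the buffered (segment-local) docs to your
      // security keys, e.g. via docValues, and issue a single pipelined redis
      // request. Stubbed here to allow everything.
      private Set<Integer> allowedByRedis(List<Integer> docs) {
        return new HashSet<>(docs);
      }
    };
  }

  // Query requires equals/hashCode; identity is good enough for a sketch.
  @Override
  public boolean equals(Object other) {
    return other == this;
  }

  @Override
  public int hashCode() {
    return System.identityHashCode(this);
  }
}

The point is that collect() only remembers the doc; the expensive lookup happens
once per segment in flush(), and only the surviving docs are handed to
super.collect().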




On Fri, Nov 16, 2018 at 4:30 PM Mikhail Khludnev  wrote:

> On Tue, Nov 13, 2018 at 6:36 AM John Thorhauer 
> wrote:
>
> > Mikhail,
> >
> > Where do I implement the buffering?  I can not do it in then collect()
> > method.
>
> Please clarify why exactly? Notice my statement about one segment only.
>
>
> > I can not see how I can get access to what I need in the finish()
> > method.
> >
> > Thanks,
> > John
> >
> > On Tue, Nov 6, 2018 at 12:44 PM Mikhail Khludnev 
> wrote:
> >
> > > Not really. It expect to work segment by segment. So it can buffer all
> > doc
> > > from one segment, hit redis and push all results into delegating
> > collector.
> > >
> > > On Tue, Nov 6, 2018 at 8:29 PM John Thorhauer 
> > > wrote:
> > >
> > > > Mikhail,
> > > >
> > > > Thanks for the suggestion.  After looking over the PostFilter
> interface
> > > and
> > > > the DelegatingCollector, it appears that this would require me to
> query
> > > my
> > > > outside datastore (redis) for security information once for each
> > > document.
> > > > This would be a big performance issue.  I would like to be able to
> > > iterate
> > > > through the documents, gathering all the critical ID's and then send
> a
> > > > single query to redis, getting back my security related data, and
> then
> > > > iterate through the documents, pulling out the ones that the user
> > should
> > > > not see.
> > > >
> > > > Is this possible?
> > > >
> > > > Thanks again for your help!
> > > > John
> > > >
> > > >
> > > > On Tue, Nov 6, 2018 at 6:24 AM John Thorhauer <
> jthorha...@yakabod.com>
> > > > wrote:
> > > >
> > > > > We have a need to check the results of a search against a set of
> > > security
> > > > > lists that are maintained in a redis cache.  I need to be able to
> > take
> > > > each
> > > > > document that is returned for a search and check the redis cache to
> > see
> > > > if
> > > > > the document should be displayed or not.
> > > > >
> > > > > I am attempting to do this by creating a SearchComponent.  I am
> able
> > to
> > > > > iterate thru the results and identify the items I want to remove
> from
> > > the
> > > > > results but I am not sure how to proceed in removing them.
> > > > >
> > > > > Is SearchComponent the best way to do this?  If so, any thoughts on
> > how
> > > > to
> > > > > proceed?
> > > > >
> > > > >
> > > > > Thanks,
> > > > > John Thorhauer
> > > > >
> > > > >
> > > >
> > > > --
> > > > John Thorhauer
> > > > Vice President, Software Development
> > > > Yakabod, Inc.
> > > > Cell: 240-818-9050
> > > > Office: 301-662-4554 x2105
> > > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> >
> >
> > --
> > John Thorhauer
> > Vice President, Software Development
> > Yakabod, Inc.
> > Cell: 240-818-9050
> > Office: 301-662-4554 x2105
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


How to implement ssl for Solr cloud?

2018-11-27 Thread John Milton
Hi Solr Team,

In my SolrCloud cluster, I have an external 3-node ZooKeeper ensemble and 2
SolrCloud instances.

Do I need to implement SSL for all of the available Solr instances?

Based on the SSL implementation, is any additional configuration needed in
ZooKeeper?


Thanks,
John Milton


Time-Routed Alias Not Distributing Wrongly Placed Docs

2018-11-27 Thread John Nashorn
Hello Everyone,
I'm using "hive-solr" from Lucidworks to index my data into Solr (v:7.5, cloud 
mode). As written in the Solr Manual, TRA expects documents to be indexed using 
its alias name, and not directly into the collections under it. Unfortunately, 
hive-solr doesn't allow using TRA names as indexing targets. So what I do is: I 
index data using the first collection created by TRA and expect Solr to 
distribute my data into its respective collection under the hood. This works to 
some extent, but a big portion of data stays in where they were indexed, ie. 
the first collection of the TRA. For example (approximate numbers):

* coll_2018-07-01 => 800.000.000 docs
* coll_2018-08-01 => 0 docs
* coll_2018-09-01 => 0 docs
* coll_2018-10-01 => 150.000.000 docs
* coll_2018-11-01 => 0 docs

Here, coll_2018-07-01 contains data that should normally be in the other four 
collections.

Is there a way to make TRA scan (somehow intentionally) misplaced data and send 
them to their correct places?


How to use multiple data drives?

2018-11-15 Thread John Milton
Hi Solr Team,

I have installed Solr on Windows, on my C drive,

and I have made the D drive the data directory.

If the D drive is almost full, can I use the other drives to
store data?

In other words, if the current data directory is getting full, I need to use
multiple drives as the data directory.

Is it possible to do this with SolrCloud?


Thanks,
John Milton


Solr cloud change collection index directory

2018-11-13 Thread John Milton
Hi Solr Team,

I am using SolrCloud on Windows. For that I am using the following:

1. Three external Zookeeper servers (3.4.13)

2. Two Solr cloud nodes (7.5.0)


All of the created collections live under the Solr installation
directory only.

I want to move the data directory to another drive; for example, if my
Solr instance is installed on the C drive, I want to store all of the
collection indexes on the D drive.

How can I achieve this in Solr 7.5? Kindly suggest how to solve
this...


Thanks,
John Milton


Re: is SearchComponent the correct way?

2018-11-12 Thread John Thorhauer
Mikhail,

Where do I implement the buffering?  I can not do it in then collect()
method.  I can not see how I can get access to what I need in the finish()
method.

Thanks,
John

On Tue, Nov 6, 2018 at 12:44 PM Mikhail Khludnev  wrote:

> Not really. It expect to work segment by segment. So it can buffer all doc
> from one segment, hit redis and push all results into delegating collector.
>
> On Tue, Nov 6, 2018 at 8:29 PM John Thorhauer 
> wrote:
>
> > Mikhail,
> >
> > Thanks for the suggestion.  After looking over the PostFilter interface
> and
> > the DelegatingCollector, it appears that this would require me to query
> my
> > outside datastore (redis) for security information once for each
> document.
> > This would be a big performance issue.  I would like to be able to
> iterate
> > through the documents, gathering all the critical ID's and then send a
> > single query to redis, getting back my security related data, and then
> > iterate through the documents, pulling out the ones that the user should
> > not see.
> >
> > Is this possible?
> >
> > Thanks again for your help!
> > John
> >
> >
> > On Tue, Nov 6, 2018 at 6:24 AM John Thorhauer 
> > wrote:
> >
> > > We have a need to check the results of a search against a set of
> security
> > > lists that are maintained in a redis cache.  I need to be able to take
> > each
> > > document that is returned for a search and check the redis cache to see
> > if
> > > the document should be displayed or not.
> > >
> > > I am attempting to do this by creating a SearchComponent.  I am able to
> > > iterate thru the results and identify the items I want to remove from
> the
> > > results but I am not sure how to proceed in removing them.
> > >
> > > Is SearchComponent the best way to do this?  If so, any thoughts on how
> > to
> > > proceed?
> > >
> > >
> > > Thanks,
> > > John Thorhauer
> > >
> > >
> >
> > --
> > John Thorhauer
> > Vice President, Software Development
> > Yakabod, Inc.
> > Cell: 240-818-9050
> > Office: 301-662-4554 x2105
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
John Thorhauer
Vice President, Software Development
Yakabod, Inc.
Cell: 240-818-9050
Office: 301-662-4554 x2105


Re: is SearchComponent the correct way?

2018-11-06 Thread John Thorhauer
Mikhail,

Thanks for the suggestion.  After looking over the PostFilter interface and
the DelegatingCollector, it appears that this would require me to query my
outside datastore (redis) for security information once for each document.
This would be a big performance issue.  I would like to be able to iterate
through the documents, gathering all the critical ID's and then send a
single query to redis, getting back my security related data, and then
iterate through the documents, pulling out the ones that the user should
not see.

Is this possible?

Thanks again for your help!
John


On Tue, Nov 6, 2018 at 6:24 AM John Thorhauer 
wrote:

> We have a need to check the results of a search against a set of security
> lists that are maintained in a redis cache.  I need to be able to take each
> document that is returned for a search and check the redis cache to see if
> the document should be displayed or not.
>
> I am attempting to do this by creating a SearchComponent.  I am able to
> iterate thru the results and identify the items I want to remove from the
> results but I am not sure how to proceed in removing them.
>
> Is SearchComponent the best way to do this?  If so, any thoughts on how to
> proceed?
>
>
> Thanks,
> John Thorhauer
>
>

-- 
John Thorhauer
Vice President, Software Development
Yakabod, Inc.
Cell: 240-818-9050
Office: 301-662-4554 x2105


is SearchComponent the correct way?

2018-11-06 Thread John Thorhauer
We have a need to check the results of a search against a set of security
lists that are maintained in a redis cache.  I need to be able to take each
document that is returned for a search and check the redis cache to see if
the document should be displayed or not.

I am attempting to do this by creating a SearchComponent.  I am able to
iterate thru the results and identify the items I want to remove from the
results but I am not sure how to proceed in removing them.

Is SearchComponent the best way to do this?  If so, any thoughts on how to
proceed?


Thanks,
John Thorhauer


Re: More Like This Query problems

2018-10-18 Thread John Bickerstaff
Found it.

My SOLR does NOT store fields and after some careful checking, it turns out
we do NOT do term vectors either...  So, according to the docs, MLT will
not work.
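
For anyone else who hits this: the fields named in mlt.fl need to be either
stored or indexed with term vectors, e.g. something along these lines in the
schema (field name illustrative):

<field name="Field1" type="text_general" indexed="true" stored="true" termVectors="true"/>

Term vectors cost some extra index space but save MLT from re-analyzing stored
text when it builds its query.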

Thanks for the response David!

On Thu, Oct 18, 2018 at 1:44 PM John Bickerstaff 
wrote:

> Thanks. There are many docs with matching words  I've tried an
> extremely simplified case where a basic query (q=Field1:"foo") returns
> millions of results... however a MLT similar to the one I mention below,
> using a doc Id I know has "foo" in Field1 returns only the same Doc ID as
> submitted in the query.
>
>
> http://XX.XXX.XX.XXX:10001/solr/BPS/select?indent=on&q=Field1:%22foo%22&wt=json
> (Returns several million as "numFound)
>
>
> http://XX.XXX.XX.XXX:10001/solr/BPS/select?indent=on&mlt.fl=Field1&mlt=true&q=id:%2227000:9009:66%22&wt=json
> (returns only the same ID in the More Like This section)
>
> Wouldn't the AND NOT just eliminate my initial doc Id from the list?
> Assuming matches, we would still expect other ids to be returned in any
> case, wouldn't we?  Should that be a Filter Query?
>
> On Thu, Oct 18, 2018, 12:57 PM David Hastings 
> wrote:
>
>> Make sure your query has an “AND NOT id:your doc id”
>> Also be certain there are other documents that will meet your criteria
>> for a test case. Remember it’s unique words in your core/collection
>>
>> On Oct 18, 2018, at 2:43 PM, John Bickerstaff > <mailto:j...@johnbickerstaff.com>> wrote:
>>
>> All,
>>
>>
>> I am having trouble with a “more like this” query in Solr.
>>
>>
>> Here’s what I think should be happening:
>>
>>
>> 1. Query contains Document ID (q=id:"942316176:9009:66
>> <
>> http://10.157.117.55:10001/solr/BPS/select?=&debug=true&indent=on&mlt.fl=surnames,genders,givennames,birthlocations,deathlocations&mlt=true&q=id:%22942316176:9009:66%22&wt=json
>> >
>> ”)
>>
>> 2. I add the following (on the solr admin page, raw query parameters
>> field)
>>
>>  &mlt=true&mlt.fl=field1,field2,field3
>>
>> 3. More Like This will take the Document ID, look at the fields (field1,
>> field2, field3) and return a list of documents that have the best match to
>> the contents of those fields in “document Id”
>>
>>
>> What is happening is that I’m getting only one result and it is the same
>> document id as the one I sent in on the query.  What I expected was a list
>> of Doc ID’s for documents that have some kind of match to the submitted
>> Doc
>> ID.
>>
>>
>> Any thoughts or advice would be appreciated.
>>
>>
>> ===
>>
>>
>> Here is an example of the query URL:
>>
>>
>>
>> http://XX.XXX.XXX.XX:10001/solr/BPS/select?=&debug=true&indent=on&mlt.fl=field1,field2,field3&mlt=true&q=id:%22942316176:9009:66%22&wt=json
>> <
>> http://xx.xxx.xxx.xx:10001/solr/BPS/select?=&debug=true&indent=on&mlt.fl=field1,field2,field3&mlt=true&q=id:%22942316176:9009:66%22&wt=json
>> >
>>
>>
>> However, when I submit the query, I get only one document ID returned -
>> the
>> same one I submitted in the first place.
>>
>>
>> Here is the important section of the response:
>>
>>
>> {
>>
>>  "*responseHeader*":{
>>
>>"*zkConnected*":true,
>>
>>"*status*":0,
>>
>>"*QTime*":26,
>>
>>"*params*":{
>>
>>  "*q*":"id:\"942316176:9009:66\"",
>>
>>  "*debug*":"true",
>>
>>  "*mlt*":"true",
>>
>>  "*indent*":"on",
>>
>>  "*mlt.fl*”:”field1,field2,field3",
>>
>>  "*wt*":"json",
>>
>>  "*_*":"1539881180264"}},
>>
>>  "*response*":{"*numFound*":1,"*start*":0,"*maxScore*":1.0,"*docs*":[
>>
>>  {
>>
>>"*id*":"942316176:9009:66",
>>
>>"*_version_*":1611920924010872837}]
>>
>>  },
>>
>>  "*moreLikeThis*":[
>>
>>"942316176:9009:66",{"*numFound*":0,"*start*":0,"*docs*":[]
>>
>>}],
>>
>>  "*debug*":{
>>
>


Re: More Like This Query problems

2018-10-18 Thread John Bickerstaff
Thanks. There are many docs with matching words  I've tried an
extremely simplified case where a basic query (q=Field1:"foo") returns
millions of results... however a MLT similar to the one I mention below,
using a doc Id I know has "foo" in Field1 returns only the same Doc ID as
submitted in the query.

http://XX.XXX.XX.XXX:10001/solr/BPS/select?indent=on&q=Field1:%22foo%22&wt=json
(Returns several million as "numFound)

http://XX.XXX.XX.XXX:10001/solr/BPS/select?indent=on&mlt.fl=Field1&mlt=true&q=id:%2227000:9009:66%22&wt=json
(returns only the same ID in the More Like This section)

Wouldn't the AND NOT just eliminate my initial doc Id from the list?
Assuming matches, we would still expect other ids to be returned in any
case, wouldn't we?  Should that be a Filter Query?

On Thu, Oct 18, 2018, 12:57 PM David Hastings  wrote:

> Make sure your query has an “AND NOT id:your doc id”
> Also be certain there are other documents that will meet your criteria for
> a test case. Remember it’s unique words in your core/collection
>
> On Oct 18, 2018, at 2:43 PM, John Bickerstaff  <mailto:j...@johnbickerstaff.com>> wrote:
>
> All,
>
>
> I am having trouble with a “more like this” query in Solr.
>
>
> Here’s what I think should be happening:
>
>
> 1. Query contains Document ID (q=id:"942316176:9009:66
> <
> http://10.157.117.55:10001/solr/BPS/select?=&debug=true&indent=on&mlt.fl=surnames,genders,givennames,birthlocations,deathlocations&mlt=true&q=id:%22942316176:9009:66%22&wt=json
> >
> ”)
>
> 2. I add the following (on the solr admin page, raw query parameters field)
>
>  &mlt=true&mlt.fl=field1,field2,field3
>
> 3. More Like This will take the Document ID, look at the fields (field1,
> field2, field3) and return a list of documents that have the best match to
> the contents of those fields in “document Id”
>
>
> What is happening is that I’m getting only one result and it is the same
> document id as the one I sent in on the query.  What I expected was a list
> of Doc ID’s for documents that have some kind of match to the submitted Doc
> ID.
>
>
> Any thoughts or advice would be appreciated.
>
>
> ===
>
>
> Here is an example of the query URL:
>
>
>
> http://XX.XXX.XXX.XX:10001/solr/BPS/select?=&debug=true&indent=on&mlt.fl=field1,field2,field3&mlt=true&q=id:%22942316176:9009:66%22&wt=json
> <
> http://xx.xxx.xxx.xx:10001/solr/BPS/select?=&debug=true&indent=on&mlt.fl=field1,field2,field3&mlt=true&q=id:%22942316176:9009:66%22&wt=json
> >
>
>
> However, when I submit the query, I get only one document ID returned - the
> same one I submitted in the first place.
>
>
> Here is the important section of the response:
>
>
> {
>
>  "*responseHeader*":{
>
>"*zkConnected*":true,
>
>"*status*":0,
>
>"*QTime*":26,
>
>"*params*":{
>
>  "*q*":"id:\"942316176:9009:66\"",
>
>  "*debug*":"true",
>
>  "*mlt*":"true",
>
>  "*indent*":"on",
>
>  "*mlt.fl*”:”field1,field2,field3",
>
>  "*wt*":"json",
>
>  "*_*":"1539881180264"}},
>
>  "*response*":{"*numFound*":1,"*start*":0,"*maxScore*":1.0,"*docs*":[
>
>  {
>
>"*id*":"942316176:9009:66",
>
>"*_version_*":1611920924010872837}]
>
>  },
>
>  "*moreLikeThis*":[
>
>"942316176:9009:66",{"*numFound*":0,"*start*":0,"*docs*":[]
>
>}],
>
>  "*debug*":{
>


More Like This Query problems

2018-10-18 Thread John Bickerstaff
All,


I am having trouble with a “more like this” query in Solr.


Here’s what I think should be happening:


1. Query contains Document ID (q=id:"942316176:9009:66

”)

2. I add the following (on the solr admin page, raw query parameters field)

  &mlt=true&mlt.fl=field1,field2,field3

3. More Like This will take the Document ID, look at the fields (field1,
field2, field3) and return a list of documents that have the best match to
the contents of those fields in “document Id”


What is happening is that I’m getting only one result and it is the same
document id as the one I sent in on the query.  What I expected was a list
of Doc ID’s for documents that have some kind of match to the submitted Doc
ID.


Any thoughts or advice would be appreciated.


===


Here is an example of the query URL:


http://XX.XXX.XXX.XX:10001/solr/BPS/select?=&debug=true&indent=on&mlt.fl=field1,field2,field3&mlt=true&q=id:%22942316176:9009:66%22&wt=json


However, when I submit the query, I get only one document ID returned - the
same one I submitted in the first place.


Here is the important section of the response:


{

  "*responseHeader*":{

"*zkConnected*":true,

"*status*":0,

"*QTime*":26,

"*params*":{

  "*q*":"id:\"942316176:9009:66\"",

  "*debug*":"true",

  "*mlt*":"true",

  "*indent*":"on",

  "*mlt.fl*”:”field1,field2,field3",

  "*wt*":"json",

  "*_*":"1539881180264"}},

  "*response*":{"*numFound*":1,"*start*":0,"*maxScore*":1.0,"*docs*":[

  {

"*id*":"942316176:9009:66",

"*_version_*":1611920924010872837}]

  },

  "*moreLikeThis*":[

"942316176:9009:66",{"*numFound*":0,"*start*":0,"*docs*":[]

}],

  "*debug*":{


Re: Faceting with a multi valued field

2018-09-25 Thread John Blythe
you can update your filter query to be a facet query, this will apply the
query to the resulting facet set instead of the Communities field itself.
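
(Concretely, that means sending something like
facet=on&facet.query=Communities:"BANFF TRAIL - BNF"; the count for the
selected value then comes back under facet_queries instead of the facet_fields
breakdown that lists every co-occurring community. Parameter values here are
just illustrative.)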

--
John Blythe


On Tue, Sep 25, 2018 at 4:15 PM Hanjan, Harinder 
wrote:

> Hello!
>
> I am doing faceting on a field which has multiple values, and it's yielding
> expected but undesirable results. I need different behaviour but am not sure
> how to formulate a query for it. Here is my current setup.
>
> = Data Set =
>   {
> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"],
> "Document Type":"Engagement - What We Heard Report",
> "Navigation":"Livelink",
> "SolrId":"http://thesimpsons.com/one";
>   }
>   {
> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"],
> "Document Type":"Engagement - What We Heard Report",
> "Navigation":"Livelink",
> "Id":"http://thesimpsons.com/two";
>   }
>   {
> "Communities":["SUNALTA - SNA"],
> "Document Type":"Engagement - What We Heard Report",
> "Navigation":"Livelink",
> "Id":"http://thesimpsons.com/three";
>   }
>
> = Query I run now =
>
> http://localhost:8984/solr/everything/select?q=*:*&facet=on&facet.field=Communities&fq=Communities:"BANFF
> TRAIL - BNF"
>
>
> = Results I get now =
> {
>   ...
>   "facet_counts":{
> "facet_queries":{},
> "facet_fields":{
>   "Communities":[
> "BANFF TRAIL - BNF",2,
> "PARKDALE - PKD",2,
> "SUNALTA - SNA",0]},
>...
>
> Notice that the Communities facet has 2 non zero results. I understand
> this is because I'm using fq to get only documents which contain BANFF
> TRAIL but those documents also contain PARKDALE.
>
> Now, I am using facets to drive navigation on my page. The business case
> is that user can select a community to get documents pertaining to that
> specific community only. This works with the query I have above. However,
> the facets results also contain other communities which then get displayed
> to the user. For example, with the query above, user will see both BANFF
> TRAIL and PARKDALE as selected values even though user only selected BANFF
> TRAIL. It's worthwhile noting that I have no control over the data being
> sent to Solr and can't change it.
>
> How can I formulate a query to ensure that when user selects BANFF TRAIL,
> only BANFF TRAIL is returned under Solr facets?
>
> Thanks!
> Harinder
>
> 
> NOTICE -
> This communication is intended ONLY for the use of the person or entity
> named above and may contain information that is confidential or legally
> privileged. If you are not the intended recipient named above or a person
> responsible for delivering messages or communications to the intended
> recipient, YOU ARE HEREBY NOTIFIED that any use, distribution, or copying
> of this communication or any of the information contained in it is strictly
> prohibited. If you have received this communication in error, please notify
> us immediately by telephone and then destroy or delete this communication,
> or return it to us by mail if requested by us. The City of Calgary thanks
> you for your attention and co-operation.
>


Term insight with facets

2018-09-22 Thread John Blythe
Hi all.

I’m trying to do some analysis on our data. Specifically hoping to see the
top and bottom x% of terms in a particular field. I’d prefer to not loop
through all terms by way of an updated offset if I can help it.

Is there any cool way to request in the facet query a subset based on a
dynamically set limit?

Thanks for any thoughts!
-- 
John Blythe


admin auth

2018-09-21 Thread John Blythe
hi everyone!

we had authentication setup for our cloud deploy that for some reason or
another disappeared after some updates. we didn't realize it immediately so
aren't sure what triggered the change. curl requests still require auth but
our admin panel is accessible.

further, our local setup *does* require it.

any low hanging fruit ideas we could try out to help resolve this?

thanks!

--
John Blythe


Re: 6.x to 7.x differences

2018-09-12 Thread John Blythe
thanks, shawn. yep, i saw the multi term synonym discussion when googling
around a bit after your first reply. pretty jazzed about finally getting to
tinker w that instead of creating our regex ducktape solution
for_multi_term_synonyms!

thanks again-

--
John Blythe


On Wed, Sep 12, 2018 at 2:15 PM Shawn Heisey  wrote:

> On 9/12/2018 8:12 AM, John Blythe wrote:
> > shawn: at first, no. we rsynced data up after running it through the
> > migration tool. we'd gotten errors when using WDF so updated all
> instances
> > of it to WDGF (and subsequently added FlattenGraphFilterFactory to each
> > index analyzer that used WDGF to avoid errors).
>
> The messages you get in the log from WDF are not errors. They are
> warnings.  Just letting you know that the filter will be removed in the
> next major version.
>
> > the sow seems to be the key here. adding that to the query url dropped me
> > from +19k to 62 results lol. 'subtle' is a not so subtle understatement
> in
> > this case! i'm a big fan of finally being able to not be driven batty by
> > the analysis vs. query results though, so looking forward to playing w
> that
> > some more. for our immediate purposes, however, i think this solves it!
>
> Setting sow=false is a key part of the "graph" nature of the new filters
> that aren't deprecated.  Mostly this is to support multi-word synonyms
> properly.
>
>
> https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/
>
> Thanks,
> Shawn
>
>
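
(Footnote for anyone reading this later: sow is an ordinary request parameter,
so the change referred to above is just appending something like &sow=true to
the query, or setting it in the handler defaults, to get the pre-7.x
split-on-whitespace behaviour back; the 7.x default of sow=false is what
enables the multi-word synonym handling described in the linked article.)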


enquoted searches

2018-09-12 Thread John Blythe
hi (again!). hopefully this will be the last question for a while—i've
really gotten my money's worth the last day or two :)

searches like "foo bar" aren't working the same way they used to for us
since our 7.4 upgrade this weekend.

in both cases our phrase was wrapped in double quotes. the case that
performed as expected had the quotes escaped with a backslash.

this is the debug info from the one that is working as expected:

"parsedquery":"text:craniofacial text:bone text:screw
> (text:nonbioabsorbable PhraseQuery(text:\"non bioabsorbable\"))
> text:sterile",
> "parsedquery_toString":"text:craniofacial text:bone text:screw
> (text:nonbioabsorbable text:\"non bioabsorbable\") text:sterile",


and the other:

"parsedquery":"SpanNearQuery(spanNear([text:craniofacial, text:bone,
> text:screw, spanOr([text:nonbioabsorbable, spanNear([text:non,
> text:bioabsorbable], 0, true)]), text:sterile], 0, true))",
> "parsedquery_toString":"spanNear([text:craniofacial, text:bone,
> text:screw, spanOr([text:nonbioabsorbable, spanNear([text:non,
> text:bioabsorbable], 0, true)]), text:sterile], 0, true)",


it seems to be related to the spanNear() and/or spanOr() usage that is
injected in the latter case.

this is the query, by the way: "Craniofacial bone screw, non-bioabsorbable,
sterile"

removing ", sterile" will render results as expected, too. from the bit of
reading i did on the spanquery stuff i was thinking that maybe it was
related to positioning issues, specifically with 'sterile'. in the Analysis
tab, however, it's in position 6 in both indexing and querying output.

thanks for any thoughts or assists here!

best,

--
John Blythe


Re: large query producing graph error ... maybe?

2018-09-12 Thread John Blythe
well, it's our general text field that things get dumped into. this special
use case that is sku specific just ends up being done on the general input.

ended up raising the Xss value and i'm able to get results :)

i imagine this is a n00b or stupid question but imma go for it: what would
be the value add on fq + cache=false variation?

thanks for the help!




--
John Blythe


On Wed, Sep 12, 2018 at 1:02 PM Erick Erickson 
wrote:

> Looks like your SKU field is points-based? Strings would probably be
> better, if you switched to points-based it's new code.
>
> And maxBooleanClauses is so old-school ;) You're better off with
> TermsQueryParser, especially if you pre-sort the tokens. see:
> https://lucene.apache.org/solr/guide/6_6/other-parsers.html
>
> Although IIRC this is automagic in recent Solr's. I'd also put it in
> an fq clause with cache=false...
>
> Best,
> Erick
> On Wed, Sep 12, 2018 at 8:27 AM John Blythe  wrote:
> >
> > hey all!
> >
> > i'm having an issue w large queries. one of our use cases is for users to
> > drop in an untold amount of product skus. we previously had our
> > maxBooleanClause limit set to 20k (eek!). but it worked phenomenally well
> > and i think our record amount from a user was ~19k items.
> >
> > we're now on 7.4 Cloud. i'm getting this error when testing with a measly
> > 600 skus:
> >
> >
> org.apache.lucene.util.graph.GraphTokenStreamFiniteStrings.articulationPointsRecurse(GraphTokenStreamFiniteStrings.java:278)\n\tat
> >
> >
> > there's a lot more to the error message but that is the tail end of it
> all
> > and is repeated a lot of times (maybe 600? idk).
> >
> > no mention of maxBooleanClause issues specifically in the output, shows
> as
> > a stack overflow error.
> >
> > is this something we can solve in our solr/cloud/zk configuration or is
> it
> > somewhere else to be solved?
> >
> > thanks!
> >
> > --
> > John Blythe
>
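
For reference, the TermsQueryParser suggestion above looks roughly like this on
the request (field name and values illustrative):

fq={!terms f=sku_s cache=false}SKU-001,SKU-002,SKU-003

The terms parser builds a single set-membership filter instead of thousands of
boolean clauses, and cache=false answers the value-add question above: a
one-off pasted list is unlikely to be reused, so keeping it out of the
filterCache avoids evicting entries that are.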


large query producing graph error ... maybe?

2018-09-12 Thread John Blythe
hey all!

i'm having an issue w large queries. one of our use cases is for users to
drop in an untold amount of product skus. we previously had our
maxBooleanClause limit set to 20k (eek!). but it worked phenomenally well
and i think our record amount from a user was ~19k items.

we're now on 7.4 Cloud. i'm getting this error when testing with a measly
600 skus:

org.apache.lucene.util.graph.GraphTokenStreamFiniteStrings.articulationPointsRecurse(GraphTokenStreamFiniteStrings.java:278)\n\tat


there's a lot more to the error message but that is the tail end of it all
and is repeated a lot of times (maybe 600? idk).

no mention of maxBooleanClause issues specifically in the output, shows as
a stack overflow error.

is this something we can solve in our solr/cloud/zk configuration or is it
somewhere else to be solved?

thanks!

--
John Blythe


Re: 6.x to 7.x differences

2018-09-12 Thread John Blythe
hey guys.

preeti: good thought, but this was something we were already aware of and
had accounted for. thanks tho!

shawn: at first, no. we rsynced data up after running it through the
migration tool. we'd gotten errors when using WDF so updated all instances
of it to WDGF (and subsequently added FlattenGraphFilterFactory to each
index analyzer that used WDGF to avoid errors).

the sow seems to be the key here. adding that to the query url dropped me
from +19k to 62 results lol. 'subtle' is a not so subtle understatement in
this case! i'm a big fan of finally being able to not be driven batty by
the analysis vs. query results though, so looking forward to playing w that
some more. for our immediate purposes, however, i think this solves it!

--
John Blythe


On Wed, Sep 12, 2018 at 1:35 AM Preeti Bhat 
wrote:

> Hi John,
>
> Please check the solrQueryParser option, it was removed in 7.4 version, so
> you will need to provide AND in solrconfig.xml or
> give the q.op option while querying to solve this problem. By default solr
> makes it an "OR" operation leading to too many results.
>
> Old Way: In Managed-schema or schema.xml
> <solrQueryParser defaultOperator="AND"/>
>
> New Way: in solrconfig.xml
>
> <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
>   <lst name="defaults">
>     <str name="q.op">AND</str>
>   </lst>
> </initParams>
>
>
> Thanks and Regards,
> Preeti Bhat
>
> -Original Message-
> From: John Blythe [mailto:johnbly...@gmail.com]
> Sent: Wednesday, September 12, 2018 8:02 AM
> To: solr-user@lucene.apache.org
> Subject: 6.x to 7.x differences
>
> hi, all.
>
> we recently migrated to cloud. part of that migration jumped us from 6.1
> to 7.4.
>
> one example query between our old solr instance and our new cloud instance
> produces 42 results and 19k results.
>
> the analyzer is the same aside from WordDelimiterFilterFactory moving over
> to the graph variation of it and the lucene parser moving from 6.1 to 7.4
> obviously.
>
> i've used the analysis tool in solr admin to try to determine the
> difference between the two. i'm seeing the same output between index and
> query results yet when actually running the queries have that huge
> divergence of results.
>
> i'm left scratching my head at this point. i'm guessing it's from the
> lucene parser? hoping to get some clarity from you guys!
>
> thanks!
>
> --
> John Blythe
>
> NOTICE TO RECIPIENTS: This communication may contain confidential and/or
> privileged information. If you are not the intended recipient (or have
> received this communication in error) please notify the sender and
> it-supp...@shoregrp.com immediately, and destroy this communication. Any
> unauthorized copying, disclosure or distribution of the material in this
> communication is strictly forbidden. Any views or opinions presented in
> this email are solely those of the author and do not necessarily represent
> those of the company. Finally, the recipient should check this email and
> any attachments for the presence of viruses. The company accepts no
> liability for any damage caused by any virus transmitted by this email.
>
>
>


Re: parent/child rows in solr

2018-09-11 Thread John Smith
On Tue, Sep 11, 2018 at 11:05 PM Walter Underwood 
wrote:

> Have you tried modeling it with multivalued fields?
>
>
That's an interesting idea, but I don't think that would work. We would
lose the concept of "rows". So let's say child1 has col "a" and col "b",
both are turned into multi-value fields in the solr index. Normally in sql
we can query for a specific value in col "a", and then see what the
associated value in col "b" would be, but we can't do that if we stuff the
col values in multi-value; we can no longer see which value from col "a"
corresponds to which value in col "b". I'm probably explaining that poorly,
but I just don't see how that would work.


Re: parent/child rows in solr

2018-09-11 Thread John Smith
On Tue, Sep 11, 2018 at 11:00 PM Shawn Heisey  wrote:

> On 9/11/2018 8:35 PM, John Smith wrote:
> > The problem is that the math isn't a simple case of adding up all the row
> > counts. These are "left outer join"s. In sql, it would be this query:
>
> I think we'll just have to conclude that I do not understand what you
> are doing.  I have no idea what "left outer join" even means, how it's
> different than a join that's NOT "left outer".
>
> I will say this:  Solr is not very efficient at joins, and there are a
> bunch of caveats involved.  It's usually better to go with a flat
> document space for a search engine.
>
> Thanks,
> Shawn
>
>
A "left outer join" in sql is a join such that if there is no match in the
child table for a given header id, then the child cells are returned as
"null" values, instead of the header row being removed from the result set
(which is what happens in "inner join" or standard sql join).

A good rundown on the various sql joins:
https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join


Re: parent/child rows in solr

2018-09-11 Thread John Smith
On Tue, Sep 11, 2018 at 9:32 PM Shawn Heisey  wrote:

> On 9/11/2018 7:07 PM, John Smith wrote:
> > header:  223,580
> >
> > child1:  124,978
> > child2:  254,045
> > child3:  127,917
> > child4:1,009,030
> > child5:  225,311
> > child6:  381,561
> > child7:  438,315
> > child8:   18,850
> >
> >
> > Trying to index that into solr with a flatfile schema, blows up into
> > 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
>
> I think you're not getting what I'm suggesting.  Or maybe there's an
> aspect of your data that I'm not understanding.
>
> If we add up all those numbers for the child docs, there are 2.5 million
> of them.  So you would have 2.5 million docs in Solr.  I have created
> Solr indexes far larger than this, and I do not consider my work to be
> "big data".  Solr can handle 2.5 million docs easily, as long as the
> hardware resources are sufficient.
>
> Where the data duplication will come in is in additional fields in those
> 2.5 million docs.  Each one will contain some (or maybe all) of the data
> that WOULD have been in the parent document.  The amount of data
> balloons, but the number of documents (rows) doesn't.
>
> That kind of arrangement is usually enough to accomplish whatever is
> needed.  I cannot assume that it will work for your use case, but it
> does work for most.
>
> Thanks,
> Shawn
>
>
The problem is that the math isn't a simple case of adding up all the row
counts. These are "left outer join"s. In sql, it would be this query:

select * from header h
left outer join child1 c1 on c1.hid = h.id
left outer join child2 c2 on c2.hid = h.id
...
left outer join child8 c8 on c8.hid = h.id


If there are 10 rows in child1 linked to 1 header with id "abc", and 10
rows in child2 linked to that same header, then we end up with 10 * 10 rows
in solr, not 20. Considering there are 8 child tables in this example,
there is simply an explosion of data.

I can't describe it much better than that (abstractly), though perhaps I
could put together a simple example with live data. Suffice it to say, in
my example row counts above, that is all "live data" in a relatively small
database of ours, the row counts are real, and the final row count of 5.5
billion was calculated inside sql using that query above:

select count(*) from (
select id from header h
left outer join child1 c1 on c1.hid = h.id
left outer join child2 c2 on c2.hid = h.id
...
left outer join child8 c8 on c8.hid = h.id
) tmp;


6.x to 7.x differences

2018-09-11 Thread John Blythe
hi, all.

we recently migrated to cloud. part of that migration jumped us from 6.1 to
7.4.

one example query between our old solr instance and our new cloud instance
produces 42 results and 19k results.

the analyzer is the same aside from WordDelimiterFilterFactory moving over
to the graph variation of it and the lucene parser moving from 6.1 to 7.4
obviously.

i've used the analysis tool in solr admin to try to determine the
difference between the two. i'm seeing the same output between index and
query results yet when actually running the queries have that huge
divergence of results.

i'm left scratching my head at this point. i'm guessing it's from the
lucene parser? hoping to get some clarity from you guys!

thanks!

--
John Blythe


Re: parent/child rows in solr

2018-09-11 Thread John Smith
>
> On 9/7/2018 7:44 PM, John Smith wrote:
> > Thanks Shawn, for your comments. The reason why I don't want to go flat
> > file structure, is due to all the wasted/duplicated data. If a department
> > has 100 employees, then it's very wasteful in terms of disk space to
> repeat
> > the header data over and over again, 100 times. In this example there is
> > only a few doc types, but my real-life data is much larger, and the
> problem
> > is a "scaling" problem; with just a little bit of data, no problem in
> > duplicating header fields, but with massive amounts of data it's a large
> > problem.
>
> If your goal is data storage, then you are completely correct.  All that
> data duplication is something to avoid for a data storage situation.
> Normalizing your data so it's relational makes perfect sense, because
> most database software is designed to efficiently deal with those
> relationships.
>
> Solr is not designed as a data storage platform, and does not handle
> those relationships efficiently.  Solr's design goals are all about
> *search*.  It often gets touted as filling a NoSQL role ... but it's not
> something I would personally use as a primary data repository.  Search
> is a space where data duplication is expected and completely normal.
> This is something that people often have a hard time accepting.
>
>
I'm not actually trying to use solr as a data storage platform; all our
data is stored in an sql database, we are using solr strictly for the
search features, not storage features.

Here is a good example from a test I ran today. I have a header table, and
8 child tables which link directly to the header table. The children link
only to 1 header row, and they do not link to other children. So a 1:many
between header and each child. Some row counts:

header:  223,580

child1:  124,978
child2:  254,045
child3:  127,917
child4:1,009,030
child5:  225,311
child6:  381,561
child7:  438,315
child8:   18,850


Trying to index that into solr with a flatfile schema, blows up into
5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
left outer join between header and each child and getting a row count in
the database. That's not going to scale, at all, considering the small size
of the source input tables. Some of our indexes would require 50 million
header rows alone, never mind the child tables.

So solr has no way of indexing something like this? I can't believe I would
be the first person to run into this issue, I have a feeling I'm missing
something obvious somewhere.


Re: 504 timeout

2018-09-11 Thread John Blythe
ah, great thought. didn't even think of that. we already have a couple
ngram-based fields. will send over to the stakeholder who was attempting
this.

thanks!

--
John Blythe


On Sun, Sep 9, 2018 at 11:31 PM Erick Erickson 
wrote:

> First of all, wildcards are evil. Be sure that the reason people are
> using wildcards wouldn't be better served by proper tokenizing,
> perhaps something like stemming etc.
>
> Assuming that wildcards must be handled though, there are two main
> strategies:
> 1> if you want to use leading wildcards, look at
> ReverseWildcardFilterFactory. For something like abc* (trailing
> wildcard), conceptually Lucene has to construct a big OR query of
> every term that starts with "abc". That's not hard and is also pretty
> fast, just jump to the first term that starts with "abc" and gather
> all of them (they're sorted lexically) until you get to the first term
> starting with "abd".
>
> _Leading_ wildcards are a whole 'nother story. *abc means that each
> and every distinct term in the field must be enumerated. The first
> term could be abc and the last term in the field zzzabc.
> There's no way to tell without checking every one.
> ReverseWildcardFilterFactory handles indexing the term, well, reversed
> so in the above example not only would the term abc be indexed,
> but also cba. Now both leading and trailing wildcards are
> automagically made into trailing wildcards.
>
> 2> If you must allow leading and trailing wildcards on the same term
> *abc*, consider ngramming, bigrams are usually sufficient. So aaabcde
> is indexed as aa, aa, ab, bc, cd, de and searching for *abc* becomes
> searching for "ab bc".
>
> Both of these make the index larger, but usually by surprisingly
> little. People will also index these variants in separate fields upon
> occasion, it depends on the use-cases needed to support. Ngramming for
> instance would find "ab" in the above (no wildcards)
>
> Best,
> Erick
> On Sun, Sep 9, 2018 at 1:40 PM John Blythe  wrote:
> >
> > hi all. we just migrated to cloud on friday night (woohoo!). everything
> is
> > looking good (great!) overall. we did, however, just run into a hiccup.
> > running a query like this got us a 504 gateway time-out error:
> >
> > **some* *foo* *bar* *query**
> >
> > it was about 6 partials with encapsulating wildcards that someone was
> > running that gave the error. doing 4 or 5 of them worked fine, but upon
> > adding the last one or two it went caput. all operations have been
> zippier
> > since the migration before doing some of those wildcard queries which
> took
> > time (if they worked at all). is this something related directly w our
> > server configuration or is there some solr/cloud config'ing that we could
> > work on that would allow better response to these sorts of queries
> (though
> > it'd be at a cost, i'd imagine!).
> >
> > thanks for any insight!
> >
> > best,
> >
> > --
> > John Blythe
>
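
For reference, option 1> above is an index-time analysis step; the class is
spelled ReversedWildcardFilterFactory in the schema, and the stock example
field type looks roughly like this (attribute values lifted from the default
configset, so treat them as a starting point):

<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The query-side analyzer deliberately leaves the filter out; the query parser
notices the reversed terms in the index and rewrites leading-wildcard queries
against them.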


504 timeout

2018-09-09 Thread John Blythe
hi all. we just migrated to cloud on friday night (woohoo!). everything is
looking good (great!) overall. we did, however, just run into a hiccup.
running a query like this got us a 504 gateway time-out error:

**some* *foo* *bar* *query**

it was about 6 partials with encapsulating wildcards that someone was
running that gave the error. doing 4 or 5 of them worked fine, but upon
adding the last one or two it went caput. all operations have been zippier
since the migration before doing some of those wildcard queries which took
time (if they worked at all). is this something related directly w our
server configuration or is there some solr/cloud config'ing that we could
work on that would allow better response to these sorts of queries (though
it'd be at a cost, i'd imagine!).

thanks for any insight!

best,

--
John Blythe


Re: parent/child rows in solr

2018-09-07 Thread John Smith
Thanks Shawn, for your comments. The reason why I don't want to go flat
file structure, is due to all the wasted/duplicated data. If a department
has 100 employees, then it's very wasteful in terms of disk space to repeat
the header data over and over again, 100 times. In this example there is
only a few doc types, but my real-life data is much larger, and the problem
is a "scaling" problem; with just a little bit of data, no problem in
duplicating header fields, but with massive amounts of data it's a large
problem.

My understanding of both graph traversal and block joins, is that the
header data would only be present once, so that's why I'm gravitating
towards those solutions. I just can't seem to line up the "fq" and queries
correctly such that I am able to join 3+ document types together, filter on
them, and return my requested columns.

On Fri, Sep 7, 2018 at 9:32 PM Shawn Heisey  wrote:

> On 9/7/2018 3:06 PM, John Smith wrote:
> > Hi, I have a document structure like this (this is a made up schema, my
> > data has nothing to do with departments and employees, but the structure
> > holds true to my real data):
> >
> > department 1
> >  employee 11
> >  employee 12
> >  employee 13
> >  room 11
> >  room 12
> >  room 13
> >
> > department 2
> >  employee 21
> >  employee 22
> >  room 21
> >
> > ... etc
> >
> > I'm trying to figure out the best way to index this, and perform queries.
> > Due to the sheer volume of data, I cannot do a simple "flat file"
> approach,
> > repeating the header data for each child entry.
>
> Why not?
>
> For the precise use case you have outlined, Solr will work better if you
> only have the child documents and simply have every document contain a
> "department" field which contains an identifier for the department.
> Since this precise structure is not what you are doing, you'll need to
> adapt what I'm saying to your actual data.
>
> The volume of data should be irrelevant to this decision. Solr will
> always work best with a flat document structure.
>
> I have never used the parent/child document feature in Solr, so I cannot
> offer any advice on it.  Somebody else will need to help you if you
> choose to use that feature.
>
> Thanks,
> Shawn
>
>


parent/child rows in solr

2018-09-07 Thread John Smith
Hi, I have a document structure like this (this is a made up schema, my
data has nothing to do with departments and employees, but the structure
holds true to my real data):

department 1
employee 11
employee 12
employee 13
room 11
room 12
room 13

department 2
employee 21
employee 22
room 21

... etc

I'm trying to figure out the best way to index this, and perform queries.
Due to the sheer volume of data, I cannot do a simple "flat file" approach,
repeating the header data for each child entry.

So that leaves me with "graph traversal" or "block joins". I've played with
both of those, but I'm running into various issues with each approach.

I need to be able to run filters on any or all of the header + child rows
in the same query, at once (can't seem to get that working in either graph
or block join). One problem I had with graph is that I can't force solr to
return the header, then all the children for that header, then the next
header + all it's children, it just spits them out without keeping them
together. block join seems to return the children nested under the parents,
which is great, but then I can't seem to filter on parent + children in the
same query: I get the dreaded error message "Parent query must not match
any docs besides parent filter"

Kinda lost here, any tips/suggestions?
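
As a rough sketch of the block-join variant (field names such as isParent, dept_location and empl_name are made up for illustration), parent-level and child-level conditions are kept on separate levels and the [child] transformer nests the children back under each matching parent:

    q={!parent which="isParent:true"}empl_name:smith
    fq=dept_location:boston
    fl=*,[child parentFilter="isParent:true" childFilter="empl_name:smith"]

The "Parent query must not match any docs besides parent filter" error comes from the {!child} parser: the query it wraps must match only documents that its of= filter marks as parents, so any clause that also hits child rows triggers it; the {!parent} parser has the mirror-image rule for the child query it wraps.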


Re: Local development and SolrCloud

2018-08-23 Thread John Blythe
Thanks everyone. I think we forgot that cloud doesn’t have to be clustered.
That local overhead being avoided makes it a much easier pill to swallow as
far as local performance (vs. having all the extra containers running in
docker)

Will see what we can spin up and ask questions if/as they arise!

On Wed, Aug 22, 2018 at 17:41 Erick Erickson 
wrote:

> I do quite a bit of "correctness" testing on a local stand-alone Solr,
> as Walter says, that's often easier to debug, especially when working
> through creating the proper analysis chains, do queries do what I
> expect and the like.
>
> That said, I'd never jump straight to SolrCloud implementations
> without my QA being on SolrCloud. Not only do subtle differences creep
> in, but some things simply aren't supported, e.g. group.func.
>
> And, as Sameer says, you can set up a SolrCloud environment on just
> your local laptop as many of the examples do for testing, there's
> nothing required about "the cloud" for SolrCloud, it's not even
> necessary to have separate machines.
>
> Best,
> Erick
>
> On Wed, Aug 22, 2018 at 5:34 PM, Walter Underwood 
> wrote:
> > We use Solr Cloud where we need sharding or near real time updates.
> > For non-sharded collections that are updated daily, we use master-slave.
> >
> > There are some scaling and management advantages to the loose
> > coupling in a master slave cluster. Just clone a slave instance and
> > fire it up. Also, load benchmarking is easier when indexing is on a
> > separate instance.
> >
> > In prod, we have 45 Solr hosts in four clusters.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Aug 22, 2018, at 5:23 PM, John Blythe  wrote:
> >>
> >> For those of you who are developing applications with solr and are using
> >> solrcloud in production: what are you doing locally? Cloud seems
> >> unnecessary locally besides testing strictly for cloud specific use
> cases
> >> or configurations. Am I totally off basis there? We are considering
> keeping
> >> a “standard” (read: non-cloud) local solr environment locally for our
> >> development workflow and using cloud only for our remote environments.
> >> Curious to know how wise or stupid that play would be.
> >>
> >> Thanks for any info!
> >> --
> >> John Blythe
> >
>
-- 
John Blythe
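
For a concrete picture of the single-node option discussed above, something along these lines is usually enough on a laptop (collection name and configset path are placeholders):

    bin/solr start -c -p 8983
    bin/solr create -c mycollection -d /path/to/configset -shards 1 -replicationFactor 1
    bin/solr stop -all

The -c flag starts Solr in SolrCloud mode with an embedded ZooKeeper on port 9983 (the Solr port plus 1000), so no extra machines or containers are needed for local development.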


Local development and SolrCloud

2018-08-22 Thread John Blythe
For those of you who are developing applications with solr and are using
solrcloud in production: what are you doing locally? Cloud seems
unnecessary locally besides testing strictly for cloud specific use cases
or configurations. Am I totally off basis there? We are considering keeping
a “standard” (read: non-cloud) local solr environment locally for our
development workflow and using cloud only for our remote environments.
Curious to know how wise or stupid that play would be.

Thanks for any info!
-- 
John Blythe


Ignored fields and copyfield

2018-08-06 Thread John Davis
Hi there,
If a field is set as "ignored" (indexed=false, stored=false) can it be used
for another field as part of copyfield directive which might index/store it.

John


Index size by document fields

2018-08-04 Thread John Davis
Hi,
Is there a way to monitor the size of the index broken by individual fields
across documents? I understand there are different parts - the inverted
index and the stored fields - and an estimate would be good start.

Thanks
John


Re: Preferred PHP Client Library

2018-07-17 Thread John Blythe
sorry for my typo earlier. we have enjoyed** using solarium.

there have been some shortcomings in our more complex queries that we've
had to more or less hack around, but generally speaking it's been a breeze.

we are about to undergo testing things out with the solrcloud extension as
we're migrating to cloud, but have also found that it's not entirely
necessary (i can't yet speak to the cost/benefit of using the extension
when we deploy to cloud)

best of luck-

--
John Blythe


On Tue, Jul 17, 2018 at 11:57 AM Christian Spitzlay <
christian.spitz...@biologis.com> wrote:

> Hi,
>
> we're using Solarium, too, via Drupal's search_api / search_api_solr
> modules,
> and I hear there are plans by the Typo3 project to move to Solarium.
>
> So there are at least two major CMS projects that will be relying on it in
> the future.
>
> I don't have any experience with wordpress but a web search
> returned wpsolr (I assume you are aware of this):
> https://www.wpsolr.com/knowledgebase/what-is-your-solr-php-client-library/
> So someone from the wordpress  world seems to be using it successfully,
> too.
>
> I haven't used the standard extension so I cannot comment on the
> differences
> but here is the Solarium project's description in their own words:
> https://solarium.readthedocs.io/en/latest/
>
> Solarium is under active development. They recently added support
> for Solr cloud streaming expressions and for the JSON Facet API
> (the latter is in the beta version).
>
>
> Best regards
> Christian Spitzlay
>
>
> > Am 16.07.2018 um 21:19 schrieb Zimmermann, Thomas <
> tzimmerm...@techtarget.com>:
> >
> > Hi,
> >
> > We're in the midst of our first major Solr upgrade in years and are
> trying to run some cleanup across
> > all of our client codebases. We're currently using the standard PHP Solr
> Extension when communicating
> > with our cluster from our Wordpress installs.
> http://php.net/manual/en/book.solr.php
> >
> > Few questions.
> >
> > Should we have any concerns about communicating with a Solr 7 cloud from
> that client?
> > Is anyone using another client they prefer? If so what are the benefits
> of switching to it?
> >
> > Thanks!
> > TZ
>
>
>
> --
>
> Christian Spitzlay
> Diplom-Physiker,
> Senior Software-Entwickler
>
> E-Mail: christian.spitz...@biologis.com
>
> bio.logis Genetic Information Management GmbH
> Altenhöferallee 3
> 60438 Frankfurt am Main
>
> Geschäftsführung: Prof. Dr. med. Daniela Steinberger, Dipl.Betriebswirt
> Enrico Just
> Firmensitz Frankfurt am Main, Registergericht Frankfurt am Main, HRB 97945
> Umsatzsteuer-Identifikationsnummer DE293587677
>
>
>


Re: Preferred PHP Client Library

2018-07-16 Thread John Blythe
We have envious using Solarium

On Mon, Jul 16, 2018 at 14:19 Zimmermann, Thomas 
wrote:

> Hi,
>
> We're in the midst of our first major Solr upgrade in years and are trying
> to run some cleanup across all of our client codebases. We're currently
> using the standard PHP Solr Extension when communicating with our cluster
> from our Wordpress installs. http://php.net/manual/en/book.solr.php
>
> Few questions.
>
> Should we have any concerns about communicating with a Solr 7 cloud from
> that client?
> Is anyone using another client they prefer? If so what are the benefits of
> switching to it?
>
> Thanks!
> TZ
>
-- 
John Blythe


Re: Sort by payload value

2018-05-25 Thread John Davis
Hi Erik - Solr is tokenizing correctly, as you can see it returns the payload
field value along with the full payload and they match on the particular
field. The field does have a lowercase filter as you can see in the
definition. Changing it to a single-word query doesn't fix it either.

On Fri, May 25, 2018 at 8:22 AM, Erick Erickson 
wrote:

> My first guess (and it's a total guess) is that you either have a case
> problem or
> you're tokenizing the string. Does your field definition lower-case the
> tokens?
> If it's a string type then certainly not.
>
> Quick test would be to try your query with a value that matches case
> and has no spaces,
> maybe "Portals". If that gives you the correct sort then you have a
> place to start
>
> Adding &debug=query will help a bit, although it won't show you the
> guts of the payload
> calcs.
>
> FYI, ties are broken by the internal Lucene doc ID. If the theory that
> you are getting
> no matches, then your sort order is determined by this value which you
> don't really
> have much access to.
>
> Best,
> Erick
>
> On Thu, May 24, 2018 at 7:29 PM, John Davis 
> wrote:
> > Hello,
> >
> > We are trying to use payload values as described in [1] and are running
> > into issues when issuing *sort by* payload value.  Would appreciate any
> > pointers to what we might be doing wrong. We are running solr 6.6.0.
> >
> > * Here's the payload value definition:
> >
> > indexed="true"
> > class="solr.TextField">
> >   
> >> pattern="[A-Za-z0-9][^|]*[|][0-9.]+" group="0"/>
> >> encoder="float"/>
> >   
> >   
> >   
> >
> > * Query with sort by does not return documents sorted by the payload
> value:
> >
> > {
> >   "responseHeader":{
> > "status":0,
> > "QTime":82,
> > "params":{
> >   "q":"*:*",
> >   "indent":"on",
> >   "fl":"industry_value,${indexp}",
> > *  "indexp":"payload(industry_value, 'internet services', 0)",*
> >   "fq":["{!frange l=0.1}${indexp}",
> > "industry_value:*"],
> > *  "sort":"${indexp} asc",*
> >   "rows":"10",
> >   "wt":"json"}},
> >   "response":{"numFound":102668,"start":0,"docs":[
> >   {
> > "industry_value":"Startup|13.3890410959
> Collaboration|12.3863013699
> > Document Management|12.3863013699 Chat|12.3863013699 Video
> > Conferencing|12.3863013699 Finance|1.0 Payments|1.0 Internet|1.0 Internet
> > Services|1.0 Top Companies|1.0",
> >
> > "payload(industry_value, 'internet services', 0)":*1.0*},
> >
> >   {
> > "industry_value":"Hardware|16.7616438356 Messaging and
> > Telecommunications|6.71780821918 Mobility|6.71780821918
> > Startup|6.71780821918 Analytics|6.71780821918 Development
> > Platforms|6.71780821918 Mobile Commerce|6.71780821918 Mobile
> > Security|6.71780821918 Privacy and Security|6.71780821918 Information
> > Security|6.71780821918 Cyber Security|6.71780821918 Finance|6.71780821918
> > Collaboration|6.71780821918 Enterprise|6.71780821918
> > Messaging|6.71780821918 Internet Services|6.71780821918 Information
> > Technology|6.71780821918 Contact Management|6.71780821918
> > Mobile|6.71780821918 Mobile Enterprise|6.71780821918 Data
> > Security|6.71780821918 Data and Analytics|6.71780821918
> > Security|6.71780821918",
> >
> > "payload(industry_value, 'internet services', 0)":*6.7178082*},
> >
> >   {
> > "industry_value":"Startup|4.46301369863
> Advertising|1.24657534247
> > Content and Publishing|0.917808219178 Internet|0.917808219178 Social
> Media
> > Platforms|0.917808219178 Content Discovery|0.917808219178 Media and
> > Entertainment|0.917808219178 Social Media|0.917808219178 Sales and
> > Marketing|0.917808219178 Internet Services|0.917808219178 Advertising
> > Platforms|0.917808219178 Social Media Management|0.917808219178
> > Mobile|0.328767123288 Food and Beverage|0.252054794521 Real
> > Estate|0.252054794521 Consumer Goods|0.252054794521 FMCG|0.252054794521
> > Home Services|0.252054794521 Consumer|0.252054794521
> > Enterprise|0.167123287671",
> >
> > "payload(industry_value, 'internet services', 0)":*0.91780823*},
> >
> > {
> > "industry_value":"Startup|8.55068493151 Media and
> > Entertainment|5.54794520548 Transportation|5.54794520548
> > Ticketing|5.54794520548 Travel|5.54794520548 Travel and
> > Tourism|5.54794520548 Events|5.54794520548 Cloud Computing|2.33698630137
> > Collaboration|2.33698630137 Platforms|2.33698630137
> > Enterprise|2.33698630137 Internet Services|2.33698630137 Top
> > Companies|2.33698630137 Developer Tools|2.33698630137 Operating
> > Systems|2.33698630137 Search|1.83287671233 Internet|1.83287671233
> > Technology|1.83287671233 Portals|1.83287671233 Email|1.83287671233
> > Photography|1.83287671233",
> >
> > "payload(industry_value, 'internet services', 0)":*2.3369863*},
> >
> >
> > [1] https://lucidworks.com/2017/09/14/solr-payloads/
>


Sort by payload value

2018-05-24 Thread John Davis
Hello,

We are trying to use payload values as described in [1] and are running
into issues when issuing *sort by* payload value.  Would appreciate any
pointers to what we might be doing wrong. We are running solr 6.6.0.

* Here's the payload value definition:

   
  
  
  
  
  
  

* Query with sort by does not return documents sorted by the payload value:

{
  "responseHeader":{
"status":0,
"QTime":82,
"params":{
  "q":"*:*",
  "indent":"on",
  "fl":"industry_value,${indexp}",
*  "indexp":"payload(industry_value, 'internet services', 0)",*
  "fq":["{!frange l=0.1}${indexp}",
"industry_value:*"],
*  "sort":"${indexp} asc",*
  "rows":"10",
  "wt":"json"}},
  "response":{"numFound":102668,"start":0,"docs":[
  {
"industry_value":"Startup|13.3890410959 Collaboration|12.3863013699
Document Management|12.3863013699 Chat|12.3863013699 Video
Conferencing|12.3863013699 Finance|1.0 Payments|1.0 Internet|1.0 Internet
Services|1.0 Top Companies|1.0",

"payload(industry_value, 'internet services', 0)":*1.0*},

  {
"industry_value":"Hardware|16.7616438356 Messaging and
Telecommunications|6.71780821918 Mobility|6.71780821918
Startup|6.71780821918 Analytics|6.71780821918 Development
Platforms|6.71780821918 Mobile Commerce|6.71780821918 Mobile
Security|6.71780821918 Privacy and Security|6.71780821918 Information
Security|6.71780821918 Cyber Security|6.71780821918 Finance|6.71780821918
Collaboration|6.71780821918 Enterprise|6.71780821918
Messaging|6.71780821918 Internet Services|6.71780821918 Information
Technology|6.71780821918 Contact Management|6.71780821918
Mobile|6.71780821918 Mobile Enterprise|6.71780821918 Data
Security|6.71780821918 Data and Analytics|6.71780821918
Security|6.71780821918",

"payload(industry_value, 'internet services', 0)":*6.7178082*},

  {
"industry_value":"Startup|4.46301369863 Advertising|1.24657534247
Content and Publishing|0.917808219178 Internet|0.917808219178 Social Media
Platforms|0.917808219178 Content Discovery|0.917808219178 Media and
Entertainment|0.917808219178 Social Media|0.917808219178 Sales and
Marketing|0.917808219178 Internet Services|0.917808219178 Advertising
Platforms|0.917808219178 Social Media Management|0.917808219178
Mobile|0.328767123288 Food and Beverage|0.252054794521 Real
Estate|0.252054794521 Consumer Goods|0.252054794521 FMCG|0.252054794521
Home Services|0.252054794521 Consumer|0.252054794521
Enterprise|0.167123287671",

"payload(industry_value, 'internet services', 0)":*0.91780823*},

{
"industry_value":"Startup|8.55068493151 Media and
Entertainment|5.54794520548 Transportation|5.54794520548
Ticketing|5.54794520548 Travel|5.54794520548 Travel and
Tourism|5.54794520548 Events|5.54794520548 Cloud Computing|2.33698630137
Collaboration|2.33698630137 Platforms|2.33698630137
Enterprise|2.33698630137 Internet Services|2.33698630137 Top
Companies|2.33698630137 Developer Tools|2.33698630137 Operating
Systems|2.33698630137 Search|1.83287671233 Internet|1.83287671233
Technology|1.83287671233 Portals|1.83287671233 Email|1.83287671233
Photography|1.83287671233",

"payload(industry_value, 'internet services', 0)":*2.3369863*},


[1] https://lucidworks.com/2017/09/14/solr-payloads/
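
The field definition quoted above did not survive the list archive. As a rough sketch of the kind of payload-carrying field the article in [1] describes (names follow the stock examples rather than the poster's schema):

    <fieldType name="delimited_payloads_float" stored="false" indexed="true" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
      </analyzer>
    </fieldType>
    <dynamicField name="*_dpf" type="delimited_payloads_float" indexed="true" stored="true"/>

With a document containing vals_dpf = "A|2.9 B|0.4", the payload can be returned and sorted on as a function query:

    fl=id,p:payload(vals_dpf,A)
    sort=payload(vals_dpf,A) desc

Note that payload() looks up a single indexed token, so multi-word keys such as "internet services" only work if the analyzer keeps them as one token.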


Re: replication

2018-04-13 Thread John Blythe
great. thanks, erick!

--
John Blythe

On Wed, Apr 11, 2018 at 12:16 PM, Erick Erickson 
wrote:

> bq: are you simply flagging the fact that we wouldn't direct the queries
> to A
> v. B v. C since SolrCloud will make the decisions itself as to which part
> of the distro gets hit for the operation
>
> Yep. SolrCloud takes care of it all itself. I should also add that there
> are
> about a zillion metrics now available in Solr that you can use to make the
> best use of hardware, including things like CPU usage, I/O, GC etc.
> SolrCloud
> doesn't _yet_ make use of these but will in future. The current software LB
> does a pretty simple round-robin distribution.
>
> Best,
> Erick
>
> On Wed, Apr 11, 2018 at 5:57 AM, John Blythe  wrote:
> > thanks, erick. great info.
> >
> > although you can't (yet) direct queries to one or the other. So just
> making
> >> them all NRT and forgetting about it is reasonable.
> >
> >
> > are you simply flagging the fact that we wouldn't direct the queries to A
> > v. B v. C since SolrCloud will make the decisions itself as to which part
> > of the distro gets hit for the operation? if not, can you expound on
> this a
> > bit more?
> >
> > The very nature of merging is such that you will _always_ get large
> merges
> >> until you have 5G segments (by default)
> >
> >
> > bummer
> >
> > Quite possible, but you have to route things yourself. But in that case
> >> you're limited to one machine to handle all your NRT traffic. I skimmed
> >> your post so don't know whether your NRT traffic load is high enough to
> >> worry about.
> >
> >
> > ok. i think we'll take a two-pronged approach. for the immediate purposes
> > of trying to solve an issue we've begun encountering we will begin
> > thoroughly testing the load between various operations in the master-slave
> > setup we've set up. pending the results, we can roll forward w a
> temporary
> > patch in which all end-user touch points route through the primary box
> for
> > read/write while large scale operations/processing we do in the
> background
> > will point to the ELB the slaves are sitting behind. we'll also begin
> > setting up a simple solrcloud instance to toy with per your suggestion
> > above. inb4 tons more questions on my part :)
> >
> > thanks!
> >
> > --
> > John Blythe
> >
> > On Tue, Apr 10, 2018 at 11:14 AM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> bq: should we try to bite the solrcloud bullet and be done w it
> >>
> >> that's what I'd do. As of 7.0 there are different "flavors", TLOG,
> >> PULL and NRT so that's also a possibility, although you can't (yet)
> >> direct queries to one or the other. So just making them all NRT and
> >> forgetting about it is reasonable.
> >>
> >> bq:  is there some more config work we could put in place to avoid ...
> >> commit issue and the ultra large merge dangers
> >>
> >> No. The very nature of merging is such that you will _always_ get
> >> large merges until you have 5G segments (by default). The max segment
> >> size (outside "optimize/forceMerge/expungeDeletes" which you shouldn't
> >> do) is 5G so the steady-state worst-case segment pull is limited to
> >> that.
> >>
> >> bq: maybe for our initial need we use Master for writing and user
> >> access in NRT events, but slaves for the heavier backend
> >>
> >> Quite possible, but you have to route things yourself. But in that
> >> case you're limited to one machine to handle all your NRT traffic. I
> >> skimmed your post so don't know whether your NRT traffic load is high
> >> enough to worry about.
> >>
> >> The very first thing I'd do is set up a simple SolrCloud setup and
> >> give it a spin. Unless your indexing load is quite heavy, the added
> >> work the NRT replicas have in SolrCloud isn't a problem so worrying
> >> about that is premature optimization unless you have a heavy load.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Apr 9, 2018 at 4:36 PM, John Blythe 
> wrote:
> >> > Thanks a bunch for the thorough reply, Shawn.
> >> >
> >> > Phew. We’d chosen to go w Master-slave replication instead of
> SolrCloud
> >> per
> >> > the sudden need we had encountered and the desire to avoid 
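
For reference, the NRT/TLOG/PULL flavors mentioned above are chosen when the collection is created in Solr 7.x; a sketch of the Collections API call, with placeholder host and names:

    http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&nrtReplicas=1&tlogReplicas=1&pullReplicas=1

Leaving out tlogReplicas and pullReplicas (or just using replicationFactor) gives the all-NRT layout suggested above as the simple default.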

Re: replication

2018-04-11 Thread John Blythe
thanks, erick. great info.

although you can't (yet) direct queries to one or the other. So just making
> them all NRT and forgetting about it is reasonable.


are you simply flagging the fact that we wouldn't direct the queries to A
v. B v. C since SolrCloud will make the decisions itself as to which part
of the distro gets hit for the operation? if not, can you expound on this a
bit more?

The very nature of merging is such that you will _always_ get large merges
> until you have 5G segments (by default)


bummer

Quite possible, but you have to route things yourself. But in that case
> you're limited to one machine to handle all your NRT traffic. I skimmed
> your post so don't know whether your NRT traffic load is high enough to
> worry about.


ok. i think we'll take a two-pronged approach. for the immediate purposes
of trying to solve an issue we've begun encountering we will begin
thoroughly testing the load between various operations in the master-slave
setup we've set up. pending the results, we can roll forward w a temporary
patch in which all end-user touch points route through the primary box for
read/write while large scale operations/processing we do in the background
will point to the ELB the slaves are sitting behind. we'll also begin
setting up a simple solrcloud instance to toy with per your suggestion
above. inb4 tons more questions on my part :)

thanks!

--
John Blythe

On Tue, Apr 10, 2018 at 11:14 AM, Erick Erickson 
wrote:

> bq: should we try to bite the solrcloud bullet and be done w it
>
> that's what I'd do. As of 7.0 there are different "flavors", TLOG,
> PULL and NRT so that's also a possibility, although you can't (yet)
> direct queries to one or the other. So just making them all NRT and
> forgetting about it is reasonable.
>
> bq:  is there some more config work we could put in place to avoid ...
> commit issue and the ultra large merge dangers
>
> No. The very nature of merging is such that you will _always_ get
> large merges until you have 5G segments (by default). The max segment
> size (outside "optimize/forceMerge/expungeDeletes" which you shouldn't
> do) is 5G so the steady-state worst-case segment pull is limited to
> that.
>
> bq: maybe for our initial need we use Master for writing and user
> access in NRT events, but slaves for the heavier backend
>
> Quite possible, but you have to route things yourself. But in that
> case you're limited to one machine to handle all your NRT traffic. I
> skimmed your post so don't know whether your NRT traffic load is high
> enough to worry about.
>
> The very first thing I'd do is set up a simple SolrCloud setup and
> give it a spin. Unless your indexing load is quite heavy, the added
> work the NRT replicas have in SolrCloud isn't a problem so worrying
> about that is premature optimization unless you have a heavy load.
>
> Best,
> Erick
>
> On Mon, Apr 9, 2018 at 4:36 PM, John Blythe  wrote:
> > Thanks a bunch for the thorough reply, Shawn.
> >
> > Phew. We’d chosen to go w Master-slave replication instead of SolrCloud
> per
> > the sudden need we had encountered and the desire to avoid the nuances
> and
> > changes related to moving to SolrCloud. But so much for this being a more
> > straightforward solution, huh?
> >
> > Few questions:
> > - should we try to bite the solrcloud bullet and be done w it?
> > - is there some more config work we could put in place to avoid the soft
> > commit issue and the ultra large merge dangers, keeping the replications
> > happening quickly?
> > - maybe for our initial need we use Master for writing and user access in
> > NRT events, but slaves for the heavier backend processing. Thoughts?
> > - anyone do consulting on this that would be interested in chatting?
> >
> > Thanks again!
> >
> > On Mon, Apr 9, 2018 at 18:18 Shawn Heisey  wrote:
> >
> >> On 4/9/2018 12:15 PM, John Blythe wrote:
> >> > we're starting to dive into master/slave replication architecture.
> we'll
> >> > have 1 master w 4 slaves behind it. our app is NRT. if user performs
> an
> >> > action in section A's data they may choose to jump to section B which
> >> will
> >> > be dependent on having the updates from their action in section A. as
> >> such,
> >> > we're thinking that the replication time should be set to 1-2s (the
> >> chances
> >> > of them arriving at section B quickly enough to catch the 2s gap is
> >> highly
> >> > unlikely at best).
> >>
> >> Once you start talking about master-slave replication, 

Re: replication

2018-04-09 Thread John Blythe
Thanks a bunch for the thorough reply, Shawn.

Phew. We’d chosen to go w Master-slave replication instead of SolrCloud per
the sudden need we had encountered and the desire to avoid the nuances and
changes related to moving to SolrCloud. But so much for this being a more
straightforward solution, huh?

Few questions:
- should we try to bite the solrcloud bullet and be done w it?
- is there some more config work we could put in place to avoid the soft
commit issue and the ultra large merge dangers, keeping the replications
happening quickly?
- maybe for our initial need we use Master for writing and user access in
NRT events, but slaves for the heavier backend processing. Thoughts?
- anyone do consulting on this that would be interested in chatting?

Thanks again!

On Mon, Apr 9, 2018 at 18:18 Shawn Heisey  wrote:

> On 4/9/2018 12:15 PM, John Blythe wrote:
> > we're starting to dive into master/slave replication architecture. we'll
> > have 1 master w 4 slaves behind it. our app is NRT. if user performs an
> > action in section A's data they may choose to jump to section B which
> will
> > be dependent on having the updates from their action in section A. as
> such,
> > we're thinking that the replication time should be set to 1-2s (the
> chances
> > of them arriving at section B quickly enough to catch the 2s gap is
> highly
> > unlikely at best).
>
> Once you start talking about master-slave replication, my assumption is
> that you're not running SolrCloud.  You would NOT want to try and mix
> SolrCloud with replication.  The features do not play well together.
> SolrCloud with NRT replicas (this is the only replica type that exists
> in 6.x and earlier) may be a better option than master-slave replication.
>
> > since the replicas will simply be looking for new files it seems like
> this
> > would be a lightweight operation even every couple seconds for 4
> replicas.
> > that said, i'm going *entirely* off of assumption at this point and
> wanted
> > to check in w you all to see any nuances, gotchas, hidden landmines, etc.
> > that we should be considering before rolling things out.
>
> Most of the time, you'd be correct to think that indexing is going to
> create a new small segment and replication will have little work to do.
> But as you create more and more segments, eventually Lucene is going to
> start merging those segments.  For discussion purposes, I'm going to
> describe a situation where each new segment during indexing is about
> 100KB in size, and the merge policy is left at the default settings.
> I'm also going to assume that no documents are getting deleted or
> reindexed (which will delete the old version).  Deleted documents can
> have an impact on merging, but it will usually only be a dramatic impact
> if there are a LOT of deleted documents.
>
> The first ten segments created will be this 100KB size.  Then Lucene is
> going to see that there are enough segments to trigger the merge policy
> - it's going to combine ten of those segments into one that's
> approximately one megabyte.  Repeat this ten times, and ten of those 1
> megabyte segments will be combined into one ten megabyte segment.
> Repeat all of THAT ten times, and there will be a 100 megabyte segment.
> And there will eventually be another level creating 1 gigabyte
> segments.  If the index is below 5GB in size, the entire thing *could*
> be merged into one segment by this process.
>
> The end result of all this:  Replication is not always going to be
> super-quick.  If merging creates a 1 gigabyte segment, then the amount
> of time to transfer that new segment is going to depend on how fast your
> disks are, and how fast your network is.  If you're using commodity SATA
> drives in the 4 to 10 terabyte range and a gigabit network, the network
> is probably going to be the bottleneck -- assuming that the system has
> plenty of memory and isn't under a high load.  If the network is the
> bottleneck in that situation, it's probably going to take close to ten
> seconds to transfer a 1GB segment, and the greater part of a minute to
> transfer a 5GB segment, which is the biggest one that the default merge
> policy configuration will create without an optimize operation.
>
> Also, you should understand something that has come to my attention
> recently (and is backed up by documentation):  If the master does a soft
> commit and the segment that was committed remains in memory (not flushed
> to disk), that segment will NOT be replicated to the slaves.  It has to
> get flushed to disk before it can be replicated.
>
> Thanks,
> Shawn
>
> --
John Blythe


replication

2018-04-09 Thread John Blythe
hi, all.

we're starting to dive into master/slave replication architecture. we'll
have 1 master w 4 slaves behind it. our app is NRT. if user performs an
action in section A's data they may choose to jump to section B which will
be dependent on having the updates from their action in section A. as such,
we're thinking that the replication time should be set to 1-2s (the chances
of them arriving at section B quickly enough to catch the 2s gap is highly
unlikely at best).

since the replicas will simply be looking for new files it seems like this
would be a lightweight operation even every couple seconds for 4 replicas.
that said, i'm going *entirely* off of assumption at this point and wanted
to check in w you all to see any nuances, gotchas, hidden landmines, etc.
that we should be considering before rolling things out.

thanks for any info!

--
John Blythe
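
For context, the 1-2s interval described above is set on each slave through the ReplicationHandler's pollInterval; a minimal sketch, with placeholder host and core names:

    <!-- master solrconfig.xml -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
      </lst>
    </requestHandler>

    <!-- slave solrconfig.xml -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/corename</str>
        <str name="pollInterval">00:00:02</str>
      </lst>
    </requestHandler>

pollInterval is HH:MM:SS, and each poll only transfers index files that have been flushed to disk on the master, which is why the segment-merge and soft-commit caveats discussed in the replies matter.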


Re: statistics in hitlist

2018-03-16 Thread John Smith
Thanks for the link to the documentation, that will probably come in useful.

I didn't see a way though, to get my avg function working? So instead of
doing a linear regression on two fields, X and Y, in a hitlist, we need to
do a linear regression on field X, and the average value of X. Is that
possible? To pass in a function to the regress function instead of a field?





On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein  wrote:

> I've been working on the user guide for the math expressions. Here is the
> page on regression:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_
> documentation/solr/solr-ref-guide/src/regression.adoc
>
> This page is part of the larger math expression documentation. The TOC is
> here:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_
> documentation/solr/solr-ref-guide/src/math-expressions.adoc
>
> The docs are still very rough but you can get an idea of the coverage.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein 
> wrote:
>
> > If you want to get everything in query you can do this:
> >
> > let(echo="d,e",
> >  a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO
> > *]",
> > fq="isParent:true", rows="150",
> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> sort="id
> > asc"),
> >  b=col(a, oil_first_90_days_production),
> >  c=col(a, oil_last_30_days_production),
> >  d=regress(b, c),
> >  e=someExpression())
> >
> > The echo parameter tells the let expression which variables to output.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson  >
> > wrote:
> >
> >> What does the fq clause look like?
> >>
> >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith 
> >> wrote:
> >> > Hi Joel, I did some more work on this statistics stuff today. Yes, we
> do
> >> > have nulls in our data; the document contains many fields, we don't
> >> always
> >> > have values for each field, but we can't set the nulls to 0 either (or
> >> any
> >> > other value, really) as that will mess up other calculations (such as
> >> when
> >> > calculating average etc); we would normally just ignore fields with
> null
> >> > values when calculating stats manually ourselves.
> >> >
> >> > Adding a check in the "q" parameter to ensure that the fields used in
> >> the
> >> > calculations are > 0 does work now. Thanks for the tip (and sorry,
> >> should
> >> > have caught that myself). But I am unable to use "fq" for these
> checks,
> >> > they have to be added to the q instead. Adding fq's doesn't have any
> >> effect.
> >> >
> >> >
> >> > Anyway, I'm trying to change this up a little. This is what I'm
> >> currently
> >> > using (switched from "random" to "search" since I actually need the
> full
> >> > hitlist not just a random subset):
> >> >
> >> > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1
> TO
> >> *]",
> >> > fq="isParent:true", rows="150",
> >> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >> sort="id
> >> > asc"),
> >> >  b=col(a, oil_first_90_days_production),
> >> >  c=col(a, oil_last_30_days_production),
> >> >  d=regress(b, c))
> >> >
> >> > So I have 2 fields there defined, that works great (in terms of a test
> >> and
> >> > running the query); but I need to replace the second field,
> >> > "oil_last_30_days_production" with the avg value in
> >> > oil_first_90_days_production.
> >> >
> >> > I can get the avg with this expression:
> >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >> > fq="isParent:true", rows="150", avg(oil_first_90_days_
> production))
> >> >
> >> > But I don't know how to push that avg value into the first streaming
> >> > expression; guessing I have to set "c=" but that is where I'm
> >> getting
> >> > los

Re: statistics in hitlist

2018-03-15 Thread John Smith
Hi Joel, I did some more work on this statistics stuff today. Yes, we do
have nulls in our data; the document contains many fields, we don't always
have values for each field, but we can't set the nulls to 0 either (or any
other value, really) as that will mess up other calculations (such as when
calculating average etc); we would normally just ignore fields with null
values when calculating stats manually ourselves.

Adding a check in the "q" parameter to ensure that the fields used in the
calculations are > 0 does work now. Thanks for the tip (and sorry, should
have caught that myself). But I am unable to use "fq" for these checks,
they have to be added to the q instead. Adding fq's doesn't have any effect.


Anyway, I'm trying to change this up a little. This is what I'm currently
using (switched from "random" to "search" since I actually need the full
hitlist not just a random subset):

let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
fq="isParent:true", rows="150",
fl="id,oil_first_90_days_production,oil_last_30_days_production", sort="id
asc"),
 b=col(a, oil_first_90_days_production),
 c=col(a, oil_last_30_days_production),
 d=regress(b, c))

So I have 2 fields there defined, that works great (in terms of a test and
running the query); but I need to replace the second field,
"oil_last_30_days_production" with the avg value in
oil_first_90_days_production.

I can get the avg with this expression:
stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
fq="isParent:true", rows="150", avg(oil_first_90_days_production))

But I don't know how to push that avg value into the first streaming
expression; guessing I have to set "c=" but that is where I'm getting
lost, since avg only returns 1 value and the first parameter, "b", returns
a list of sorts. Somehow I have to get the avg value stuffed inside a
"col", where it is the same value for every row in the hitlist...?

Thanks for your help!


On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein  wrote:

> I suspect you've got nulls in your data. I just tested with null values and
> got the same error. For testing purposes try loading the data with default
> values of zero.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein 
> wrote:
>
> > Let's break the expression down and build it up slowly. Let's start with:
> >
> > let(echo="true",
> >  a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> > fl="oil_first_90_days_production,oil_last_30_days_production"),
> >  b=col(a, oil_first_90_days_production))
> >
> >
> > This should return variables a and b. Let's see what the data looks like.
> > I changed the rows from 15 to 15000. If it all looks good we can expand
> the
> > rows and continue adding functions.
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith  wrote:
> >
> >> Thanks Joel for your help on this.
> >>
> >> What I've done so far:
> >> - unzip downloaded solr-7.2
> >> - modify the _default "managed-schema" to add the random field type and
> >> the dynamic random field
> >> - start solr7 using "solr start -c"
> >> - indexed my data using pint/pdouble/boolean field types etc
> >>
> >> I can now run the random function all by itself, it returns random
> >> results as expected. So far so good!
> >>
> >> However... now trying to get the regression stuff working:
> >>
> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> rows="15000", fl="oil_first_90_days_producti
> >> on,oil_last_30_days_production"),
> >> b=col(a, oil_first_90_days_production),
> >> c=col(a, oil_last_30_days_production),
> >> d=regress(b, c))
> >>
> >> Posted directly into solr admin UI. Run the streaming expression and I
> >> get this error message:
> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> >> expected but found type java.lang.String for value
> >> oil_first_90_days_production"
> >>
> >> It thinks my numeric field is defined as a string? But when I view the
> >> schema, those 2 fields are defined as ints:
> >>
> >>
> >> When I

Re: Solr Developer needed urgently

2018-03-15 Thread John Bickerstaff
Hi - thanks for thinking of me!

I'm currently lead on the Solr team for Ancestry - and having a good time.
I might be interested, but moving to New York isn't going to work for me.
If there is a good chance of working from home, then I might be
interested...  Let me know...

On Wed, Mar 14, 2018 at 4:14 PM,  wrote:

>
>
> Hi,
>
> I am Asma Talib from UTG Tech. We are looking for a strong Solr Developer
> with good concepts and skills.
>
>
>
> Experience: 8 + Years
>
> Location: New York
>
>
>
> This position is for a Senior Search Developer, using Apache Solr for
> Analytics
>
> Following are the Responsibilities of candidate:
>
> · Implement automated techniques and processes for the bulk and
> real time indexing in Apache Solr of large-scale data sets residing in
> database, Hadoop, flat files and other sources.
>
> · Ability to design and implement Solr builds for generating indexes
> against structured and semi-structured data.
>
> · Ability to design, create and manage shards and indexes for
> Solr Cloud.
>
> · Ability to write efficient search queries against Solr indexes
> using Solr REST/Java API.
>
> · Prototype and demonstrate new ideas of feasibility and
> leveraging for Solr Capabilities.
>
> · Ability to understand distributed technologies like Hadoop,
> Teradata, SQL server and ETLs.
>
> · Contribute to all aspects of application development including
> architecture, design, development and support.
>
> · Take ownership of application components and ensure timely
> delivery of the same
>
> · Ability to integrate research and best practices into project
> solution for continuous improvement.
>
> · Troubleshoot Solr indexing process and querying engine.
>
> · Identify, clarify, resolve issues and risks, escalating them as
> needed
>
> · Ability to Provide technical guidance and mentoring to other
> junior team members if needed
>
> · Ability to write automated unit test cases for solr search
> engine.
>
>
>
>
>
>
> *UTG  Tech- Corp*
>
> *Asma Talib*
>
> *Talent Acquisition Manager*
>
> *Website :www.utgtech.com *
>
> *Email :* asmata...@utgtech.com
>
> *Phone:* (571) 9325184
>
> *LinkedIn: *www.linkedin.com/in/asmatalib/
>
>
>
>
>
>
>


Re: HDInsight with Solr 4.9.0 Create Collection

2018-03-09 Thread john spooner

would be nice to not get this email.


On 3/9/2018 1:23 PM, Abhi Basu wrote:

This has been resolved!

Turned out to be schema and config file version diff between 4.10 and 4.9.

Thanks,

Abhi

On Fri, Mar 9, 2018 at 11:41 AM, Abhi Basu <9000r...@gmail.com> wrote:


That was due to a folder not being present. Is this something to do with
version?

http://hn0-esohad.mzwz3dh4pb1evcdwc1lcsddrbe.jx.
internal.cloudapp.net:8983/solr/admin/collections?action=
CREATE&name=ems-collection2&numShards=2&replicationFactor=
2&maxShardsPerNode=1


org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error
CREATEing SolrCore 'ems-collection2_shard2_replica2': Unable to create
core: ems-collection2_shard2_replica2 Caused by: No enum constant
org.apache.lucene.util.Version.4.10.3

On Fri, Mar 9, 2018 at 11:11 AM, Abhi Basu <9000r...@gmail.com> wrote:


Ok, so I tried the following:

/usr/hdp/current/solr/example/scripts/cloud-scripts/zkcli.sh -cmd
upconfig -zkhost zk0-esohad.mzwz3dh4pb1evcdwc1l
csddrbe.jx.internal.cloudapp.net:2181 -confdir
/home/sshuser/abhi/ems-collection/conf -confname ems-collection

And got this exception:
java.lang.IllegalArgumentException: Illegal directory:
/home/sshuser/abhi/ems-collection/conf


On Fri, Mar 9, 2018 at 10:43 AM, Abhi Basu <9000r...@gmail.com> wrote:


Thanks for the reply, this really helped me.

For Solr 4.9, what is the actual zkcli command to upload config?

java -classpath example/solr-webapp/WEB-INF/lib/*
  org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost 127.0.0.1:9983
  -confdir example/solr/collection1/conf -confname conf1 -solrhome
example/solr

OR

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 -cmd
upconfig -confname my_new_config -confdir server/solr/configsets/basic_c
onfigs/conf

I don't know why HDP/HDInsight does not provide something like solrctl
commands to make life easier for all!




On Thu, Mar 8, 2018 at 5:43 PM, Shawn Heisey 
wrote:


On 3/8/2018 1:26 PM, Abhi Basu wrote:

I'm in a bind. Added Solr 4.9.0 to HDInsight cluster and find no

Solrctl

commands installed. So, I am doing the following to create a

collection.

This 'solrctl' command is NOT part of Solr.  Google tells me it's part
of software from Cloudera.

You need to talk to Cloudera for support on that software.


I have my collection schema in a location:

/home/sshuser/abhi/ems-collection/conf

Using this command to create a collection:

http://headnode1:8983/solr/admin/cores?action=CREATE&name=em

s-collection&instanceDir=/home/sshuser/abhi/ems-collection/conf



/

You're using the term "collection".  And later you mention ZooKeeper. So
you're almost certainly running in SolrCloud mode.  If your Solr is
running in SolrCloud mode, do not try to use the CoreAdmin API
(/solr/admin/cores).  Use the Collections API instead.  But before that,
you need to get the configuration into ZooKeeper.  For standard Solr
without Cloudera's tools, you would typically use the "zkcli" script
(either zkcli.sh or zkcli.bat).  See page 376 of the reference guide for
that specific version of Solr for help with the "upconfig" command for
that script:

http://archive.apache.org/dist/lucene/solr/ref-guide/apache-
solr-ref-guide-4.9.pdf


I guess i need to register my config name with Zk. How do I register

the

collection schema with Zookeeper?

Is there way to bypass the registration with zk and build the

collection

directly from my schema files at that folder location, like I was

able to

do in Solr 4.10 in CDH 5.14:

solrctl --zk hadoop-dn6.eso.local:2181/solr instancedir --create
ems-collection /home/sshuser/abhi/ems-collection/

solrctl --zk hadoop-dn6.eso.local:2181/solr collection --create
ems-collection -s 3 -r 2

The solrctl command is not something we can help you with on this
mailing list.  Cloudera customizes Solr to the point where only they are
able to really provide support for their version.  Your best bet will be
to talk to Cloudera.

When Solr is running with ZooKeeper, it's in SolrCloud mode.  In
SolrCloud mode, you cannot create cores in the same way that you can in
standalone mode -- you MUST create collections, and all configuration
will be in zookeeper, not on the disk.

Thanks,
Shawn




--
Abhi Basu




--
Abhi Basu




--
Abhi Basu
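
For anyone hitting the same "No enum constant org.apache.lucene.util.Version.4.10.3" error: the config uploaded to ZooKeeper has to declare a luceneMatchVersion the running release understands, and the collection is then created against that uploaded config name. A sketch with placeholder hosts and paths:

    <!-- solrconfig.xml on a Solr 4.9 cluster -->
    <luceneMatchVersion>4.9</luceneMatchVersion>

    ./example/scripts/cloud-scripts/zkcli.sh -zkhost zkhost:2181 -cmd upconfig \
        -confdir /path/to/ems-collection/conf -confname ems-collection

    http://headnode1:8983/solr/admin/collections?action=CREATE&name=ems-collection&numShards=2&replicationFactor=2&collection.configName=ems-collection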








Re: CDCR performance issues

2018-03-09 Thread john spooner

please unsubscribe i tried to manually unsubscribe


On 3/9/2018 12:59 PM, Tom Peters wrote:

Thanks. This was helpful. I did some tcpdumps and I'm noticing that the 
requests to the target data center are not batched in any way. Each update 
comes in as an independent update. Some follow-up questions:

1. Is it accurate that updates are not actually batched in transit from the 
source to the target and instead each document is posted separately?

2. Are they done synchronously? I assume yes (since you wouldn't want 
operations applied out of order)

3. If they are done synchronously, and are not batched in any way, does that 
mean that the best performance I can expect would be roughly how long it takes 
to round-trip a single document? ie. If my average ping is 25ms, then I can 
expect a peak performance of roughly 40 ops/s.

Thanks




On Mar 9, 2018, at 11:21 AM, Davis, Daniel (NIH/NLM) [C]  
wrote:

These are general guidelines, I've done loads of networking, but may be less 
familiar with SolrCloud  and CDCR architecture.  However, I know it's all TCP 
sockets, so general guidelines do apply.

Check the round-trip time between the data centers using ping or TCP ping.   
Throughput tests may be high, but if Solr has to wait for a response to a 
request before sending the next action, then just like any network protocol 
that does that, it will get slow.

I'm pretty sure CDCR uses HTTP/HTTPS rather than just TCP, so also check 
whether some proxy/load balancer between data centers is causing it to be a 
single connection per operation.   That will *kill* performance.   Some proxies 
default to HTTP/1.0 (open, send request, server send response, close), and that 
will hurt.

Why you should listen to me even without SolrCloud knowledge - check out the paper 
"Latency performance of SOAP Implementations".   Same distribution of skills - 
I knew TCP well, but Apache Axis 1.1 not so well.   I still improved response time of 
Apache Axis 1.1 by 250ms per call with 1-line of code.

-Original Message-
From: Tom Peters [mailto:tpet...@synacor.com]
Sent: Wednesday, March 7, 2018 6:19 PM
To: solr-user@lucene.apache.org
Subject: CDCR performance issues

I'm having issues with the target collection staying up-to-date with indexing 
from the source collection using CDCR.

This is what I'm getting back in terms of OPS:

curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
{
  "responseHeader": {
"status": 0,
"QTime": 0
  },
  "operationsPerSecond": [
"zook01,zook02,zook03/solr",
[
  "mycollection",
  [
"all",
49.10140553500938,
"adds",
10.27612635309587,
"deletes",
38.82527896994054
  ]
]
  ]
}

The source and target collections are in separate data centers.

Doing a network test between the leader node in the source data center and the 
ZooKeeper nodes in the target data center show decent enough network 
performance: ~181 Mbit/s

I've tried playing around with the "batchSize" value (128, 512, 728, 1000, 
2000, 2500) and they haven't made much of a difference.

Any suggestions on potential settings to tune to improve the performance?

Thanks

--

Here's some relevant log lines from the source data center's leader:

2018-03-07 23:16:11.984 INFO  
(cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
2018-03-07 23:16:23.062 INFO  
(cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 510 updates to target mycollection
2018-03-07 23:16:32.063 INFO  
(cdcr-replicator-207-thread-5-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
2018-03-07 23:16:36.209 INFO  
(cdcr-replicator-207-thread-1-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
2018-03-07 23:16:42.091 INFO  
(cdcr-replicator-207-thread-2-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
2018-03-07 23:16:46.790 INFO  
(cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
x:myco
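
For reference, the batchSize being tuned above lives in the CDCR handler on the source cluster, next to the schedule and thread pool that control how often updates are forwarded; a sketch with the thread's ZooKeeper hosts and a placeholder collection name:

    <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
      <lst name="replica">
        <str name="zkHost">zook01,zook02,zook03/solr</str>
        <str name="source">mycollection</str>
        <str name="target">mycollection</str>
      </lst>
      <lst name="replicator">
        <str name="threadPoolSize">2</str>
        <str name="schedule">1000</str>
        <str name="batchSize">512</str>
      </lst>
    </requestHandler>

schedule is in milliseconds, and threadPoolSize and batchSize bound how much each forwarding cycle can push to the target.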

Re: statistics in hitlist

2018-03-05 Thread John Smith
Thanks Joel for your help on this.

What I've done so far:
- unzip downloaded solr-7.2
- modify the _default "managed-schema" to add the random field type and the
dynamic random field
- start solr7 using "solr start -c"
- indexed my data using pint/pdouble/boolean field types etc

I can now run the random function all by itself, it returns random results
as expected. So far so good!

However... now trying to get the regression stuff working:

let(a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15000",
fl="oil_first_90_days_production,oil_last_30_days_production"),
b=col(a, oil_first_90_days_production),
c=col(a, oil_last_30_days_production),
d=regress(b, c))

Posted directly into solr admin UI. Run the streaming expression and I get
this error message:
"EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
expected but found type java.lang.String for value
oil_first_90_days_production"

It thinks my numeric field is defined as a string? But when I view the
schema, those 2 fields are defined as ints:


When I run a normal query and choose xml as output format, then it also
puts "int" elements into the hitlist, so the schema appears to be correct
it's just when using this regress function that something goes wrong and
solr thinks the field is string.

Any suggestions?
Thanks!


On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein  wrote:

> The field type will also need to be in the schema:
>
>  
>
> 
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein  wrote:
>
> > You'll need to have this field in your schema:
> >
> > 
> >
> > I'll check to see if the default schema used with solr start -c has this
> > field, if not I'll add it. Thanks for pointing this out.
> >
> > I checked and right now the random expression is only accepting one fq,
> > but I consider this a bug. It should accept multiple. I'll create ticket
> > for getting this fixed.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith  wrote:
> >
> >> Joel, thanks for the pointers to the streaming feature. I had no idea
> solr
> >> had that (and also just discovered the very interesting sql feature! I
> will
> >> be sure to investigate that in more detail in the future).
> >>
> >> However I'm having some trouble getting basic streaming functions
> working.
> >> I've already figured out that I had to move to "solr cloud" instead of
> >> "solr standalone" because I was getting errors about "cannot find zk
> >> instance" or whatever which went away when using "solr start -c"
> instead.
> >>
> >> But now I'm trying to use the random function since that was one of the
> >> functions used in your example.
> >>
> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >>
> >> I posted that directly in the "stream" section of the solr admin UI.
> This
> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in
> case
> >> it was a bug in one)
> >>
> >> I get back an error message:
> >> *sort param could not be parsed as a query, and is not a field that
> exists
> >> in the index: random_-255009774*
> >>
> >> I'm not passing in any sort field anywhere. But the solr logs show these
> >> three log entries:
> >>
> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
> >> QTime=19
> >>
> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> >> Request to collection [tx_header] failed due to (400)
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> Error
> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
> could
> >> not be parsed as a query, and is not a field that exists in the index:
> >> random_-255009774, retry? 0
> >>
> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_head
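
The schema snippets quoted above were stripped by the list archive; the definitions that random() relies on are roughly the stock ones shipped in Solr's sample configsets:

    <fieldType name="random" class="solr.RandomSortField" indexed="true"/>
    <dynamicField name="random_*" type="random"/>

random() works by sorting on a generated random_<seed> field name, which is why the error above complains that random_-255009774 is not a field in the index: without the random_* dynamic field backed by RandomSortField, the injected sort cannot be parsed.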

Re: statistics in hitlist

2018-03-01 Thread John Smith
Joel, thanks for the pointers to the streaming feature. I had no idea solr
had that (and also just discovered the very interesting sql feature! I will
be sure to investigate that in more detail in the future).

However I'm having some trouble getting basic streaming functions working.
I've already figured out that I had to move to "solr cloud" instead of
"solr standalone" because I was getting errors about "cannot find zk
instance" or whatever which went away when using "solr start -c" instead.

But now I'm trying to use the random function since that was one of the
functions used in your example.

random(tx_header, q="*:*", rows="100", fl="countyname")

I posted that directly in the "stream" section of the solr admin UI. This
is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in case
it was a bug in one)

I get back an error message:
*sort param could not be parsed as a query, and is not a field that exists
in the index: random_-255009774*

I'm not passing in any sort field anywhere. But the solr logs show these
three log entries:

2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
[tx_header_shard1_replica_n1]  webapp=/solr path=/select
params={q=*:*&_stateVer_=tx_header:6&fl=countyname
*&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
QTime=19

2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
Request to collection [tx_header] failed due to (400)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.13.31:8983/solr/tx_header: sort param could
not be parsed as a query, and is not a field that exists in the index:
random_-255009774, retry? 0

2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
java.io.IOException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.13.31:8983/solr/tx_header: sort param could
not be parsed as a query, and is not a field that exists in the index:
random_-255009774


So basically it looks like solr is injecting the "sort=random_" stuff into
my query and of course that is failing on the search since that
field/column doesn't exist in my schema. Every time I run the random
function, I get a slightly different field name that it injects, but they
all start with "random_" etc.

I have tried adding my own sort field instead, hoping solr wouldn't inject
one for me, but it still injected a random sort fieldname:
random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname
asc")


Assuming I can fix that whole problem, my second question is: can I add
multiple "fq=" parameters to the random function? I build a pretty
complicated query using many fq= fields, and then want to run some stats on
that hitlist; so somehow I have to pass in the query that made up the exact
hitlist to these various functions, but when I used multiple "fq=" values
it only seemed to use the last one I specified and just ignored all the
previous fq's?

Thanks in advance for any comments/suggestions...!




On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein  wrote:

> This is going to be a complex answer because Solr actually now has multiple
> ways of doing regression analysis as part of the Streaming Expression
> statistical programming library. The basic documentation is here:
>
> https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
>
> Here is a sample expression that performs a simple linear regression in
> Solr 7.2:
>
> let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
> fieldB"),
> b=col(a, fieldA),
> c=col(a, fieldB),
> d=regress(b, c))
>
>
> The expression above takes a random sample of 15000 results from
> collection1. The result set will include fieldA and fieldB in each record.
> The result set is stored in variable "a".
>
> Then the "col" function creates arrays of numbers from the results stored
> in variable a. The values in fieldA are stored in the variable "b". The
> values in fieldB are stored in variable "c".
>
> Then the regress function performs a simple linear regression on arrays
> stored in variables "b" and "c".
>
> The output of the regress function is a map containing the regression
> result. This result includes RSquared and other attributes of the
> regression model such as R (correlation), slope, y intercept etc...
>
>
>
>
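
A note for readers trying this outside the admin UI's Stream panel: the same
expressions can be posted straight to the collection's /stream handler (which
requires the SolrCloud mode mentioned above). A minimal sketch against the
tx_header collection used in this thread, with an illustrative host:

   curl --data-urlencode 'expr=random(tx_header, q="*:*", rows="100", fl="countyname")' \
     http://localhost:8983/solr/tx_header/stream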

Re: statistics in hitlist

2018-02-23 Thread John Smith
Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
result of all this is supposed to be obtaining R^2. Is there no way of
obtaining this value, then (short of iterating over all the results in the
hitlist and calculating it myself)?

On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein  wrote:

> Typically SSE is the sum of the squared errors of the prediction in a
> regression analysis. The stats component doesn't perform regression,
> although it might be a nice feature.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Feb 23, 2018 at 12:17 PM, John Smith  wrote:
>
> > I'm using solr, and enabling stats as per this page:
> > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> >
> > I want to get more stat values though. Specifically I'm looking for
> > r-squared (coefficient of determination). This value is not present in
> > solr, however some of the pieces used to calculate r^2 are in the stats
> > element, for example:
> >
> > min:          0.0
> > max:          10.0
> > count:        15
> > missing:      17
> > sum:          85.0
> > sumOfSquares: 603.0
> > mean:         5.667
> > stddev:       2.943920288775949
> >
> >
> > So I have the sumOfSquares available (SST), and using this calculation, I
> > can get R^2:
> >
> > R^2 = 1 - SSE/SST
> >
> > All I need then is SSE. Is there any way I can get SSE from those other
> > stats in solr?
> >
> > Thanks in advance!
> >
>


statistics in hitlist

2018-02-23 Thread John Smith
I'm using solr, and enabling stats as per this page:
https://lucene.apache.org/solr/guide/6_6/the-stats-component.html

I want to get more stat values though. Specifically I'm looking for
r-squared (coefficient of determination). This value is not present in
solr, however some of the pieces used to calculate r^2 are in the stats
element, for example:

min:          0.0
max:          10.0
count:        15
missing:      17
sum:          85.0
sumOfSquares: 603.0
mean:         5.667
stddev:       2.943920288775949


So I have the sumOfSquares available (SST), and using this calculation, I
can get R^2:

R^2 = 1 - SSE/SST

All I need then is SSE. Is there any way I can get SSE from those other
stats in solr?

Thanks in advance!
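
A note for anyone reusing the numbers above: Solr's sumOfSquares is the raw
sum of x^2, not the sum of squares about the mean, so SST has to be derived
from the reported stats:

   SST = sumOfSquares - sum^2/count = 603.0 - 85.0^2/15 ≈ 121.33
   (cross-check: stddev^2 * (count - 1) = 2.9439^2 * 14 ≈ 121.33)

SSE, on the other hand, depends on a fitted regression line, so it cannot be
read off the single-field stats output at all; the regress() streaming
expression covered in the replies above returns RSquared directly.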


Re: SolrCloud: How best to do backups?

2018-02-08 Thread John Bickerstaff
This article may be of some use...

What isn't clear is what effect either of the two strategies mentioned
would have on serving responses to queries...  It would be nice if the
backup was a "low priority thread" compared to the needs of the server in
question, but I've never had to dig that deep before...

https://n2ws.com/how-to-guides/automate-amazon-ec2-instance-backup.html

On Thu, Feb 8, 2018 at 2:00 PM, John Bickerstaff 
wrote:

> Hmmm...
>
> Can you (fairly quickly) reproduce this AWS environment (including the
> indexes)?  Or does it require that several week process to provision new
> Solr boxes...?
>
> What happens now if one of those ec2 instances gets into trouble?  Do you
> have autoscaling groups set up?
>
> On Thu, Feb 8, 2018 at 1:44 PM, Kelly, Frank  wrote:
>
>> We have a large SolrCloud deployment on AWS (350m documents spread across
>> 3 collections, each with 3 shards and 3 replicas)
>> Running on 3 x r3.xlarge’s with the data stored on EBS drives with
>> Provisioned IOPS
>>
>> Currently it’s handling 38m requests per day
>>
>> My question is how best should we back-up the search index?
>> Is there some way to snapshot a backup while Solr remains online that
>> doesn’t horribly affect performance?
>>
>> Right now in the event of a catastrophic failure it would take several
>> weeks to reindex the data again based on the process we have now (which is
>> outdated)
>>
>> -Frank
>>
>>
>>
>>
>> *Frank Kelly*
>>
>> *Principal Software Engineer*
>>
>> AAA Identity Profile Team (SCBE / CDA)
>>
>>
>> HERE
>>
>> 5 Wayside Rd, Burlington, MA 01803, USA
>> <https://maps.google.com/?q=5+Wayside+Rd,+Burlington,+MA+01803,+USA&entry=gmail&source=g>
>>
>> *42° 29' 7" N 71° 11' 32" W*
>>
>>
>>
>
>
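
For reference, SolrCloud has had an online backup call in the Collections API
since 6.1, which addresses the "snapshot while Solr remains online" part; a
sketch with illustrative names and paths:

   curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=nightly1&collection=mycollection&location=/mnt/shared/solr-backups'

The location must be a directory (for example a shared filesystem mount)
visible to every node hosting the collection, and the matching action=RESTORE
call rebuilds the collection from that backup. The backup still reads the full
index off disk, so scheduling it during a quiet period is advisable.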


Re: SolrCloud: How best to do backups?

2018-02-08 Thread John Bickerstaff
Hmmm...

Can you (fairly quickly) reproduce this AWS environment (including the
indexes)?  Or does it require that several week process to provision new
Solr boxes...?

What happens now if one of those ec2 instances gets into trouble?  Do you
have autoscaling groups set up?

On Thu, Feb 8, 2018 at 1:44 PM, Kelly, Frank  wrote:

> We have a large SolrCloud deployment on AWS (350m documents spread across
> 3 collections, each with 3 shards and 3 replicas)
> Running on 3 x r3.xlarge’s with the data stored on EBS drives with
> Provisioned IOPS
>
> Currently it’s handling 38m requests per day
>
> My question is how best should we back-up the search index?
> Is there some way to snapshot a backup while Solr remains online that
> doesn’t horribly affect performance?
>
> Right now in the event of a catastrophic failure it would take several
> weeks to reindex the data again based on the process we have now (which is
> outdated)
>
> -Frank
>
>
>
>
> *Frank Kelly*
>
> *Principal Software Engineer*
>
> AAA Identity Profile Team (SCBE / CDA)
>
>
> HERE
>
> 5 Wayside Rd, Burlington, MA 01803, USA
> 
>
> *42° 29' 7" N 71° 11' 32" W*
>
>
>


Solr needs a restart to recover from "No space left on device"

2018-02-06 Thread John Davis
Hi there!

We ran out of disk on our solr instance. However, even after cleaning up the
disk, the solr server did not realize that there was free disk available. It
only got fixed after a restart.

Is this a known issue? Or are there workarounds that don't require a
restart?

Thanks
John


Matching within list fields

2018-01-29 Thread John Davis
Hi there!

We have a use case where we'd like to search within a list field; however,
the search should not match across different elements in the list field --
all terms should match within a single element of the list.

For eg if the field is a list of comments on a product, search should be
able to find a comment that matches all the terms.

Short of creating separate documents for each element in the list, is there
any other efficient way of accomplishing this?

Thanks
John
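
One common workaround, assuming the comments sit in a multiValued text field:
give the field type a large positionIncrementGap and query with a proximity
slop smaller than that gap, so all terms have to land inside one value. A
sketch (field and type names are illustrative):

   <fieldType name="text_gap" class="solr.TextField" positionIncrementGap="1000">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>
   <field name="comments" type="text_gap" multiValued="true" indexed="true" stored="true"/>

   q=comments:"fast shipping great price"~900

This only approximates per-element matching; for strict "all terms in the
same element" semantics, indexing each element as a child document and using
block join remains the more precise option.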


Re: Bitnami, or other Solr on AWS recommendations?

2018-01-26 Thread John Bickerstaff
I guess I'd say test with the image - especially if you're deploying a
larger number of Solr boxes.  We do a lot of them where I work and
(unfortunately, for reasons I won't bother you with) can't use an image.
The time it takes to install solr is noticeable when we deploy Solr on our
100 plus EC2 instances.

Of course, if you need to customize Solr or the Solr Server in any way, you
can make your own by hand and then build an image from that for use as a
base.

On Fri, Jan 26, 2018 at 12:24 PM, TK Solr  wrote:

> If I want to deploy Solr on AWS, do people recommend using the prepackaged
> Bitnami Solr image? Or is it better to install Solr manually on a computer
> instance? Or are there a better way?
>
> TK
>
>
>


Re: Profanity

2018-01-08 Thread John Blythe
Gladly. Good luck!

On Mon, Jan 8, 2018 at 8:27 PM Sadiki Latty  wrote:

> Thanks for the feedback John,
>
> This is a genius idea if I don’t want to create my own processor. I could
> simply check that field for data for my reports. Either the field will have
> data or it won’t.
>
> Thanks
>
> Sid
>
> Sent from my iPhone
>
> > On Jan 8, 2018, at 4:38 PM, John Blythe  wrote:
> >
> > you could use the keepwords functionality. have a field that only keeps
> > profanity and then you can query against that field having its default
> > value vs. profane text
> >
> > --
> > John Blythe
> >
> >> On Mon, Jan 8, 2018 at 3:12 PM, Sadiki Latty  wrote:
> >>
> >> Hey
> >>
> >> I would like to find a solution to flag (at index time) profanity.
> >> Optimally, it would be good if it functioned similarly to stopwords, in
> >> the sense that I can have a predefined list that is read, and if a token
> >> is on the list, the document is 'flagged' in a different field. Does
> >> anyone know of a solution (outside of configuring my own)? If none exists
> >> and I end up configuring my own, would I be doing this in the update
> >> processor phase? I am still fairly new to Solr, but from what I've read,
> >> that seems to be the best place to look.
> >>
> >>
> >> Thanks,
> >>
> >> Sid
> >>
>
-- 
John Blythe


Re: Profanity

2018-01-08 Thread John Blythe
you could use the keepwords functionality. have a field that only keeps
profanity and then you can query against that field having its default
value vs. profane text

--
John Blythe

On Mon, Jan 8, 2018 at 3:12 PM, Sadiki Latty  wrote:

> Hey
>
> I would like to find a solution to flag (at index time) profanity.
> Optimally, it would be good if it functioned similarly to stopwords, in the
> sense that I can have a predefined list that is read, and if a token is on
> the list, the document is 'flagged' in a different field. Does anyone know
> of a solution (outside of configuring my own)? If none exists and I end up
> configuring my own, would I be doing this in the update processor phase? I
> am still fairly new to Solr, but from what I've read, that seems to be the
> best place to look.
>
>
> Thanks,
>
> Sid
>
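
A minimal sketch of the keepwords approach John describes (field, type, and
file names are illustrative): copy the text into a side field whose analyzer
keeps only terms found in a profanity list, then report on documents where
that field is non-empty.

   <fieldType name="profanity_only" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.KeepWordFilterFactory" words="profanity.txt" ignoreCase="true"/>
     </analyzer>
   </fieldType>
   <field name="profanity_flags" type="profanity_only" indexed="true" stored="false"/>
   <copyField source="comment_text" dest="profanity_flags"/>

   fq=profanity_flags:[* TO *]   (matches only documents with at least one listed term)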


Re: SolrCloud

2017-12-15 Thread John Davis
Thanks Erick. I agree SolrCloud is better than master/slave; however, we
have some questions about managing replicas separately vs with solrcloud.
For eg how much overhead do SolrCloud nodes have wrt memory/cpu/disk in
order to be able to sync pending index updates to other replicas? What
monitoring and safeguards are in place out of the box so too many pending
updates for unreachable replicas don't make the alive ones fall over? Or that
a new replica doesn't overwhelm an existing replica?

Of course everything works great when things are running well, but when
things go south our first priority is for solr not to fall over.

On Fri, Dec 15, 2017 at 9:41 AM, Erick Erickson 
wrote:

> The main advantage in SolrCloud in your setup is HA/DR. You say you
> have multiple replicas and shards. Either you have to index to each
> replica separately or you use master/slave replication. In either case
> you have to manage and fix the case where some node goes down. If
> you're using master/slave, if the master goes down you need to get in
> there and fix it, reassign the master, make config changes, restart
> Solr to pick them up, make sure you pick up any missed updates and all
> that.
>
> in SolrCloud that is managed for you. Plus, let's say you want to
> increase QPS capacity. In SolrCloud all you do is use the collections
> API ADDREPLICA command and you're done. It gets created (and you can
> specify exactly what node if you want), the index gets copied, new
> updates are automatically routed to it and it starts serving requests
> when it's synchronized all automagically. Symmetrically you can
> DELETEREPLICA if you have too much capacity.
>
> The price here is you have to get comfortable with maintaining
> ZooKeeper admittedly.
>
> Also in the 7x world you have different types of replicas, TLOG, PULL
> and NRT that combine some of the features of master/slave with
> SolrCloud.
>
> Generally my rule of thumb is the minute you get beyond a single shard
> you should move to SolrCloud. If all your data fits in one Solr core
> then it's less clear-cut, master/slave can work just fine. It Depends
> (tm) of course.
>
> Your use case is "implicit" (being renamed "manual") routing when you
> create your Solr collection. There are pros and cons here, but that's
> beyond the scope of your question. Your infrastructure should port
> pretty directly to SolrCloud. The short form is that all your indexing
> and/or querying is happening on a single node when using manual
> routing rather than in parallel. Of course executing parallel
> sub-queries imposes its own overhead.
>
> If your use-case for having these on a single shard it to segregate
> the data by some set (say users), you might want to consider just
> using separate _collections_ in SolrCloud where old_shard ==
> new_collection, basically all your routing is the same. You can create
> aliases pointing to multiple collections or specify multiple
> collections on the query, don't know if that fits your use case or not
> though.
>
>
> Best,
> Erick
>
> On Fri, Dec 15, 2017 at 9:03 AM, John Davis 
> wrote:
> > Hello,
> > We are thinking about migrating to SolrCloud. Our current setup is:
> > 1. Multiple replicas and shards.
> > 2. Each query typically hits a single shard only.
> > 3. We have an external system that assigns a document to a shard based on
> > it's origin and is also used by solr clients when querying to find the
> > correct shard to query.
> >
> > It looks like the biggest advantage of SolrCloud is #3 - to route
> document
> > to the correct shard & replicas when indexing and to route query
> similarly.
> > Given we already have a fairly reliable system to do this, are there
> other
> > benefits from migrating to SolrCloud?
> >
> > Thanks,
> > John
>
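
For reference, the ADDREPLICA and alias features Erick mentions are single
Collections API calls; sketches with illustrative names:

   curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=192.168.1.10:8983_solr'
   curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=all_users&collections=users_2016,users_2017'

The new replica copies its index from the shard leader and begins serving once
it has caught up; DELETEREPLICA is the symmetric call for shrinking capacity,
and an alias lets clients query several collections under one name.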


SolrCloud

2017-12-15 Thread John Davis
Hello,
We are thinking about migrating to SolrCloud. Our current setup is:
1. Multiple replicas and shards.
2. Each query typically hits a single shard only.
3. We have an external system that assigns a document to a shard based on
it's origin and is also used by solr clients when querying to find the
correct shard to query.

It looks like the biggest advantage of SolrCloud is #3 - to route document
to the correct shard & replicas when indexing and to route query similarly.
Given we already have a fairly reliable system to do this, are there other
benefits from migrating to SolrCloud?

Thanks,
John


PayloadScoreQuery always returns score of zero

2017-12-13 Thread John Anonymous
The PayloadScoreQuery always returns a score of zero, regardless of
payloads.  The PayloadCheckQParser works fine, so I know that I am
successfully indexing the payloads.   Details below

*payload field that I am searching on:*


*definition of payload field type:*




















*Adding some documents with payloads in my test:*

assertU(adoc(
"key", "1",
"report", "apple¯0 apple¯0 apple¯0"
));
assertU(adoc(
"key", "2",
"report", "apple¯1 apple¯1 text¯1"
));


*query:*
{!payload_score f=report v=apple func=sum}

*score (both documents have a score of zero):*


0.0 = SumPayloadFunction.docScore()


0.0 = SumPayloadFunction.docScore()

  

I have tried using func=max as well, but it makes no difference.  Can
anyone help me with what I am missing here?
Thanks!
Johnathan
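
For anyone hitting the same zero score: payload_score can only sum payloads
that the field's analysis chain decodes as numbers, so the (stripped) field
type above needs a delimited-payload filter whose encoder is float (or
integer) and whose delimiter matches the one used at index time. A hedged
sketch using the "¯" delimiter from the test documents:

   <fieldType name="payloads" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="¯" encoder="float"/>
     </analyzer>
   </fieldType>
   <field name="report" type="payloads" indexed="true" stored="true"/>

A mismatch between the encoder used at index time and what payload_score
expects is a common way to end up with 0.0 scores, though other schema
details can have the same effect.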


Solr index size statistics

2017-12-02 Thread John Davis
Hello,
Is there a way to get index size statistics for a given solr instance? For
eg broken down by each field stored or indexed. The only things I know of are
running du on the index data files and getting counts per field
indexed/stored; however, each field can be quite different wrt size.

Thanks
John
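
Solr does not break index size down by field out of the box, but the Lucene
file extensions in the data/index directory map to specific structures, which
makes a du/ls breakdown more informative than a single total. A sketch (path
is illustrative):

   ls -lS /var/solr/data/mycore/data/index | head
   # .fdt/.fdx   stored fields        .tim/.tip   term dictionary
   # .doc/.pos   postings/positions   .dvd/.dvm   docValues
   # .nvd/.nvm   norms                .tvd/.tvx   term vectors

Per-field indexed term counts (though not byte sizes) are also available from
the Luke handler at /solr/<core>/admin/luke.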


Re: does the payload_check query parser have support for simple query parser operators?

2017-11-30 Thread John Anonymous
Ok, thanks.  Do you know if there are any plans to support special syntax
in the future?

On Thu, Nov 30, 2017 at 5:04 AM, Erik Hatcher 
wrote:

> No it doesn’t.   The payload parsers currently just do simple tokenization,
> no special syntax supported.
>
>  Erik
>
> > On Nov 30, 2017, at 02:41, John Anonymous  wrote:
> >
> > I would like to use wildcards and fuzzy search with the payload_check
> query
> > parser. Are these supported?
> >
> > {!payload_check f=text payloads='NOUN'}apple~1
> >
> > {!payload_check f=text payloads='NOUN'}app*
> >
> > Thanks
>

