Are values in solr.xml 32-bit or 64-bit?

2020-11-11 Thread elivis
I'm trying to find out the maximum values for the parameters specified in
the solr.xml file. Mainly I am interested in distribUpdateConnTimeout and
distribUpdateSoTimeout. I tried setting those values to 0 in the hope that
it would make the timeout infinite, but I don't think that worked. I want
to set these values to the maximum possible.

Just as an FYI, we have multiple very large collections (the total combined
index size of all shards is over 2TB for each collection, sometimes much
more than that) sharded across a handful of Solr nodes. So obviously the
queries take a long time, which is expected. I want to make sure the queries
don't time out and that they eventually return results. What is the best way
to achieve this? When I set these values to 10 minutes, I was getting timeout
errors in the Solr logs (timeouts were occurring in intra-cluster
communication).

Thank you in advance!
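
On the 32-bit versus 64-bit question, here is a rough sketch of the
arithmetic. It assumes (this is not confirmed anywhere in this thread) that
both timeouts are given in milliseconds and are parsed as signed 32-bit
integers, which is what the <int> elements used for them in the stock
solr.xml would suggest. The class and method names are made up for
illustration and are not part of Solr.

```java
import java.time.Duration;

// Hypothetical helper, not Solr code.
// ASSUMPTION: the timeout is a millisecond value held in a signed 32-bit int.
final class TimeoutCeiling {
    static final int MAX_TIMEOUT_MS = Integer.MAX_VALUE; // 2147483647

    // Largest timeout that such a setting could express.
    static Duration maxTimeout() {
        return Duration.ofMillis(MAX_TIMEOUT_MS);
    }

    // True if a millisecond value can be stored in a non-negative 32-bit int.
    static boolean fitsInInt(long millis) {
        return millis >= 0 && millis <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        Duration max = maxTimeout();
        System.out.println(MAX_TIMEOUT_MS + " ms is about " + max.toDays()
                + " days and " + max.toHoursPart() + " hours");
        System.out.println("10 minutes fits: "
                + fitsInInt(Duration.ofMinutes(10).toMillis()));
        System.out.println("30 days fits: "
                + fitsInInt(Duration.ofDays(30).toMillis()));
    }
}
```

Under that assumption the ceiling is a little under 25 days, so a 10 minute
value is nowhere near the limit and the timeout errors would have to come
from somewhere else.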





Re: SolrCloud shows cluster still healthy even the node data directory is deleted

2020-11-11 Thread Amy Bai
Hi Erick,

Thanks for your kind reply.
There are two things that confuse me:

1. Index/search queries keep failing because the data directory of one of
the nodes is gone, but the node is not marked as down.

2. The replicas on the failed node are not working, but the index/search
queries didn't fail over to other healthy replicas.

Regards,
Amy

From: Erick Erickson 
Sent: Monday, November 9, 2020 8:43 PM
To: solr-user@lucene.apache.org 
Subject: Re: SolrCloud shows cluster still healthy even the node data directory 
is deleted

Depends. *nix systems have delete-on-close semantics, that is, as
long as there’s a single file handle open, the file will still be
available to the process using it. Only when the last file handle is
closed will the file actually be deleted.

Solr (Lucene actually) has a file handle open to every file in the index
all the time.

These files aren’t visible when you do a directory listing. So if you
stop Solr, are the files gone? NOTE: When you start Solr again, if
there are existing replicas that are healthy then the entire index
should be copied from another replica….
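
The delete-on-close behaviour described above can be reproduced outside
Solr with a small sketch like the one below. It assumes a *nix file system;
on other platforms the delete may fail or behave differently. The class name
and the "segment" temp file are made up for illustration and have nothing to
do with how Lucene names or opens its files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo, not Solr/Lucene code.
final class DeleteOnCloseDemo {

    // Writes content to a temp file, deletes the file while a handle is
    // still open, then returns what can still be read through that handle.
    static String readAfterDelete(String content) {
        try {
            Path file = Files.createTempFile("segment", ".dat");
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            try (FileChannel handle =
                         FileChannel.open(file, StandardOpenOption.READ)) {
                Files.delete(file); // unlink while the handle is still open
                if (Files.exists(file)) {
                    throw new IllegalStateException("file is still listed");
                }
                ByteBuffer buf = ByteBuffer.allocate((int) handle.size());
                while (buf.hasRemaining() && handle.read(buf) != -1) {
                    // keep reading until the buffer is full or EOF is hit
                }
                return new String(buf.array(), 0, buf.position(),
                        StandardCharsets.UTF_8);
            } // the data only becomes unreachable once the handle is closed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read after delete: "
                + readAfterDelete("segment data"));
    }
}
```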

Best,
Erick

> On Nov 9, 2020, at 3:30 AM, Amy Bai  wrote:
>
> Hi community,
>
> I found that SolrCloud won't check the IO status if the SolrCloud process is
> alive.
> E.g. if I delete the SolrCloud data directory, no errors are reported,
> and I can still log in to the SolrCloud Admin UI to create/query
> collections.
> Is this reasonable?
> Can someone explain why SOLR handles it like this?
> Thanks so much.
>
>
> Regards,
> Amy



Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-11 Thread Edward Turner
Many thanks Walter, that's useful information. And yes, if we are able to
keep stopwords, then we will. We have been exploring stopword removal
because we've noticed it leads to a sizable drop in index size (5% in some
of our tests), which in turn had the knock-on effect of better performance.
(Also, unfortunately, we do not have the luxury of using super big
machines/storage -- so it's always a balancing act for us.)
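
For reference, the client-side preprocessing idea quoted further down
(removeStopWordsFrom) might look roughly like the sketch below. It is only
an illustration: it splits on whitespace rather than using the tokenizer
pattern from the field type, and positionsWithGaps is a made-up helper that
mimics how the analysis page numbers tokens when stopwords are dropped
inside Solr. The stopword list is the one given in the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not Solr code.
final class StopwordPreprocessor {
    // Stopword list copied from the original post in this thread.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
            "such", "that", "the", "their", "then", "there", "these", "they",
            "this", "to", "was", "will", "with", "which"));

    static boolean isStopword(String word) {
        return STOPWORDS.contains(word.toLowerCase(Locale.ROOT));
    }

    // Drops stopwords before the text ever reaches Solr, so the remaining
    // words end up at consecutive positions.
    static String removeStopWordsFrom(String text) {
        List<String> kept = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && !isStopword(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    // Numbers every word, then drops the stopwords, which leaves gaps in
    // the positions (simplified view of stopword removal inside an analyzer).
    static String positionsWithGaps(String text) {
        List<String> out = new ArrayList<>();
        int position = 0;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            position++;
            if (!isStopword(word)) {
                out.add("(" + position + ")" + word.toLowerCase(Locale.ROOT));
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String phrase = "Molecular cloning and evolution of the genes";
        System.out.println(positionsWithGaps(phrase));
        System.out.println(positionsWithGaps(removeStopWordsFrom(phrase)));
    }
}
```

With preprocessing, index and query text would both carry contiguous
positions, which is the sense in which position gaps get sidestepped. The
cost is that the stopwords are no longer searchable at all.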

Best,
Edd

Edward Turner


On Tue, 10 Nov 2020 at 16:22, Walter Underwood 
wrote:

> By far the simplest solution is to leave stopwords in the index. That also
> improves relevance, because it becomes possible to search for “vitamin a”
> or “to be or not to be”.
>
> Stopword removal was a performance and disk space hack from the 1960s. It
> is no longer needed. We were keeping stopwords in the index at Infoseek,
> back in 1996.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 10, 2020, at 1:16 AM, Edward Turner  wrote:
> >
> > Hi all,
> >
> > Okay, I've been doing more research about this problem and from what I
> > understand, phrase queries + stopwords are known to have some
> > difficulties working together in some circumstances.
> >
> > E.g.,
> >
> > https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
> > https://issues.apache.org/jira/browse/SOLR-6468
> >
> > I was thinking about workarounds, but each solution I've attempted
> > doesn't quite work.
> >
> > Therefore, maybe one possible solution is to take a step back and
> > preprocess index/query data going to Solr, something like:
> >
> > String wordsForSolr = removeStopWordsFrom(
> >     "This is pretend index or query data");
> > // wordsForSolr = "pretend index query data"
> >
> > Off the top of my head, this will bypass the position issues.
> >
> > I will give this a go, but was wondering whether this is something others
> > have done?
> >
> > Best wishes,
> > Edd
> >
> > 
> > Edward Turner
> >
> >
> > On Fri, 6 Nov 2020 at 13:58, Edward Turner  wrote:
> >
> >> Hi all,
> >>
> >> We are experiencing some unexpected behaviour for phrase queries which
> >> we believe might be related to the FlattenGraphFilterFactory and
> >> stopwords.
> >>
> >> Brief description: when performing a phrase query
> >> "Molecular cloning and evolution of the" => we get expected hits
> >> "Molecular cloning and evolution of the genes" => we get no hits
> >> (unexpected behaviour)
> >>
> >> I think it's worthwhile adding the analyzers we use to help you see what
> >> we're doing:
> >>  Analyzers 
> >>  >>   sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
> >>   
> >>   >> pattern="[- /()]+" />
> >>   >> ignoreCase="true" />
> >>   >> preserveOriginal="false" />
> >>  
> >>   >> generateNumberParts="1" splitOnCaseChange="0"
> preserveOriginal="0"
> >> splitOnNumerics="0" stemEnglishPossessive="1"
> >> generateWordParts="1"
> >> catenateNumbers="0" catenateWords="1" catenateAll="1" />
> >>  
> >>   
> >>   
> >>   >> pattern="[- /()]+" />
> >>   >> ignoreCase="true" />
> >>   >> preserveOriginal="false" />
> >>  
> >>   >> generateNumberParts="1" splitOnCaseChange="0"
> preserveOriginal="0"
> >> splitOnNumerics="0" stemEnglishPossessive="1"
> >> generateWordParts="1"
> >> catenateNumbers="0" catenateWords="0" catenateAll="0" />
> >>   
> >> 
> >>  End of Analyzers 
> >>
> >>  Stopwords 
> >> We use the following stopwords:
> >> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no,
> >> not, of, on, or, such, that, the, their, then, there, these, they, this,
> >> to, was, will, with, which
> >>  End of Stopwords 
> >>
> >>  Analysis Admin page output ---
> >> ... And to see what's going on when we're indexing/querying, I created a
> >> gist with an image of the (non-verbose) output of the analysis admin
> >> page for the index data/query "Molecular cloning and evolution of the
> >> genes":
> >>
> >> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
> >>
> >> Hopefully this link works, and you can see that the resulting terms and
> >> positions are identical until the FlattenGraphFilterFactory step in the
> >> "index" phase.
> >>
> >> Final stage of index analysis:
> >> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
> >>
> >> Final stage of query analysis:
> >> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
> >>
> >> The empty positions are because of stopwords (presumably)
> >>  End of Analysis Admin page output ---
> >>
> >> Main question:
> >> Could someone explain why the FlattenGraphFilterFactory changes the
> >> position of the "genes" token? From what