Question about SolrCloud joining

2019-08-19 Thread 王立生
Hello,

I have a question about SolrCloud joins. I knew that a SolrCloud join only works when the "from" index is not split into shards, but when I tested it I ran into behavior that confused me.

I tested on version 8.2.

Assume I have two collections like the "joining" sample on the official Solr website: one collection called "movies" and another called "movieDirectors".

movies fields: id, title, director_id
movieDirectors fields: id, name, has_oscar

I started two nodes on my laptop; the shard and replica layout is shown below:
[image: image.png]
movieDirectors has 2 docs:
[image: image.png]
movies also has 2 docs:
[image: image.png]
Everything is fine when I run the query "{!join from=id fromIndex=movieDirectors to=director_id}has_oscar:true" on both 8983 and 8984; I get the expected result:
[image: image.png]
But when I run "{!join from=director_id fromIndex=movies to=id}title:"Dunkirk"" on 8983, I get 1 doc, and if I filter by "title:Get Out" I get nothing. I understand that "Get Out" does not exist on 8983.
[image: image.png]
[image: image.png]
But here is the question: when I run "{!join from=director_id fromIndex=movies to=id}title:"Dunkirk"" on 8984, I get "SolrCloud join: multiple shards not yet supported movies" no matter what the filter value is.

I found the following code:
[image: image.png]

When I run the join from movies on 8983, the slice count is 2 because movies has 2 shards. "fromReplica" is assigned in the second iteration, because the ZkController node name is 8983 while the replica's node is 8984 in the first iteration.

But when the query runs on 8984, "fromReplica" is assigned in the first iteration, because the ZkController node name and the replica's node are both 8984; the second iteration then throws "SolrCloud join: multiple shards not yet supported".

Sorry this got long; thanks for your patience. I am confused about why the code detects "multiple shards" this way, because the result is also wrong when running on 8983, even though no exception is thrown there. Why not use slice count > 1 to detect "multiple shards"? Or is there a better way?

Please advise.

Thanks in advance!


8.2.0 getting warning - unable to load jetty, not starting JettyAdminServer

2019-08-19 Thread Arnold Bronley
Hi,

I am getting the following warning in the Solr admin UI logs. I did not get this
warning in Solr 8.1.1.
Please note that I am using the Solr Docker slim image from here -
https://hub.docker.com/_/solr/

Unable to load jetty, not starting JettyAdminServer


Re: HttpShardHandlerFactory

2019-08-19 Thread Michael Gibney
Mark,

Another thing to check: I believe the configuration you posted may
not actually be taking effect. Unless I'm mistaken, the correct
element name to configure the shard handler is "shardHandler*Factory*", not
"shardHandler" ... as in, '...'

The element name is documented correctly in the refGuide page for "Format
of solr.xml":
https://lucene.apache.org/solr/guide/8_1/format-of-solr-xml.html#the-shardhandlerfactory-element

... but the incorrect (?) element name is included in the refGuide page for
"Distributed Requests":
https://lucene.apache.org/solr/guide/8_1/distributed-requests.html#configuring-the-shardhandlerfactory

Michael
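For anyone following along, a minimal solr.xml fragment using the element name from the "Format of solr.xml" page, filled in with the timeout guidance from Shawn's reply quoted below (a sketch, not an authoritative recommendation):

```xml
<solr>
  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <!-- TCP connect timeout: ~5 seconds is plenty on a LAN -->
    <int name="connTimeout">5000</int>
    <!-- idle/read timeout: well above your slowest expected request -->
    <int name="socketTimeout">120000</int>
  </shardHandlerFactory>
</solr>
```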

On Fri, Aug 16, 2019 at 9:40 AM Shawn Heisey  wrote:

> On 8/16/2019 3:51 AM, Mark Robinson wrote:
> > I am trying to understand the socket time out and connection time out in
> > the HttpShardHandlerFactory:-
> >
> > <shardHandler class="HttpShardHandlerFactory">
> >   <int name="connTimeout">10</int>
> >   <int name="socketTimeout">20</int>
> > </shardHandler>
>
> The shard handler is used when that Solr instance needs to make
> connections to another Solr instance (which could be itself, as odd as
> that might sound).  It does not apply to the requests that you make from
> outside Solr.
>
> > 1. Could someone please help me understand the effect of using such low
> > values of 10 ms and 20 ms as given above inside my /select handler?
>
> A connection timeout of 10 milliseconds *might* result in connections
> not establishing at all.  This is translated down to the TCP socket as
> the TCP connection timeout -- the time limit imposed on making the TCP
> connection itself.  Which as I understand it, is the completion of the
> "SYN", "SYN/ACK", and "ACK" sequence.  If the two endpoints of the
> connection are on a LAN, you might never see a problem from this -- LAN
> connections are very low latency.  But if they are across the Internet,
> they might never work.
>
> The socket timeout of 20 milliseconds means that if the connection goes
> idle for 20 milliseconds, it will be forcibly closed.  So if it took 25
> milliseconds for the remote Solr instance to respond, this Solr instance
> would have given up and closed the connection.  It is extremely common
> for requests to take 100, 500, 2000, or more milliseconds to respond.
>
> > 2. What are the guidelines for setting these parameters? Should they be
> > low or high?
>
> I would probably use a value of about 5000 (five seconds) for the
> connection timeout if everything's on a local LAN.  I might go as high
> as 15 seconds if there's a high latency network between them, but five
> seconds is probably long enough too.
>
> For the socket timeout, you want a value that's considerably longer than
> you expect requests to ever take.  Probably somewhere between two and
> five minutes.
>
> > 3. How can I test the effect of this chunk of code after adding it to my
> > /select handler, i.e. I want to make sure the above code snippet is
> > working. That is why I gave such low values and thought that when I fire
> > a query I would get both timeout errors in the logs. But I did not!
> > Or is it that within the above time frame (10 ms, 20 ms), if no request
> > comes, the socket will time out and the connection will be lost? So to
> > test this, should I give a say 100 TPS load with these low values, then
> > increase the values to maybe 1000 ms and 1500 ms respectively, and see
> > fewer timeout error messages?
>
> If you were running a multi-server SolrCloud setup (or a single-server
> setup with multiple shards and/or replicas), you probably would see
> problems from values that low.  But if Solr never has any need to make
> connections to satisfy a request, then the values will never take effect.
>
> If you want to control these values for requests made from outside Solr,
> you will need to do it in your client software that is making the request.
>
> Thanks,
> Shawn
>


Re: Slow Indexing scaling issue

2019-08-19 Thread Furkan KAMACI
Hi Parmeshwor,

2 hours for 3 GB of data seems too slow. We scale up to PBs this way:

1) Ignore all commits from clients via IgnoreCommitOptimizeUpdateProcessorFactory.
2) Do heavy processing on an external Tika server instead of Solr Cell with
the embedded Tika feature.
3) Adjust autoCommit, softCommit and shard size according to your needs.
4) Adjust JVM parameters.
5) Do not use swap if you can.
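Points 1 and 3 above might look roughly like this in solrconfig.xml (chain name and all numbers are illustrative, not a recommendation):

```xml
<!-- swallow explicit commit/optimize requests from clients, returning 200 -->
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- let Solr commit on its own schedule instead -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>120000</maxTime>
</autoSoftCommit>
```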

Kind Regards,
Furkan KAMACI

On Tue, Aug 13, 2019 at 8:37 PM Erick Erickson 
wrote:

> Here’s some sample SolrJ code using Tika outside of Solr’s Extracting
> Request Handler, along with some info about why loading Solr with the job
> of extracting text is not optimal speed-wise:
>
> https://lucidworks.com/post/indexing-with-solrj/
>
> > On Aug 13, 2019, at 12:15 PM, Jan Høydahl  wrote:
> >
> > You may want to review
> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-SlowIndexing
> for some hints.
> >
> > Make sure to index with multiple parallel threads. Also remember that
> > using /extract on the Solr side is resource intensive and may make your
> > cluster slow and unstable. It is better to use Tika or similar on the
> > client side and send text docs to Solr.
> >
> > Jan Høydahl
> >
> >> 13. aug. 2019 kl. 16:52 skrev Parmeshwor Thapa <
> thapa.parmesh...@gmail.com>:
> >>
> >> Hi,
> >>
> >> We are having some issues scaling Solr indexing. Looking for
> >> suggestions.
> >>
> >> Setup: We have two SolrCloud (7.4) instances running in separate cloud
> >> VMs with an external ZooKeeper ensemble.
> >>
> >> We are sending async / non-blocking HTTP requests to index documents in
> >> Solr.
> >>
> >> 2 cloud VMs (4 cores * 32 GB)
> >>
> >> 16 GB allocated to the JVM
> >>
> >> We are sending all types of documents to Solr, which it extracts and
> >> indexes, using the /update/extract request handler.
> >>
> >> We have stopwords.txt and a dictionary (7 MB) for stemming.
> >>
> >>
> >>
> >> Issue: indexing speed is quite slow for us. It takes around 2 hours to
> >> index around 3 GB of data: 10,000 documents (PDF, XLS, Word, etc.). We
> >> are planning to index approximately 10 TB of data.
> >>
> >> Below are the solr config settings and schema:
> >>
> >>
> >>
> >> 
> >>
> >>   
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>
> >>  >> languageSet="auto" ruleType="APPROX" concat="true"/>
> >>
> >>   
> >>
> >> 
> >>
> >> 
> >>
> >>   
> >>
> >>  >> tokenizerModel="en-token.bin" sentenceModel="en-sent.bin"/>
> >>
> >>   
> >>
> >>  >> posTaggerModel="en-pos-maxent.bin"/>
> >>
> >>  >> dictionary="en-lemmatizer-again.dict.txt"/>
> >>
> >>
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>
> >> 
> >>
> >>
> >>
> >>  >> stored="false"/>
> >>
> >> 
> >>
> >>
> >>
> >>  >> indexed="true" stored="true"/>
> >>
> >>  >> required="true" stored="true"/>
> >>
> >>  >> indexed="true" stored="true"/>
> >>
> >>  >> indexed="true" stored="true"/>
> >>
> >>  >> stored="true"/>
> >>
> >>  >> indexed="true" stored="true"/>
> >>
> >>  >> indexed="true" stored="true" />
> >>
> >>  >> stored="true"/>
> >>
> >>  >> indexed="true" stored="true"/>
> >>
> >>  >> indexed="true" stored="true"/>
> >>
> >>  >> indexed="true" stored="false"/>
> >>
> >>  >> indexed="true" stored="false"/>
> >>
> >>  >> indexed="true" stored="true"/>
> >>
> >>  >> indexed="true" stored="true"/>
> >>
> >>  >> indexed="true" stored="true"/>
> >>
> >>
> >>
> >> 
> >>
> >> 
> >>
> >>
> >>
> >>  stored="false"
> >> docValues="false" />
> >>
> >>
> >>
> >> And below is the solrConfig,
> >>
> >>
> >>
> >> <codecFactory class="solr.SchemaCodecFactory">
> >>   <str name="compressionMode">BEST_COMPRESSION</str>
> >> </codecFactory>
> >>
> >>
> >>
> >> <autoCommit>
> >>   <maxDocs>1000</maxDocs>
> >>   <maxTime>60</maxTime>
> >>   <openSearcher>false</openSearcher>
> >> </autoCommit>
> >>
> >> <autoSoftCommit>
> >>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> >> </autoSoftCommit>
> >>
> >>
> >>
> >> <requestHandler name="/update/extract"
> >>     startup="lazy"
> >>     class="solr.extraction.ExtractingRequestHandler" >
> >>   <lst name="defaults">
> >>     <str name="lowernames">true</str>
> >>     <str name="uprefix">ignored_</str>
> >>     <str name="fmap.content">content</str>
> >>   </lst>
> >> </requestHandler>
> >>
> >> *Thanks,*
> >>
> >> *Parmeshwor Thapa*
>
>


Re: Multiple Request to solr from dotnet application

2019-08-19 Thread Jörn Franke
Hi,

Can you provide an example what you want to achieve?
Multiple requests in parallel?
Are those requests related?

Best regards 

> Am 19.08.2019 um 01:44 schrieb Prabhu Dhanaraj 
> :
> 
> Hi Team
> 
> I would like to know if there is any way we can combine multiple
> requests and send them to Solr.
> We are using a .NET application to send requests to Solr.
> Please let us know if there is any article or sample code related to this.
> 
> Thanks
> Prabhu
> 


Multiple Request to solr from dotnet application

2019-08-19 Thread Prabhu Dhanaraj
Hi Team

I would like to know if there is any way we can combine multiple requests
and send them to Solr.
We are using a .NET application to send requests to Solr.
Please let us know if there is any article or sample code related to this.

Thanks
Prabhu
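For what it's worth, and assuming "combine" means batching: Solr's /update handler accepts many documents in a single HTTP POST, so a .NET client can buffer documents and send them together. A hypothetical JSON body for POST /solr/<collection>/update (collection and field names are invented for illustration):

```json
[
  { "id": "1", "title_t": "first document" },
  { "id": "2", "title_t": "second document" },
  { "id": "3", "title_t": "third document" }
]
```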

American Express made the following annotations 

"This message and any attachments are solely for the intended recipient and may 
contain confidential or privileged information. If you are not the intended 
recipient, any disclosure, copying, use, or distribution of the information 

included in this message and any attachments is prohibited. If you have 
received this communication in error, please notify us by reply e-mail and 
immediately and permanently delete this message and any attachments. Thank 
you." 