Re: Index Upgrader tool

2018-08-23 Thread damienk
Shawn, Is it possible to run optimize on the live collection? For example,
/solr/collection/update?commit=true&optimize=true

On Wed, 22 Aug 2018 at 06:50, Shawn Heisey  wrote:

> On 8/21/2018 2:29 AM, Artjoms Laivins wrote:
> > We are running Solr cloud with 3 nodes v. 6.6.2
> > We started with version 5, so we have some old index that we now need to
> safely move over to v. 7.
> > New data comes in several times per day.
> > Our questions are:
> >
> > Should we run the IndexUpgrader tool on one slave node that is down, or is
> it safe to run it while Solr is running and possible updates of the index are
> coming in?
> > If yes, when we start it again, will the leader update this node with new
> data only, or will it overwrite the index?
>
> It might not be possible to upgrade two major versions like that, even
> with IndexUpgrader.  There is only a guarantee of reading an index
> ORIGINALLY written by the previous major version.
>
> Even if it's possible to accomplish an upgrade, it is strongly
> recommended that you index from scratch anyway.
>
> You cannot run IndexUpgrader while Solr has the index open.  The index
> must be completely closed.  You cannot update an index while it is being
> upgraded.
>
> Thanks,
> Shawn
>
>


Re: need help with a complicated join query

2018-08-23 Thread damienk
I'm thinking something like this: q={!join v=id:doca_1 from=members to=id}
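The semantics of that join parser can be modeled in plain Python (hypothetical doc data; this mirrors {!join from=members to=id} applied to the inner query id:doca_1):

```python
# Hypothetical documents in one index: a DocA lists ids of DocBs in a
# multi-valued "members" field, as described in the question below.
docs = [
    {"id": "doca_1", "members": ["docb_1", "docb_3"]},
    {"id": "docb_1"},
    {"id": "docb_2"},
    {"id": "docb_3"},
]

def join(from_field, to_field, inner_matches):
    # Collect every value of from_field across the inner result set...
    keys = {v for d in inner_matches for v in d.get(from_field, [])}
    # ...then return the docs whose to_field value appears in that set.
    return [d for d in docs if d.get(to_field) in keys]

inner = [d for d in docs if d["id"] == "doca_1"]   # the v=id:doca_1 part
result = join("members", "id", inner)              # from=members to=id
assert [d["id"] for d in result] == ["docb_1", "docb_3"]
```

This returns the DocBs whose identifiers appear in DocA's members field, which is what the question asks for.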

On Fri, 24 Aug 2018 at 03:03, Steve Pruitt  wrote:

> At least it is complicated to me.  :>)
>
> We are investigating how to return a list of documents whose identifier
> is contained in a multi-value field in another document.
> The index consists of essentially two different documents sharing some
> common fields.
> To make it simple, I will refer to them as two different documents, but
> they are in the same index.
> DocA has a multi-value field that contains a set of identifiers from
> DocBs.  The multi-value field is named "members"
>
> I am trying to conceptualize a query join where for a given DocA the
> response contains those DocBs whose identifier is contained in DocA's
> members field.
>
> Not sure how to piece this together.
>
> I looked at function queries, but nothing jumped out.
>
> Any suggestions would be greatly appreciated.
>
> Thanks.
>
> -SP
>


Re: Local development and SolrCloud

2018-08-23 Thread John Blythe
Thanks everyone. I think we forgot that cloud doesn’t have to be clustered.
That local overhead being avoided makes it a much easier pill to swallow as
far as local performance (vs. having all the extra containers running in
docker)

Will see what we can spin up and ask questions if/as they arise!

On Wed, Aug 22, 2018 at 17:41 Erick Erickson 
wrote:

> I do quite a bit of "correctness" testing on a local stand-alone Solr,
> as Walter says, that's often easier to debug, especially when working
> through creating the proper analysis chains, do queries do what I
> expect and the like.
>
> That said, I'd never jump straight to SolrCloud implementations
> without my QA being on SolrCloud. Not only do subtle differences creep
> in, but some things simply aren't supported, e.g. group.func.
>
> And, as Sameer says, you can set up a SolrCloud environment on just
> your local laptop as many of the examples do for testing, there's
> nothing required about "the cloud" for SolrCloud; it's not even
> necessary to have separate machines.
>
> Best,
> Erick
>
> On Wed, Aug 22, 2018 at 5:34 PM, Walter Underwood 
> wrote:
> > We use Solr Cloud where we need sharding or near real time updates.
> > For non-sharded collections that are updated daily, we use master-slave.
> >
> > There are some scaling and management advantages to the loose
> > coupling in a master slave cluster. Just clone a slave instance and
> > fire it up. Also, load benchmarking is easier when indexing is on a
> > separate instance.
> >
> > In prod, we have 45 Solr hosts in four clusters.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Aug 22, 2018, at 5:23 PM, John Blythe  wrote:
> >>
> >> For those of you who are developing applications with solr and are using
> >> solrcloud in production: what are you doing locally? Cloud seems
> >> unnecessary locally besides testing strictly for cloud specific use
> cases
> >> or configurations. Am I totally off base there? We are considering
> keeping
> >> a “standard” (read: non-cloud) local solr environment locally for our
> >> development workflow and using cloud only for our remote environments.
> >> Curious to know how wise or stupid that play would be.
> >>
> >> Thanks for any info!
> >> --
> >> John Blythe
> >
>
-- 
John Blythe


Re: Solr unable to start up after setting up SSL in Solr 7.4.0

2018-08-23 Thread Zheng Lin Edwin Yeo
Thanks for the advice.

Regards,
Edwin

On Thu, 23 Aug 2018 at 17:43, Shawn Heisey  wrote:

> On 8/23/2018 2:42 AM, Jan Høydahl wrote:
> > Don't need a git checkout to pull a text file :)
> > https://github.com/apache/lucene-solr/blob/branch_7x/solr/bin/solr.cmd
> > https://github.com/apache/lucene-solr/blob/branch_7x/solr/server/scripts/cloud-scripts/zkcli.bat
>
>
> Good point.  That can save some considerable time, especially with a
> slow Internet connection.
>
> Thanks,
> Shawn
>
>


Re: Question on query time boosting

2018-08-23 Thread Kydryavtsev Andrey
Hi, Pratik

I believe that your observations are correct. 

Score for each individual query (in your example it's a wildcard query like
'concept_name:(*semantic*)^200') is calculated by a complex formula (one of the
possible implementations, with a good explanation, is described here
https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html),
but it can be simplified as follows:

score(doc, query) = query_boost * <raw_similarity_score>

The score for the full disjunction (by default) would be calculated as the sum
over every individual query that matched.

So the score for case1 would be:

score_for_case1(doc, query)
  = 200 * <concept_semantic> + 400 * <concept_machine>
    + 20 * <abstract_semantic> + 40 * <abstract_machine>
  = 10 * (20 * <concept_semantic> + 40 * <concept_machine>
    + 2 * <abstract_semantic> + 4 * <abstract_machine>)
  = 10 * score_for_case2(doc, query)

(where each <...> stands for the raw match score of the corresponding subquery)
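A tiny numeric sketch of the relationship above (the raw per-subquery scores are made-up values; only the boosts come from the original question):

```python
# Hypothetical raw similarity scores for the four subqueries on one document.
raw = {"concept_semantic": 0.7, "concept_machine": 0.3,
       "abstract_semantic": 1.2, "abstract_machine": 0.5}

# Boosts from case1 of the question; case2 is the same divided by 10.
case1 = {"concept_semantic": 200, "concept_machine": 400,
         "abstract_semantic": 20, "abstract_machine": 40}
case2 = {k: v / 10 for k, v in case1.items()}

def score(boosts, raw):
    # Disjunction score: sum of boost * raw match score over the subqueries.
    return sum(boosts[k] * raw[k] for k in raw)

# Same ranking, scores differ only by the constant factor 10.
assert abs(score(case1, raw) - 10 * score(case2, raw)) < 1e-9
```

Since every document's score is scaled by the same constant, ordering is unchanged, which matches Pratik's observation.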



Thank you,

Andrey Kudryavtsev

23.08.2018, 18:53, "Pratik Patel" :
> Hello All,
>
> I am trying to understand how exactly query time boosting works in solr.
> Primarily, I want to understand if absolute boost values matter or is it
> just the relative difference between various boost values which decides
> scoring. Let's take following two queries for example.
>
> // case1: q parameter
>
>>  concept_name:(*semantic*)^200 OR
>>  concept_name:(*machine*)^400 OR
>>  Abstract_note:(*semantic*)^20 OR
>>  Abstract_note:(*machine*)^40
>
> //case2: q parameter
>
>>  concept_name:(*semantic*)^20 OR
>>  concept_name:(*machine*)^40 OR
>>  Abstract_note:(*semantic*)^2 OR
>>  Abstract_note:(*machine*)^4
>
> Are these two queries any different?
>
> Relative boosting is same in both of them.
> I can see that they produce same results and ordering. Only difference is
> that the score in case1 is 10 times the score in case2.
>
> Thanks,
> Pratik


Re: Still not seeing Solr listening on 8983 after 30 seconds!

2018-08-23 Thread Abhijit Pawar
Hello,

Here are the log files:
solr.log:
https://drive.google.com/open?id=1gvgUuPx5ItbBU7wvPXd9clGJqKWQdWSJ
solr-8983-console.log:
https://drive.google.com/open?id=1062seYIoRsLL5dcCU9OHbxx7hoH6armX

Version of SOLR server is 5.4.1
For Heap Size not sure if this is useful:
Num Docs:25837 Max Doc:25857 Heap Memory Usage: -1 Deleted Docs:20
Version:42497 Segment Count:8
solrconfig.xml (updateLog section):
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
  </updateLog>


Thanks!

Regards,

Abhijit








On Thu, Aug 23, 2018 at 3:30 PM Shawn Heisey  wrote:

> On 8/23/2018 2:02 PM, Abhijit Pawar wrote:
> > Recently, with no change in the configuration or code, we are facing a
> > slowdown of approx 3 minutes while restarting the SOLR instance. Earlier
> > it used to come up in a few seconds; now it takes much longer.
> >
> > *Error Message Displayed:*
> > Waiting up to 30 seconds to see Solr running on port 8983 Still not
> seeing
> > Solr listening on 8983 after 30 seconds!
> > 2018-08-23 19:42:39.167 INFO  (main) [   ] o.e.j.u.log Logging
> initialized
> > @28490ms
>
> Can you grab the solr.log file a few minutes after requesting a solr
> start and make it available to us?  Attachments do not make it to the
> list - you'll need to put it on a paste site or a file sharing site and
> give us a URL to access it.
>
> I will also need to know what your max heap is and get an idea of how
> much data is being handled by the Solr instance.  The logfile will
> probably have the Solr version, but go ahead and tell me what version it
> is anyway.
>
> Thanks,
> Shawn
>
>


Re: Permission Denied when trying to connect to Solr running on a different server

2018-08-23 Thread cyndefromva
I'm using the sunspot gem. 

It can't be rails because I was able to index from the app server and I can
search from the rails console. It's just when I'm trying to access it from the
web application.

And yes, my logs are in /var/solr/logs and there was nothing new there. It
did write something when I reindexed and searched from my rails console.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Not possible to use NOT queries with Solr Export Handler?

2018-08-23 Thread Shawn Heisey

On 8/23/2018 1:12 PM, Antelmo Aguilar wrote:

I asked this question in the IRC channel, but I had to leave and was not able
to wait for a response, so I am sending it here instead in the hope that
someone can give me some insight on the issue I am experiencing.

So in our Solr setup, we use the Solr Export request handler.  Our users
are able to construct Not queries and export the results.  We were testing
this feature and noticed that Not queries do not return anything, but
normal queries do return results.  Is there a reason for this or am I
missing something that will allow NOT queries to work with the Export
handler?

I would really appreciate the help.


Purely negative queries do not work.  The reason is that a negative 
query says "subtract X".  But if you don't start with *SOMETHING*, then 
you are subtracting from nothing, so you get nothing.


In some simple cases, Solr is able to detect purely negative queries and 
implicitly add a starting point of *:* which is all documents.  The 
situations where this detection works are very limited.


To be absolutely sure that a negative query will work, start out with 
the *:* all docs query.  So instead of "-field:value" use "*:* 
-field:value".
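Shawn's point can be modeled with plain sets (hypothetical doc ids, not actual Solr internals):

```python
all_docs = {1, 2, 3, 4, 5}       # what the *:* query matches
matches_field_value = {2, 4}     # what field:value matches

# "-field:value" alone: subtracting from nothing yields nothing.
purely_negative = set() - matches_field_value
assert purely_negative == set()

# "*:* -field:value": start from all docs, then subtract the matches.
corrected = all_docs - matches_field_value
assert corrected == {1, 3, 5}
```

The export handler apparently does not apply the implicit *:* detection, so spelling it out explicitly is the safe approach.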


Thanks,
Shawn



Re: Still not seeing Solr listening on 8983 after 30 seconds!

2018-08-23 Thread Shawn Heisey

On 8/23/2018 2:02 PM, Abhijit Pawar wrote:

Recently, with no change in the configuration or code, we are facing a
slowdown of approx 3 minutes while restarting the SOLR instance. Earlier it
used to come up in a few seconds; now it takes much longer.

*Error Message Displayed:*
Waiting up to 30 seconds to see Solr running on port 8983 Still not seeing
Solr listening on 8983 after 30 seconds!
2018-08-23 19:42:39.167 INFO  (main) [   ] o.e.j.u.log Logging initialized
@28490ms


Can you grab the solr.log file a few minutes after requesting a solr 
start and make it available to us?  Attachments do not make it to the 
list - you'll need to put it on a paste site or a file sharing site and 
give us a URL to access it.


I will also need to know what your max heap is and get an idea of how 
much data is being handled by the Solr instance.  The logfile will 
probably have the Solr version, but go ahead and tell me what version it 
is anyway.


Thanks,
Shawn



Re: Permission Denied when trying to connect to Solr running on a different server

2018-08-23 Thread Shawn Heisey

On 8/23/2018 1:36 PM, cyndefromva wrote:

But when I try to access search through my web application I'm
getting Errno::EACCES Permission denied -- connect(2) for  port 8983.


This sounds like an error message from your rails app.  You may need to 
ask whoever created the Solr client that you are using.  There are no 
ruby clients from the Solr project -- it's guaranteed to be third party, 
and this mailing list is probably going to be unable to help with it.



I'm thinking this is probably some sort of file permission issue but I have
no clue what. I can't find anything written in the solr logs.


Because this is not a Solr error, I cannot tell you what it means.  Is 
there more detail to the error, or have you shared the whole thing?  
What ruby client for Solr are you using?


So you don't see anything in the logfile called solr.log, most likely in 
/var/solr/logs?


You may need to get a packet capture of the Solr traffic, assuming that 
the Solr traffic isn't using https.


Thanks,
Shawn



Still not seeing Solr listening on 8983 after 30 seconds!

2018-08-23 Thread Abhijit Pawar
Hello All,

Recently, with no change in the configuration or code, we are facing a
slowdown of approx 3 minutes while restarting the SOLR instance. Earlier it
used to come up in a few seconds; now it takes much longer.

*Error Message Displayed:*
Waiting up to 30 seconds to see Solr running on port 8983 Still not seeing
Solr listening on 8983 after 30 seconds!
2018-08-23 19:42:39.167 INFO  (main) [   ] o.e.j.u.log Logging initialized
@28490ms

Note:
We have installed SOLR on a Linux server with root user and restarting it
as root user.

Any suggestion or pointers greatly appreciated!
Thanks!

Regards,

Abhijit


Permission Denied when trying to connect to Solr running on a different server

2018-08-23 Thread cyndefromva
I have a ruby on rails application that uses solr and the sunspot rails gem
for search. For development I just run solr locally and that's been working
fine. But I'm trying to set up a stand-alone solr server for production. So
I installed it on its own server and created the core for my site. I updated
my application config file to point to the solr server and was able to
populate the index for my site from the command line on my application
server; so the server itself can communicate with the solr server. I was also
able to verify this via the Solr Admin panel; I can see that my data is
there. But when I try to access search through my web application I'm
getting Errno::EACCES Permission denied -- connect(2) for  port 8983.

I'm thinking this is probably some sort of file permission issue but I have
no clue what. I can't find anything written in the solr logs.

I'm using the default setup for solr using java 1.8 and solr 5.4.1. The
install directory is /opt/solr and the data directory is in /var/solr. Both
of these directories are owned by solr and the process is also started as
the solr user.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Not possible to use NOT queries with Solr Export Handler?

2018-08-23 Thread Antelmo Aguilar
Hello,

I asked this question in the IRC channel, but I had to leave and was not able
to wait for a response, so I am sending it here instead in the hope that
someone can give me some insight on the issue I am experiencing.

So in our Solr setup, we use the Solr Export request handler.  Our users
are able to construct Not queries and export the results.  We were testing
this feature and noticed that Not queries do not return anything, but
normal queries do return results.  Is there a reason for this or am I
missing something that will allow NOT queries to work with the Export
handler?

I would really appreciate the help.

-Antelmo


need help with a complicated join query

2018-08-23 Thread Steve Pruitt
At least it is complicated to me.  :>)

We are investigating how to return a list of documents whose identifier is 
contained in a multi-value field in another document.
The index consists of essentially two different documents sharing some common 
fields.
To make it simple, I will refer to them as two different documents, but they 
are in the same index.
DocA has a multi-value field that contains a set of identifiers from DocBs.  
The multi-value field is named "members"

I am trying to conceptualize a query join where for a given DocA the response 
contains those DocBs whose identifier is contained in DocA's members field.

Not sure how to piece this together.

I looked at function queries, but nothing jumped out.

Any suggestions would be greatly appreciated.

Thanks.

-SP


Question on query time boosting

2018-08-23 Thread Pratik Patel
Hello All,

I am trying to understand how exactly query time boosting works in solr.
Primarily, I want to understand if absolute boost values matter or is it
just the relative difference between various boost values which decides
scoring. Let's take following two queries for example.

// case1: q parameter

> concept_name:(*semantic*)^200 OR
> concept_name:(*machine*)^400 OR
> Abstract_note:(*semantic*)^20 OR
> Abstract_note:(*machine*)^40


//case2: q parameter

> concept_name:(*semantic*)^20 OR
> concept_name:(*machine*)^40 OR
> Abstract_note:(*semantic*)^2 OR
> Abstract_note:(*machine*)^4


Are these two queries any different?

Relative boosting is same in both of them.
I can see that they produce same results and ordering. Only difference is
that the score in case1 is 10 times the score in case2.

Thanks,
Pratik


Re: SOLR zookeeper connection timeout during startup is hardcoded to 10000ms

2018-08-23 Thread Erick Erickson
That's actually 10,000 ms, a typo in your message?

Do you have a situation where that setting is causing you trouble?
Because 10 seconds for communications with ZK is quite a long time,
I'm curious what the circumstances are that you're seeing.

Best,
Erick

On Wed, Aug 22, 2018 at 3:51 PM, Danny Shih  wrote:
> Hi,
>
> During startup in cloud mode, the SOLR zookeeper connection timeout appears 
> to be hardcoded to 1000ms:
> https://github.com/apache/lucene-solr/blob/5eab1c3c688a0d8db650c657567f197fb3dcf181/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ZkClientClusterStateProvider.java#L45
>
> And it is not configurable via zkClientTimeout (solr.xml) or SOLR_WAIT_FOR_ZK 
> (solr.in.sh).
>
> Is there a way to configure this, and if not, should I open a bug?
>
> Thanks,
> Danny


SOLR zookeeper connection timeout during startup is hardcoded to 10000ms

2018-08-23 Thread Danny Shih
Hi,

During startup in cloud mode, the SOLR zookeeper connection timeout appears to 
be hardcoded to 1000ms:
https://github.com/apache/lucene-solr/blob/5eab1c3c688a0d8db650c657567f197fb3dcf181/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ZkClientClusterStateProvider.java#L45

And it is not configurable via zkClientTimeout (solr.xml) or SOLR_WAIT_FOR_ZK 
(solr.in.sh).

Is there a way to configure this, and if not, should I open a bug?

Thanks,
Danny


Re: How to trace one query?the debug/debugQuery info are not enough to find out why a query is slow

2018-08-23 Thread Jan Høydahl
Shawn, the block cache seems to be off-heap according to 
https://lucene.apache.org/solr/guide/7_4/running-solr-on-hdfs.html 


So you have 800G across 4 nodes; that gives 500M docs and 200G of index data
per solr node, and 40G per shard.
Initially I'd say this is way too much data and too little RAM per node, but it
obviously works due to the very small docs you have.
So the first thing I'd try (after doing some analysis of various metrics for
your running cluster) would be to adjust the size of the HDFS block cache,
following the instructions from the link above. You'll have 20-25GB available
for this, which is only 1/10 of the index size.

The next step would be to replace the EC2 images with ones with more RAM,
increase the block cache further, and see if that helps.

Next I'd enable autoWarmCount on filterCache, find alternatives to the
wildcard query, and more.

But all in all, I'd be very satisfied with those low response times given the
size of your data.
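The capacity arithmetic in this reply can be checked quickly (figures taken from the thread: ~800G of index, 2 billion docs, 4 nodes, 20 shards):

```python
total_index_gb = 800          # ~787GB on HDFS, rounded up in the reply
total_docs = 2_000_000_000
nodes = 4
shards = 20

docs_per_node = total_docs // nodes
index_gb_per_node = total_index_gb // nodes
gb_per_shard = total_index_gb // shards

assert docs_per_node == 500_000_000   # 500M docs per node
assert index_gb_per_node == 200       # 200G of index data per node
assert gb_per_shard == 40             # 40G per shard
```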

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 23. aug. 2018 kl. 15:05 skrev Shawn Heisey :
> 
> On 8/23/2018 5:19 AM, zhenyuan wei wrote:
>> Thanks for your detail answer @Shawn
>> 
>> Yes I run the query in SolrCloud mode, and my collection has 20 shards,
> >> each shard size is 30~50GB.
> >> 4 solr servers, each solr JVM uses 6GB; HDFS datanodes are 4 too, each
> >> datanode JVM uses 2.5GB.
> >> Linux server hosts are 4 nodes too; each node is 16 cores/32GB RAM/1600GB SSD.
> >> 
> >> So, in order to search 2 billion docs fast (HDFS shows 787GB), should I
> >> turn on autowarm, and how much solr RAM / how many solr nodes should I
> >> use? Is there a rough formula to budget?
> 
> There are no generic answers, no rough formulas.  Every install is different 
> and minimum requirements are dependent on the specifics of the install.
> 
> How many replicas do you have of each of those 20 shards? Is the 787GB of 
> data the size of *one* replica, or the size of *all* replicas?  Based on the 
> info you shared, I suspect that it's the size of one replica.
> 
> Here's a guide I've written:
> 
> https://wiki.apache.org/solr/SolrPerformanceProblems
> 
> That guide doesn't consider HDFS, so the info about the OS disk cache on that 
> page is probably not helpful.  I really have no idea what requirements HDFS 
> has.  I *think* that the HDFS client block cache would replace the OS disk 
> cache, and that the Solr heap must be increased to accommodate that block 
> cache.  This might lead to GC issues, though, because ideally the cache would 
> be large enough to cache all of the index data that the Solr instance is 
> accessing.  In your case, that's a LOT of data, far more than you can fit 
> into the 32GB total system memory.Solr performance will suffer if you're not 
> able to have the system cache Solr's index data.  But I will tell you that 
> achieving a QTime of 125 on a wildcard query against a 2 billion document 
> index is impressive, not something I would expect to happen with the low 
> hardware resources you're using.
> 
> You have 20 shards.  If your replicationFactor is 3, then ideally you would 
> have 60 servers - one for each shard replica. Each server would have enough 
> memory installed that it could cache the 30-50GB of data in that shard, or at 
> least MOST of it.
> 
> IMHO, Solr should be using local storage, not a network filesystem like HDFS. 
>  Things are a lot more straightforward that way.
> 
> Thanks,
> Shawn
> 



Re: How to trace one query?the debug/debugQuery info are not enough to find out why a query is slow

2018-08-23 Thread Shawn Heisey

On 8/23/2018 5:19 AM, zhenyuan wei wrote:

Thanks for your detail answer @Shawn

Yes I run the query in SolrCloud mode, and my collection has 20 shards,
each shard size is 30~50GB.
4 solr servers, each solr JVM uses 6GB; HDFS datanodes are 4 too, each
datanode JVM uses 2.5GB.
Linux server hosts are 4 nodes too; each node is 16 cores/32GB RAM/1600GB SSD.

So, in order to search 2 billion docs fast (HDFS shows 787GB), should I
turn on autowarm, and how much solr RAM / how many solr nodes should I use?
Is there a rough formula to budget?


There are no generic answers, no rough formulas.  Every install is 
different and minimum requirements are dependent on the specifics of the 
install.


How many replicas do you have of each of those 20 shards? Is the 787GB 
of data the size of *one* replica, or the size of *all* replicas?  Based 
on the info you shared, I suspect that it's the size of one replica.


Here's a guide I've written:

https://wiki.apache.org/solr/SolrPerformanceProblems

That guide doesn't consider HDFS, so the info about the OS disk cache on 
that page is probably not helpful.  I really have no idea what 
requirements HDFS has.  I *think* that the HDFS client block cache would 
replace the OS disk cache, and that the Solr heap must be increased to 
accommodate that block cache.  This might lead to GC issues, though, 
because ideally the cache would be large enough to cache all of the 
index data that the Solr instance is accessing.  In your case, that's a 
LOT of data, far more than you can fit into the 32GB total system 
memory. Solr performance will suffer if you're not able to have the 
system cache Solr's index data.  But I will tell you that achieving a 
QTime of 125 on a wildcard query against a 2 billion document index is 
impressive, not something I would expect to happen with the low hardware 
resources you're using.


You have 20 shards.  If your replicationFactor is 3, then ideally you 
would have 60 servers - one for each shard replica. Each server would 
have enough memory installed that it could cache the 30-50GB of data in 
that shard, or at least MOST of it.


IMHO, Solr should be using local storage, not a network filesystem like 
HDFS.  Things are a lot more straightforward that way.


Thanks,
Shawn



Re: How to trace one query?the debug/debugQuery info are not enough to find out why a query is slow

2018-08-23 Thread zhenyuan wei
Thanks for your detail answer @Shawn

Yes I run the query in SolrCloud mode, and my collection has 20 shards,
each shard size is 30~50GB.
4 solr servers, each solr JVM uses 6GB; HDFS datanodes are 4 too, each
datanode JVM uses 2.5GB.
Linux server hosts are 4 nodes too; each node is 16 cores/32GB RAM/1600GB SSD.

So, in order to search 2 billion docs fast (HDFS shows 787GB), should I
turn on autowarm, and how much solr RAM / how many solr nodes should I use?
Is there a rough formula to budget?

Thanks again ~
TinsWzy



Shawn Heisey  于2018年8月23日周四 下午6:19写道:

> On 8/23/2018 4:03 AM, Shawn Heisey wrote:
> > Configuring caches cannot speed up the first time a query runs.  That
> > speeds up later runs.  To speed up the first time will require two
> > things:
> >
> > 1) Ensuring that there is enough memory in the system for the
> > operating system to effectively cache the index.  This is memory
> > *beyond* the java heap that is not allocated to any program.
>
> Followup, after fully digesting the latest reply:
>
> HDFS changes things a little bit.  You would need to talk to somebody
> about caching HDFS data effectively.  I think that in that case, you
> *do* need to use the heap to create a large HDFS client cache, but I
> have no personal experience with HDFS, so I do not know for sure.  Note
> that having a very large heap can make garbage collection pauses become
> extreme.
>
> With 2 billion docs, I'm assuming that you're running SolrCloud and that
> the index is sharded.  SolrCloud gives you query load balancing for
> free.  But I think you're probably going to need a lot more than 4
> servers, and each server is probably going to need a lot of memory.  You
> haven't indicated how many shards or replicas are involved here.  For
> optimal performance, every shard needs to be on a separate server.
>
> Searching 2 billion docs, especially with wildcards, may not be possible
> to get working REALLY fast.  Without a LOT of hardware, particularly
> memory, it can be completely impractical to cache that much data.
> Terabytes of memory is *very* expensive, especially if it's scattered
> across many servers.
>
> Thanks,
> Shawn
>
>


Re: Want to start contributing.

2018-08-23 Thread Charlie Hull

On 20/08/2018 18:45, Rohan Chhabra wrote:

Hi all,

I am an absolute beginner (dummy) in the field of contributing to open source,
but I am interested in contributing. How do I start? Solr is
a java based search engine built on Lucene. I am good at Java and therefore
chose this to start.

I need guidance. Help required!!



A related topic: we are running two free Lucene Hackdays, in London on 
October 9th and Montreal on October 15th (the week of the Activate 
conference):

https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/252740719/
https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/253610289/

This would be a great place to meet and learn from existing Lucene 
committers.


Best

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: How to trace one query?the debug/debugQuery info are not enough to find out why a query is slow

2018-08-23 Thread Shawn Heisey

On 8/23/2018 4:03 AM, Shawn Heisey wrote:
Configuring caches cannot speed up the first time a query runs.  That 
speeds up later runs.  To speed up the first time will require two 
things:


1) Ensuring that there is enough memory in the system for the 
operating system to effectively cache the index.  This is memory 
*beyond* the java heap that is not allocated to any program.


Followup, after fully digesting the latest reply:

HDFS changes things a little bit.  You would need to talk to somebody 
about caching HDFS data effectively.  I think that in that case, you 
*do* need to use the heap to create a large HDFS client cache, but I 
have no personal experience with HDFS, so I do not know for sure.  Note 
that having a very large heap can make garbage collection pauses become 
extreme.


With 2 billion docs, I'm assuming that you're running SolrCloud and that 
the index is sharded.  SolrCloud gives you query load balancing for 
free.  But I think you're probably going to need a lot more than 4 
servers, and each server is probably going to need a lot of memory.  You 
haven't indicated how many shards or replicas are involved here.  For 
optimal performance, every shard needs to be on a separate server.


Searching 2 billion docs, especially with wildcards, may not be possible 
to get working REALLY fast.  Without a LOT of hardware, particularly 
memory, it can be completely impractical to cache that much data.  
Terabytes of memory is *very* expensive, especially if it's scattered 
across many servers.


Thanks,
Shawn



Re: How to trace one query?the debug/debugQuery info are not enough to find out why a query is slow

2018-08-23 Thread Shawn Heisey

On 8/23/2018 3:41 AM, zhenyuan wei wrote:

Thank you very much to answer.  @Jan Høydahl
My query is simple, just a wildcard on the last 2 chars in this query (I have
more other queries to optimize)

  curl "
http://emr-worker-1:8983/solr/collection005/query?q=v10_s:YY*&rows=10&&fl=id&echoParams=all
"


I think that's the answer right there -- wildcard query. Wildcard 
queries have a tendency to be slow, because of how they work.  What is 
the nature of your v10_s field?  Does that wildcard query match a lot of 
terms?  When a wildcard query executes, Solr asks the index for all 
terms that match it, and then constructs a query with all of those terms 
in it.  If there are ten million terms that match the wildcard, the 
query will *quite literally* have ten million entries inside it.  Every 
one of the terms will need to be separately searched against the index.  
Each term will be fast, but it adds up if there are a lot of them.  This 
query had a numFound larger than one hundred thousand, which suggests 
that there were at least that many terms in the query. So basically, in 
the time it took, Solr first gathered a huge list of terms, and then 
internally executed over one hundred thousand individual queries.
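The expansion Shawn describes can be sketched roughly like this (a toy term dictionary with made-up values, not Lucene's actual rewrite code):

```python
from fnmatch import fnmatch

# Toy term dictionary for the v10_s field (hypothetical values).
terms = ["YY01", "YY02", "YYzz", "XA01", "ZB99"]

def expand_wildcard(pattern, terms):
    # Solr/Lucene first collects every indexed term matching the pattern...
    return [t for t in terms if fnmatch(t, pattern)]

expanded = expand_wildcard("YY*", terms)
assert expanded == ["YY01", "YY02", "YYzz"]
# ...then effectively runs one term query per matching term and unions the
# results. With millions of matching terms, this expansion dominates the time.
```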


Changing your field definition so you can avoid wildcard queries will go 
a long way towards speeding things up. Typically this involves some kind 
of ngram tokenizer or filter. It will make the index much larger, but 
tends to speed queries up.
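A minimal sketch of the ngram idea (illustrative only; in a real Solr schema this would be an EdgeNGramFilterFactory or NGramFilterFactory in the field's index-time analysis chain):

```python
def edge_ngrams(term, min_len=2, max_len=4):
    # Index every prefix of the term between min_len and max_len characters,
    # so a prefix query like "YY*" becomes an exact lookup on the gram "YY".
    return [term[:n] for n in range(min_len, min(max_len, len(term)) + 1)]

grams = edge_ngrams("YY01")
assert grams == ["YY", "YY0", "YY01"]
# The index grows (several grams per term), but the prefix query becomes a
# single exact term lookup instead of a huge expanded disjunction.
```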


Your example says the QTime is 125 milliseconds, and your message talks 
about times of 40 milliseconds.  This is NOT slow. If you're trying to 
maximize queries per second, you need to know that handling a high query 
load requires multiple servers handling multiple replicas of your index, 
and some kind of load balancing.


Configuring caches cannot speed up the first time a query runs.  That 
speeds up later runs.  To speed up the first time will require two things:


1) Ensuring that there is enough memory in the system for the operating 
system to effectively cache the index.  This is memory *beyond* the java 
heap that is not allocated to any program.
2) Changing the query to a type that executes faster and adjusting the 
schema to allow the new type to work.  Wildcard queries are one of the 
worst options.


In a later message, you indicated that your cache autowarmCount values 
are mostly set to zero.  This means that any time you make a change to 
the index, your caches start out completely empty, and the one cache 
with a nonzero setting is using NoOpRegenerator, so it is not actually 
doing any warming either.  With autowarming, the most recent entries in 
the old cache are re-executed to warm the new caches.  This can help 
with performance, but if you make autowarmCount too large, it will make 
commits take a very long time.  Note that documentCache doesn't do 
warming at all, so that setting is irrelevant on that cache.
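For illustration, a modest autowarm setup in solrconfig.xml might look like the following (the sizes and counts are example starting points only, not recommendations from this thread):

```xml
<!-- Re-execute the 32 most recent entries when a new searcher opens. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="32"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
<!-- documentCache cannot autowarm, so its autowarmCount has no effect. -->
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
```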


Thanks,
Shawn



Re: How to trace one query? The debug/debugQuery info are not enough to find out why a query is slow

2018-08-23 Thread zhenyuan wei
I have 4 Solr servers, each allocated 6GB. My dataset on HDFS is 787GB:
2 billion documents in total, each document 300 bytes. Below is my
cache-related configuration.





20
200


On Thu, 23 Aug 2018 at 17:41, zhenyuan wei wrote:

> Thank you very much for the answer, @Jan Høydahl.
> My query is simple, just a wildcard on the last 2 chars in this query (I
> have other queries to optimize too).
>
>  curl "
> http://emr-worker-1:8983/solr/collection005/query?q=v10_s:YY*&rows=10&&fl=id&echoParams=all
> "
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "*QTime":125,*
> "params":{
>   "q":"v10_s:YY*",
>   "df":"_text_",
>   "echoParams":"all",
>   "indent":"true",
>   "fl":"id",
>   "rows":"10",
>   "wt":"json"}},
>   "response":{"numFound":118181,"start":0,"maxScore":1.0,"docs":[
> ...
>   }}
>
> The first time the query runs, it returns slowly; the second time it
> returns very fast. It is probably cached already.
> With "debugQuery=true&shards.info=true", I can see all shards took 40
> or more milliseconds.
> The second time the query executes, each shard spends only 1~5
> milliseconds.
> But I don't know how to optimize the first execution: filterCache?
> queryResultCache? Or HDFS blockCache? And how much RAM should I
> allocate to them?
>
> I want to get QTime down to 50ms or less the first time a query executes.
>
>
>
>
On Thu, 23 Aug 2018 at 17:19, Jan Høydahl wrote:
>
>> Hi,
>>
>> With debugQuery you see the timings. What component spends the most time?
>> With shards.info=true you see what shard is the slowest, if your index
>> is sharded.
>> With echoParams=all you get the full list of query parameters in use,
>> perhaps you spot something?
>> If you start Solr with -v option then you get more verbose logging in
>> solr.log which may help
>>
>> Can you share with us what your query looks like, including all
>> parameters from the  section with echoParams=all enabled?
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>>
>> > On 23 Aug 2018 at 11:09, zhenyuan wei wrote:
>> >
>> > Hi all,
>> > I care about query performance, but do not know how to find out the
>> > reason why a query is so slow.
>> > *How to trace one query?* The debug/debugQuery info are not enough to
>> > find out why a query is slow.
>> >
>> >
>> > Thanks a lot~
>>
>>


Re: Solr unable to start up after setting up SSL in Solr 7.4.0

2018-08-23 Thread Shawn Heisey

On 8/23/2018 2:42 AM, Jan Høydahl wrote:

Don't need a git checkout to pull a text file :)
https://github.com/apache/lucene-solr/blob/branch_7x/solr/bin/solr.cmd 

https://github.com/apache/lucene-solr/blob/branch_7x/solr/server/scripts/cloud-scripts/zkcli.bat
 



Good point.  That can save some considerable time, especially with a 
slow Internet connection.


Thanks,
Shawn



Re: How to trace one query? The debug/debugQuery info are not enough to find out why a query is slow

2018-08-23 Thread zhenyuan wei
Thank you very much for the answer, @Jan Høydahl.
My query is simple, just a wildcard on the last 2 chars in this query (I
have other queries to optimize too).

 curl "
http://emr-worker-1:8983/solr/collection005/query?q=v10_s:YY*&rows=10&&fl=id&echoParams=all
"
{
  "responseHeader":{
"zkConnected":true,
"status":0,
"*QTime":125,*
"params":{
  "q":"v10_s:YY*",
  "df":"_text_",
  "echoParams":"all",
  "indent":"true",
  "fl":"id",
  "rows":"10",
  "wt":"json"}},
  "response":{"numFound":118181,"start":0,"maxScore":1.0,"docs":[
...
  }}

The first time the query runs, it returns slowly; the second time it
returns very fast. It is probably cached already.
With "debugQuery=true&shards.info=true", I can see all shards took 40 or
more milliseconds.
The second time the query executes, each shard spends only 1~5
milliseconds.
But I don't know how to optimize the first execution: filterCache?
queryResultCache? Or HDFS blockCache? And how much RAM should I allocate
to them?

I want to get QTime down to 50ms or less the first time a query executes.




On Thu, 23 Aug 2018 at 17:19, Jan Høydahl wrote:

> Hi,
>
> With debugQuery you see the timings. What component spends the most time?
> With shards.info=true you see what shard is the slowest, if your index is
> sharded.
> With echoParams=all you get the full list of query parameters in use,
> perhaps you spot something?
> If you start Solr with -v option then you get more verbose logging in
> solr.log which may help
>
> Can you share with us what your query looks like, including all
> parameters from the  section with echoParams=all enabled?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 23 Aug 2018 at 11:09, zhenyuan wei wrote:
> >
> > Hi all,
> > I care about query performance, but do not know how to find out the
> > reason why a query is so slow.
> > *How to trace one query?* The debug/debugQuery info are not enough to
> > find out why a query is slow.
> >
> >
> > Thanks a lot~
>
>


Re: How to trace one query? The debug/debugQuery info are not enough to find out why a query is slow

2018-08-23 Thread Jan Høydahl
Hi,

With debugQuery you see the timings. What component spends the most time?
With shards.info=true you see what shard is the slowest, if your index is 
sharded.
With echoParams=all you get the full list of query parameters in use, perhaps 
you spot something?
If you start Solr with -v option then you get more verbose logging in solr.log 
which may help

Can you share with us what your query looks like, including all parameters from 
the  section with echoParams=all enabled?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 23 Aug 2018 at 11:09, zhenyuan wei wrote:
> 
> Hi all,
> I care about query performance, but do not know how to find out the reason
> why a query is so slow.
> *How to trace one query?* The debug/debugQuery info are not enough to find
> out why a query is slow.
> 
> 
> Thanks a lot~



How to trace one query? The debug/debugQuery info are not enough to find out why a query is slow

2018-08-23 Thread zhenyuan wei
Hi all,
I care about query performance, but do not know how to find out the reason
why a query is so slow.
*How to trace one query?* The debug/debugQuery info are not enough to find
out why a query is slow.


Thanks a lot~


Re: Solr unable to start up after setting up SSL in Solr 7.4.0

2018-08-23 Thread Jan Høydahl
Don't need a git checkout to pull a text file :) 
https://github.com/apache/lucene-solr/blob/branch_7x/solr/bin/solr.cmd 

https://github.com/apache/lucene-solr/blob/branch_7x/solr/server/scripts/cloud-scripts/zkcli.bat
 


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 23 Aug 2018 at 05:51, Shawn Heisey wrote:
> 
> On 8/22/2018 8:31 PM, Zheng Lin Edwin Yeo wrote:
>> Hi Jan and Shawn,
>> 
>> So far I am still getting the error from the workaround and quick fix
>> methods.
>> Not sure if it is good to continue to use the files from Solr 7.3.1 while
>> waiting for the release of Solr 7.5.0?
> 
> You could check out the source code, switch to branch_7x, and copy 
> solr\bin\solr.cmd from there to your 7.4.0 install's bin directory.
> 
> Change to a suitable directory, and run the following commands.  A new 
> directory inside the current directory will be created.  By default it will 
> have the name "lucene-solr".
> 
> git clone https://git-wip-us.apache.org/repos/asf/lucene-solr.git
> cd lucene-solr
> git checkout branch_7x
> 
> This is where you can obtain git for windows:
> 
> https://git-scm.com/download/win
> 
> Thanks,
> Shawn
>