Re: Limit Solr to search for only 500 records based on the search criteria

2016-06-16 Thread Erick Erickson
What is? You have to separate getting the response from rendering, so first
I'd measure with curl or similar. Second, returning 1,000 records is
usually a bad idea; when returning that many rows you may want to use the
export handler.
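
For reference, on Solr versions where /export is not registered implicitly, the handler is declared in solrconfig.xml roughly as follows (a sketch; it requires docValues on the sorted and returned fields — check the reference guide for your version):

```xml
<!-- Export handler: streams complete sorted result sets instead of paging. -->
<requestHandler name="/export" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="rq">{!xport}</str>
    <str name="wt">xsort</str>
    <str name="distrib">false</str>
  </lst>
  <arr name="components">
    <str>query</str>
  </arr>
</requestHandler>
```

Requests then go to /export with q, sort, and fl parameters rather than start/rows paging.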

Sent from my phone
On Jun 16, 2016 3:25 AM, "Thrinadh Kuppili"  wrote:

> Thanks Eric,
> Can you please elaborate "I'd also suggest that your response times are
> tunable" ??
>
> I have set the start to 0 and rows to 1000 initially and entered a search
> field as below
>
> CompanyName : private limited
>
> When I click the search button in the UI, it is not responsive at all.
>
> When I remove the start and rows values, it works fine, returning 65k
> records, of which 10 are displayed in the Solr UI.
>
>
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Limit-Solr-to-search-for-only-500-records-based-on-the-search-criteria-tp4282519p4282559.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: [SolrCloud] shard hash ranges changed after restoring backup

2016-06-16 Thread Erick Erickson
In essence, no. The data is, at best, in the wrong shard and at worst
nowhere.
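
For context on the znode-editing approach discussed in the quoted replies: the shard hash ranges live in the collection state stored in ZooKeeper (clusterstate.json on this version). An abridged, hypothetical fragment — the "range" values are what a restore can get wrong:

```json
{
  "mycollection": {
    "shards": {
      "shard3": { "range": "d5550000-fffeffff", "state": "active" },
      "shard4": { "range": "ffff0000-2aa9ffff", "state": "active" }
    }
  }
}
```

The collection name and exact range values here are illustrative only; the real values must be taken from the pre-backup cluster state.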

Sent from my phone
On Jun 16, 2016 8:26 AM, "Gary Yao"  wrote:

> Hi Erick,
>
> I should add that our Solr cluster is in production and new documents
> are constantly indexed. The new cluster has been up for three weeks now.
> The problem was discovered only now because in our use case Atomic
> Updates and RealTime Gets are mostly performed on new documents. With
> almost absolute certainty there are already documents in the index that
> were distributed to the shards according to the new hash ranges. If we
> just changed the hash ranges in ZooKeeper, the index would still be in
> an inconsistent state.
>
> Is there any way to recover from this without having to re-index all
> documents?
>
> Best,
> Gary
>
> 2016-06-15 19:23 GMT+02:00 Erick Erickson :
> > Simplest, though a bit risky is to manually edit the znode and
> > correct the znode entry. There are various tools out there, including
> > one that ships with Zookeeper (see the ZK documentation).
> >
> > Or you can use the zkcli scripts (the Zookeeper ones) to get the znode
> > down to your local machine, edit it there and then push it back up to ZK.
> >
> > I'd do all this with my Solr nodes shut down, then ensure that my ZK
> > ensemble was consistent after the update, etc.
> >
> > Best,
> > Erick
> >
> > On Wed, Jun 15, 2016 at 8:36 AM, Gary Yao  wrote:
> >> Hi all,
> >>
> >> My team at work maintains a SolrCloud 5.3.2 cluster with multiple
> >> collections configured with sharding and replication.
> >>
> >> We recently backed up our Solr indexes using the built-in backup
> >> functionality. After the cluster was restored from the backup, we
> >> noticed that atomic updates of documents are failing occasionally with
> >> the error message 'missing required field [...]'. The exceptions are
> >> thrown on a host on which the document to be updated is not stored. From
> >> this we are deducing that there is a problem with finding the right host
> >> by the hash of the uniqueKey. Indeed, our investigations so far showed
> >> that for at least one collection in the new cluster, the shards have
> >> different hash ranges assigned now. We checked the hash ranges by
> >> querying /admin/collections?action=CLUSTERSTATUS. Find below the shard
> >> hash ranges of one collection that we debugged.
> >>
> >>   Old cluster:
> >> shard1_0 8000 - aaa9
> >> shard1_1 aaaa - d554
> >> shard2_0 d555 - fffe
> >> shard2_1 ffff - 2aa9
> >> shard3_0 2aaa - 5554
> >> shard3_1 5555 - 7fff
> >>
> >>   New cluster:
> >> shard1 8000 - aaa9
> >> shard2 aaaa - d554
> >> shard3 d555 - ffff
> >> shard4 0 - 2aa9
> >> shard5 2aaa - 5554
> >> shard6 5555 - 7fff
> >>
> >>   Note that the shard names differ because the old cluster's shards were
> >>   split.
> >>
> >> As you can see, the ranges of shard3 and shard4 differ from the old
> >> cluster. This change of hash ranges matches with the symptoms we are
> >> currently experiencing.
> >>
> >> We found this JIRA ticket
> https://issues.apache.org/jira/browse/SOLR-5750
> >> in which David Smiley comments:
> >>
> >>   shard hash ranges aren't restored; this error could be disasterous
> >>
> >> It seems that this is what happened to us. We would like to hear some
> >> suggestions on how we could recover from this problem.
> >>
> >> Best,
> >> Gary
>


Re: Long STW GCs with Solr Cloud

2016-06-16 Thread Cas Rusnov
Hey thanks for your reply.

Running the suggested CMS config from Shawn, we're getting some nodes with
30+ second pauses, I gather due to the large heap. Interestingly enough,
the scenario Jeff talked about is remarkably similar to ours (we use field
collapsing), including the performance aspects of it: we are getting
concurrent mode failures both due to new-space allocation failures and due
to promotion failures. I suspect there's a lot of garbage building up.
We're going to run tests with field collapsing disabled and see if that
makes a difference.

Cas


On Thu, Jun 16, 2016 at 1:08 PM, Jeff Wartes  wrote:

> Check your gc log for CMS “concurrent mode failure” messages.
>
> If a concurrent CMS collection fails, it does a stop-the-world pause while
> it cleans up using a *single thread*. This means the stop-the-world CMS
> collection in the failure case is typically several times slower than a
> concurrent CMS collection. The single-thread business means it will also be
> several times slower than the Parallel collector, which is probably what
> you’re seeing. I understand that it needs to stop the world in this case,
> but I really wish the CMS failure would fall back to a Parallel collector
> run instead.
> The Parallel collector is always going to be the fastest at getting rid of
> garbage, but only because it stops all the application threads while it
> runs, so it’s got less complexity to deal with. That said, it’s probably
> not going to be orders of magnitude faster than a (successfully) concurrent
> CMS collection.
>
> Regardless, the bigger the heap, the bigger the pause.
>
> If your application is generating a lot of garbage, or can generate a lot
> of garbage very suddenly, CMS concurrent mode failures are more likely. You
> can turn down the  -XX:CMSInitiatingOccupancyFraction value in order to
> give the CMS collection more of a head start at the cost of more frequent
> collections. If that doesn’t work, you can try using a bigger heap, but you
> may eventually find yourself trying to figure out what about your query
> load generates so much garbage (or causes garbage spikes) and trying to
> address that. Even G1 won’t protect you from highly unpredictable garbage
> generation rates.
>
> In my case, for example, I found that a very small subset of my queries
> were using the CollapseQParserPlugin, which requires quite a lot of memory
> allocations, especially on a large index. Although generally this was fine,
> if I got several of these rare queries in a very short window, it would
> always spike enough garbage to cause CMS concurrent mode failures. The
> single-threaded concurrent-mode failure would then take long enough that
> the ZK heartbeat would fail, and things would just go downhill from there.
>
>
>
> On 6/15/16, 3:57 PM, "Cas Rusnov"  wrote:
>
> >Hey Shawn! Thanks for replying.
> >
> >Yes I meant HugePages not HugeTable, brain fart. I will give the
> >transparent off option a go.
> >
> >I have attempted to use your CMS configs as is and also the default
> >settings and the cluster dies under our load (basically a node will get a
> >35-60s GC STW and then the others in the shard will take the load, and
> they
> >will in turn get long STWs until the shard dies), which is why basically
> in
> >a fit of desperation I tried out ParallelGC and found it to be half-way
> >acceptable. I will run a test using your configs (and the defaults) again
> >just to be sure (since I'm certain the machine config has changed since we
> >used your unaltered settings).
> >
> >Thanks!
> >Cas
> >
> >
> >On Wed, Jun 15, 2016 at 3:41 PM, Shawn Heisey 
> wrote:
> >
> >> On 6/15/2016 3:05 PM, Cas Rusnov wrote:
> >> > After trying many of the off the shelf configurations (including CMS
> >> > configurations but excluding G1GC, which we're still taking the
> >> > warnings about seriously), numerous tweaks, rumors, various instance
> >> > sizes, and all the rest, most of which regardless of heap size and
> >> > newspace size resulted in frequent 30+ second STW GCs, we settled on
> >> > the following configuration which leads to occasional high GCs but
> >> > mostly stays between 10-20 second STWs every few minutes (which is
> >> > almost acceptable): -XX:+AggressiveOpts -XX:+UnlockDiagnosticVMOptions
> >> > -XX:+UseAdaptiveSizePolicy -XX:+UseLargePages -XX:+UseParallelGC
> >> > -XX:+UseParallelOldGC -XX:MaxGCPauseMillis=15000 -XX:MaxNewSize=12000m
> >> > -XX:ParGCCardsPerStrideChunk=4096 -XX:ParallelGCThreads=16 -Xms31000m
> >> > -Xmx31000m
> >>
> >> You mentioned something called "HugeTable" ... I assume you're talking
> >> about huge pages.  If that's what you're talking about, have you also
> >> turned off transparent huge pages?  If you haven't, you might want to
> >> completely disable huge pages in your OS.  There's evidence that the
> >> transparent option can affect performance.
> >>
> >> I assume you've probably looked at my GC info at the following URL:
> >>
> >> http://wiki.apache.org/solr/ShawnHeise

Re: [E] Re: Stemming

2016-06-16 Thread Aurélien MAZOYER

No problem :-)

Aurélien

Le 16/06/2016 22:36, Jamal, Sarfaraz a écrit :

Oh, is this what you meant?

   
 
   content_stemming
   
 
   

I changed it to content_stemming and now it seems to work :) - It was _text_ 
before -

Thanks! I will update if I discover anything amiss

Thanks again so much =)

Sas

-Original Message-
From: Aurélien MAZOYER [mailto:aurelien.mazo...@francelabs.com]
Sent: Thursday, June 16, 2016 4:36 PM
To: solr-user@lucene.apache.org
Subject: Re: [E] Re: Stemming

Hi,

I was just wondering if you are sure that you query only that field (or fields 
that use your text_stem analyzer) and not other fields (in your qf, for example, 
if you use edismax), which can give you incorrect results.

Regards,

Aurélien

Le 16/06/2016 22:29, Jamal, Sarfaraz a écrit :

Hello =)

Just to be safe and make sure it's happening at indexing time AS WELL
as QUERYING time -

I modified it to be like so:



  
  
  
  
  
  


  
  
  
  
  
  
   


I am re-indexing the files
And what do you mean about only querying one field? I am not entirely sure I 
understand..

Sas

-Original Message-
From: Aurélien MAZOYER [mailto:aurelien.mazo...@francelabs.com]
Sent: Thursday, June 16, 2016 4:20 PM
To: solr-user@lucene.apache.org
Subject: [E] Re: Stemming

Hi,

Yes you should have the same resultset.

Are you sure that you reindex all the data after changing your schema?
Are you sure that you put your analyzer both at indexing and querying?
Are you sure you query only one field?

Regards,

Aurélien

Le 16/06/2016 21:13, Jamal, Sarfaraz a écrit :

Hi Guys,

I have enabled stemming:
 




 

In the Admin Analysis, I type in running or runs and they both break down to 
run.
However when I search for run, runs, or running with an actual query
-

It brings back three different sets of results.

Is that correct?

I would imagine that all three would bring back the exact same resultset?

Sas





RE: [E] Re: Stemming

2016-06-16 Thread Jamal, Sarfaraz
Oh, is this what you meant?

  

  content_stemming
  

  

I changed it to content_stemming and now it seems to work :) - It was _text_ 
before -
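
For readers following along: the change described here amounts to pointing the request handler's default search field (df) at the stemmed field. In solrconfig.xml that looks roughly like the sketch below (the handler name and surrounding defaults are assumptions, since the original snippet was stripped from the archive):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Default search field: was _text_, now the stemmed copy -->
    <str name="df">content_stemming</str>
  </lst>
</requestHandler>
```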

Thanks! I will update if I discover anything amiss

Thanks again so much =)

Sas

-Original Message-
From: Aurélien MAZOYER [mailto:aurelien.mazo...@francelabs.com] 
Sent: Thursday, June 16, 2016 4:36 PM
To: solr-user@lucene.apache.org
Subject: Re: [E] Re: Stemming

Hi,

I was just wondering if you are sure that you query only that field (or fields 
that use your text_stem analyzer) and not other fields (in your qf, for example, 
if you use edismax), which can give you incorrect results.

Regards,

Aurélien

Le 16/06/2016 22:29, Jamal, Sarfaraz a écrit :
> Hello =)
>
> Just to be safe and make sure it's happening at indexing time AS WELL 
> as QUERYING time -
>
> I modified it to be like so:
>
>
>   
> 
>  words="lang/stopwords_en.txt" ignoreCase="true"/>
> 
> 
>  protected="protwords.txt"/>
> 
>   
>   
> 
>  words="lang/stopwords_en.txt" ignoreCase="true"/>
> 
> 
>  protected="protwords.txt"/>
> 
>
>
>
> I am re-indexing the files
> And what do you mean about only querying one field? I am not entirely sure I 
> understand..
>
> Sas
>
> -Original Message-
> From: Aurélien MAZOYER [mailto:aurelien.mazo...@francelabs.com]
> Sent: Thursday, June 16, 2016 4:20 PM
> To: solr-user@lucene.apache.org
> Subject: [E] Re: Stemming
>
> Hi,
>
> Yes you should have the same resultset.
>
> Are you sure that you reindex all the data after changing your schema?
> Are you sure that you put your analyzer both at indexing and querying?
> Are you sure you query only one field?
>
> Regards,
>
> Aurélien
>
> Le 16/06/2016 21:13, Jamal, Sarfaraz a écrit :
>> Hi Guys,
>>
>> I have enabled stemming:
>> 
>>  
>>  
>>  > language="English"/>
>>  
>> 
>>
>> In the Admin Analysis, I type in running or runs and they both break down to 
>> run.
>> However when I search for run, runs, or running with an actual query 
>> -
>>
>> It brings back three different sets of results.
>>
>> Is that correct?
>>
>> I would imagine that all three would bring back the exact same resultset?
>>
>> Sas
>>



RE: [E] Re: Stemming

2016-06-16 Thread Jamal, Sarfaraz
Hello =)

Just to be safe and make sure it's happening at indexing time AS WELL as 
QUERYING time -

I modified it to be like so:

  

  
  
  
  
  
  


  
  
  
  
  
  
 
  

I am re-indexing the files
And what do you mean about only querying one field? I am not entirely sure I 
understand..

Sas

-Original Message-
From: Aurélien MAZOYER [mailto:aurelien.mazo...@francelabs.com] 
Sent: Thursday, June 16, 2016 4:20 PM
To: solr-user@lucene.apache.org
Subject: [E] Re: Stemming

Hi,

Yes you should have the same resultset.

Are you sure that you reindex all the data after changing your schema?
Are you sure that you put your analyzer both at indexing and querying?
Are you sure you query only one field?

Regards,

Aurélien

Le 16/06/2016 21:13, Jamal, Sarfaraz a écrit :
> Hi Guys,
>
> I have enabled stemming:
>
>   
>   
>language="English"/>
>   
>
>
> In the Admin Analysis, I type in running or runs and they both break down to 
> run.
> However when I search for run, runs, or running with an actual query -
>
> It brings back three different sets of results.
>
> Is that correct?
>
> I would imagine that all three would bring back the exact same resultset?
>
> Sas
>



Re: [E] Re: Stemming

2016-06-16 Thread Aurélien MAZOYER

Hi,

I was just wondering if you are sure that you query only that field (or 
fields that use your text_stem analyzer) and not other fields (in your 
qf, for example, if you use edismax), which can give you incorrect results.


Regards,

Aurélien

Le 16/06/2016 22:29, Jamal, Sarfaraz a écrit :

Hello =)

Just to be safe and make sure it's happening at indexing time AS WELL as 
QUERYING time -

I modified it to be like so:

   

  
  
  
  
  
  


  
  
  
  
  
  
   
   

I am re-indexing the files
And what do you mean about only querying one field? I am not entirely sure I 
understand..

Sas

-Original Message-
From: Aurélien MAZOYER [mailto:aurelien.mazo...@francelabs.com]
Sent: Thursday, June 16, 2016 4:20 PM
To: solr-user@lucene.apache.org
Subject: [E] Re: Stemming

Hi,

Yes you should have the same resultset.

Are you sure that you reindex all the data after changing your schema?
Are you sure that you put your analyzer both at indexing and querying?
Are you sure you query only one field?

Regards,

Aurélien

Le 16/06/2016 21:13, Jamal, Sarfaraz a écrit :

Hi Guys,

I have enabled stemming:







In the Admin Analysis, I type in running or runs and they both break down to 
run.
However when I search for run, runs, or running with an actual query -

It brings back three different sets of results.

Is that correct?

I would imagine that all three would bring back the exact same resultset?

Sas





Re: Stemming

2016-06-16 Thread Aurélien MAZOYER

Hi,

Yes you should have the same resultset.

Are you sure that you reindexed all the data after changing your schema?
Are you sure that your analyzer is applied at both indexing and querying time?
Are you sure you query only one field?

Regards,

Aurélien

Le 16/06/2016 21:13, Jamal, Sarfaraz a écrit :

Hi Guys,

I have enabled stemming:
   




   

In the Admin Analysis, I type in running or runs and they both break down to 
run.
However when I search for run, runs, or running with an actual query -

It brings back three different sets of results.

Is that correct?

I would imagine that all three would bring back the exact same resultset?

Sas





Re: Long STW GCs with Solr Cloud

2016-06-16 Thread Jeff Wartes
Check your gc log for CMS “concurrent mode failure” messages. 

If a concurrent CMS collection fails, it does a stop-the-world pause while it 
cleans up using a *single thread*. This means the stop-the-world CMS collection 
in the failure case is typically several times slower than a concurrent CMS 
collection. The single-thread business means it will also be several times 
slower than the Parallel collector, which is probably what you’re seeing. I 
understand that it needs to stop the world in this case, but I really wish the 
CMS failure would fall back to a Parallel collector run instead.
The Parallel collector is always going to be the fastest at getting rid of 
garbage, but only because it stops all the application threads while it runs, 
so it’s got less complexity to deal with. That said, it’s probably not going to 
be orders of magnitude faster than a (successfully) concurrent CMS collection.

Regardless, the bigger the heap, the bigger the pause.

If your application is generating a lot of garbage, or can generate a lot of 
garbage very suddenly, CMS concurrent mode failures are more likely. You can 
turn down the  -XX:CMSInitiatingOccupancyFraction value in order to give the 
CMS collection more of a head start at the cost of more frequent collections. 
If that doesn’t work, you can try using a bigger heap, but you may eventually 
find yourself trying to figure out what about your query load generates so much 
garbage (or causes garbage spikes) and trying to address that. Even G1 won’t 
protect you from highly unpredictable garbage generation rates.
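
To make the knob mentioned above concrete, here is a hedged sketch of the relevant JVM flags; the threshold value is illustrative only, not a recommendation:

```
# Use CMS for the old generation, ParNew for the young generation:
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
# Honor the explicit threshold below instead of the adaptive heuristic:
-XX:+UseCMSInitiatingOccupancyOnly
# Begin concurrent collections once the old generation is 60% full;
# lower values give CMS more head start at the cost of more frequent runs:
-XX:CMSInitiatingOccupancyFraction=60
```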

In my case, for example, I found that a very small subset of my queries were 
using the CollapseQParserPlugin, which requires quite a lot of memory 
allocations, especially on a large index. Although generally this was fine, if 
I got several of these rare queries in a very short window, it would always 
spike enough garbage to cause CMS concurrent mode failures. The single-threaded 
concurrent-mode failure would then take long enough that the ZK heartbeat would 
fail, and things would just go downhill from there.



On 6/15/16, 3:57 PM, "Cas Rusnov"  wrote:

>Hey Shawn! Thanks for replying.
>
>Yes I meant HugePages not HugeTable, brain fart. I will give the
>transparent off option a go.
>
>I have attempted to use your CMS configs as is and also the default
>settings and the cluster dies under our load (basically a node will get a
>35-60s GC STW and then the others in the shard will take the load, and they
>will in turn get long STWs until the shard dies), which is why basically in
>a fit of desperation I tried out ParallelGC and found it to be half-way
>acceptable. I will run a test using your configs (and the defaults) again
>just to be sure (since I'm certain the machine config has changed since we
>used your unaltered settings).
>
>Thanks!
>Cas
>
>
>On Wed, Jun 15, 2016 at 3:41 PM, Shawn Heisey  wrote:
>
>> On 6/15/2016 3:05 PM, Cas Rusnov wrote:
>> > After trying many of the off the shelf configurations (including CMS
>> > configurations but excluding G1GC, which we're still taking the
>> > warnings about seriously), numerous tweaks, rumors, various instance
>> > sizes, and all the rest, most of which regardless of heap size and
>> > newspace size resulted in frequent 30+ second STW GCs, we settled on
>> > the following configuration which leads to occasional high GCs but
>> > mostly stays between 10-20 second STWs every few minutes (which is
>> > almost acceptable): -XX:+AggressiveOpts -XX:+UnlockDiagnosticVMOptions
>> > -XX:+UseAdaptiveSizePolicy -XX:+UseLargePages -XX:+UseParallelGC
>> > -XX:+UseParallelOldGC -XX:MaxGCPauseMillis=15000 -XX:MaxNewSize=12000m
>> > -XX:ParGCCardsPerStrideChunk=4096 -XX:ParallelGCThreads=16 -Xms31000m
>> > -Xmx31000m
>>
>> You mentioned something called "HugeTable" ... I assume you're talking
>> about huge pages.  If that's what you're talking about, have you also
>> turned off transparent huge pages?  If you haven't, you might want to
>> completely disable huge pages in your OS.  There's evidence that the
>> transparent option can affect performance.
>>
>> I assume you've probably looked at my GC info at the following URL:
>>
>> http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
>>
>> The parallel collector is most definitely not a good choice.  It does
>> not optimize for latency.  It's my understanding that it actually
>> prefers full GCs, because it is optimized for throughput.  Solr thrives
>> on good latency; throughput doesn't matter very much.
>>
>> If you want to continue avoiding G1, you should definitely be using
>> CMS.  My recommendation right now would be to try the G1 settings on my
>> wiki page under the heading "Current experiments" or the CMS settings
>> just below that.
>>
>> The out-of-the-box GC tuning included with Solr 6 is probably a better
>> option than the parallel collector you've got configured now.
>>
>> Thanks,
>> Shawn
>>
>>
>
>
>-- 
>
>Cas Rusnov,
>
>Engineer

RE: [E] Re: Stemming

2016-06-16 Thread Jamal, Sarfaraz
HI Ahmet,

Thanks for your guidance.

I just tried the following two configurations:

  





  

And

  

  
  
  
  
  
  

  

They both produced three different sets of results

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: Thursday, June 16, 2016 3:37 PM
To: solr-user@lucene.apache.org
Subject: [E] Re: Stemming



Hi Jamal,

Snowball requires a lowercase filter before it in the analysis chain.
This is documented in the javadocs, but it is a small yet important detail.
Please use a lowercase filter after the whitespace tokenizer.


Ahmet
On Thursday, June 16, 2016 10:13 PM, "Jamal, Sarfaraz" 
 wrote:



Hi Guys,

I have enabled stemming:
  




  

In the Admin Analysis, I type in running or runs and they both break down to 
run.
However when I search for run, runs, or running with an actual query -

It brings back three different sets of results.

Is that correct?

I would imagine that all three would bring back the exact same resultset?

Sas 


Re: ConcurrentMergeScheduler options not exposed

2016-06-16 Thread Shawn Heisey
On 6/16/2016 2:35 AM, Michael McCandless wrote:
>
> Hmm, merging can't read at 800 MB/sec and only write at 20 MB/sec for
> very long ... unless there is a huge percentage of deletes. Also, by
> default CMS doesn't throttle forced merges (see
> CMS.get/setForceMergeMBPerSec). Maybe capture
> IndexWriter.setInfoStream output?

I can see the problem myself.  I have a RAID10 array with six SATA
disks.  When I click the Optimize button for a core that's several
gigabytes, iotop shows me reads happening at about 100MB/s for several
seconds, then writes clocking no more than 25 MB/s, and usually a lot
less.  The last several gigabytes that were written were happening at
less than 5 MB/s.  This is VERY slow, and does affect my nightly
indexing processes.

Asking the shell to copy a 5GB file revealed sustained write rates of
over 500MB/s, so the hardware can definitely go faster.

I patched in an option for solrconfig.xml where I could force it to call
disableAutoIOThrottle().  I included logging in my patch to make
absolutely sure that the new code was used.  This option made no
difference in the write speed.  I also enabled infoStream, but either I
configured it wrong or I do not know where to look for the messages.  I
was modifying and compiling branch_5_5.
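
For what it's worth, infoStream is normally enabled in solrconfig.xml under indexConfig; a minimal fragment is below (on 5.x the IndexWriter debug output should land in the Solr log, which may be what went wrong here):

```xml
<indexConfig>
  <!-- Emit Lucene's low-level IndexWriter debug output to the log -->
  <infoStream>true</infoStream>
</indexConfig>
```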

This is the patch that I applied:

http://apaste.info/wKG

I did see the expected log entries in solr.log when I restarted with the
patch and the new option in solrconfig.xml.

What else can I look at?

Thanks,
Shawn



Re: Stemming

2016-06-16 Thread Ahmet Arslan


Hi Jamal,

Snowball requires a lowercase filter before it in the analysis chain.
This is documented in the javadocs, but it is a small yet important detail.
Please use a lowercase filter after the whitespace tokenizer.


Ahmet
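
A minimal field type following this advice might look like the sketch below (the field type name is made up; the original schema snippets were stripped from the archive — the point is LowerCaseFilterFactory coming before the stemmer):

```xml
<fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase BEFORE stemming: Snowball expects lowercased input -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```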
On Thursday, June 16, 2016 10:13 PM, "Jamal, Sarfaraz" 
 wrote:



Hi Guys,

I have enabled stemming:
  




  

In the Admin Analysis, I type in running or runs and they both break down to 
run.
However when I search for run, runs, or running with an actual query -

It brings back three different sets of results.

Is that correct?

I would imagine that all three would bring back the exact same resultset?

Sas 


tlogs not deleting as usual in Solr 5.5.1?

2016-06-16 Thread Chris Morley
The repetition below is on purpose, to show the contrast between Solr
versions.
  
 In Solr 4.10.3, we have autocommits disabled.  We do a dataimport of a few 
hundred thousand records and have a tlog that grows to ~1.2G.
  
 In Solr 5.5.1, we have autocommits disabled.  We do a dataimport of a few 
hundred thousand records and have a tlog that grows to ~1.6G (same exact 
data, slightly larger tlog, but who knows, that's fine).
  
 In Solr 4.10.3 tlogs ARE deleted after issuing update?commit=true.  
(And deleted immediately.)
  
 In Solr 5.5.1  tlogs ARE NOT deleted after issuing update?commit=true.
  
 We want the tlog to delete like it did in Solr 4.10.3.  Perhaps there is a 
configuration setting or feature of Solr 5.5.1 that causes this?
  
 Would appreciate any tips on configuration or code we could change to 
ensure the tlog will delete after a hard commit.
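
One workaround worth testing, if the explicit-commit behavior can't be restored: a hard autoCommit with openSearcher=false caps tlog growth without changing search visibility. A sketch for solrconfig.xml (the interval is illustrative):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog/>
  <!-- Hard commit periodically to truncate the tlog; openSearcher=false
       means no new searcher is opened, so visibility is unaffected. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```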
  

  
  



Stemming

2016-06-16 Thread Jamal, Sarfaraz
Hi Guys,

I have enabled stemming:
  




  

In the Admin Analysis, I type in running or runs and they both break down to 
run.
However when I search for run, runs, or running with an actual query -

It brings back three different sets of results.

Is that correct?

I would imagine that all three would bring back the exact same resultset?

Sas 



Re: Solr 6.1.x Release Date ??

2016-06-16 Thread Steve Rowe
Tomorrow-ish.

--
Steve
www.lucidworks.com

> On Jun 16, 2016, at 4:14 AM, Ramesh shankar  wrote:
> 
> Hi,
> 
> Yes, I used the solr-6.1.0-79 nightly builds and the [subquery] transformer
> is working fine. Any idea of the expected release date for 6.1?
> 
> Regards
> Ramesh
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-6-1-x-Release-Date-tp4280945p4282562.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Core Admin API: Create Solr core if it does not exist

2016-06-16 Thread Andreas Hubold

Hi,

we're still using Solr 4.10.4 without SolrCloud and create cores 
dynamically using the Core Admin API.


We have multiple applications that access a core and create it if it 
doesn't exist. To this end, we use the STATUS action to see if a 
required core exists and if it doesn't, create it with the CREATE 
action. In general, this works very well.


However we have a problem if two applications try to create the same 
core concurrenty. With unlucky timing, both see a non-existing core in 
the STATUS response and then try to CREATE it. One CREATE succeeds, the 
other fails with 400 Bad Request. We can handle that properly in our 
application.


But Solr logs ugly errors for the failed CREATE call. See below.
I'd guess, I can ignore those errors, including the "POSSIBLE RESOURCE 
LEAK"?


Is there a better way to programmatically create a core if it doesn't 
exist, maybe with just one admin API request (CREATE with a parameter 
such as ignoreIfExists=true)?
Or can we change the logging somehow (e.g. to WARN) so that 
administrators don't get alarmed?


Thank you,
Andreas

Example requests, first succeeds, second fails:

[16/Jun/2016:02:48:08 +0200] "GET 
/solr/admin/cores?action=CREATE&name=blueprint_helios_comments&instanceDir=cores%2Fblueprint_helios_comments&dataDir=data&configSet=elastic&wt=javabin&version=2 
HTTP/1.1" 200 73 414ms
[16/Jun/2016:02:48:09 +0200] "GET 
/solr/admin/cores?action=CREATE&name=blueprint_helios_comments&instanceDir=cores%2Fblueprint_helios_comments&dataDir=data&configSet=elastic&wt=javabin&version=2 
HTTP/1.1" 400 314 1325ms


From the log:

2016-06-16 02:48:09.156 [ERROR] org.apache.solr.core.CoreContainer - 
Error creating core [blueprint_helios_comments]: Error opening new searcher

org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:881)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:654)
at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at 
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at 
org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:683)
at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
at 
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
at 
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
at 
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at 
org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)

at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at 
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1574)

at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1686)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:853)
... 27 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock 
obtain timed out: 
NativeFSLock@/opt/coremedia/cm7-solr-master-tomcat/solr-home/cores/blueprint_helios_comments/data/index/write.lock

at org.apache.lucene.store.Lock.obtain(Lock.java:89)
at org.apache.lucene.index.IndexWriter.(IndexWriter.ja

Re: Boosting exact match fields.

2016-06-16 Thread elisabeth benoit
In addition to what was proposed

We use the technic described here

https://github.com/cominvent/exactmatch

and it works quite well.

Best regards
Elisabeth

2016-06-15 16:32 GMT+02:00 Alessandro Benedetti :

> In addition to what Erick correctly proposed,
> are you storing norms for your field of interest (to boost documents with
> shorter field values)?
> If you are, I find it suspicious that "Sony Ear Phones" wins over "Ear
> Phones" for your "Ear Phones" query.
> What other factors are currently involved in your relevancy score
> calculation?
>
> Cheers
>
> On Tue, Jun 14, 2016 at 4:48 PM, Erick Erickson 
> wrote:
>
> > If these are the complete field, i.e. your document
> > contains exactly "ear phones" and not "ear phones
> > are great" use a copyField to put it into an "exact_match"
> > field that uses a much simpler analysis chain based
> > on KeywordTokenizer (plus, perhaps, things like
> > LowerCaseFilter, and maybe stripping punctuation and the like).
> > Then you add a clause on exact_match boosted
> > really high.
> >
> > Best,
> > Erick
> >
> > On Tue, Jun 14, 2016 at 1:01 AM, Naveen Pajjuri
> >  wrote:
> > > Hi,
> > >
> > > I have documents with a field (the data type definition for that field
> > > is below) whose values include "ear phones", "sony ear phones", and
> > > "philips ear phones". When I query for "earphones", "sony ear phones"
> > > is the top result, whereas I want "ear phones" as the top result.
> > > Please suggest how to boost exact matches. PS: I have
> > > earphones => ear phones in my synonyms.txt, and the data type
> > > definition for
> > > that field keywords is
> > >
> > >   <fieldType ... positionIncrementGap="100">
> > >     <analyzer type="index">
> > >       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> > >       <filter class="solr.LowerCaseFilterFactory"/>
> > >       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > >       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >     </analyzer>
> > >     <analyzer type="query">
> > >       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> > >       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > >       <filter class="solr.LowerCaseFilterFactory"/>
> > >       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >     </analyzer>
> > >   </fieldType>
> > REGARDS,
> > > Naveen
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>
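Erick's copyField recipe can be sketched outside Solr: reduce the whole field value to the single token that KeywordTokenizer plus a lowercase filter would emit, then boost a match on that field in the query. The field names and boost factor below are illustrative, not from the thread:

```python
def keyword_normalize(text):
    """Mimic KeywordTokenizer + LowerCaseFilter: the whole value becomes
    one lowercased token (no word splitting), with outer whitespace stripped."""
    return text.strip().lower()

def build_query(user_query, exact_field="exact_match", text_field="keywords", boost=100):
    """Build a Lucene-style query string that boosts exact whole-field matches
    heavily while still matching the ordinary analyzed field."""
    exact = keyword_normalize(user_query)
    return '{f}:"{q}"^{b} OR {t}:({q})'.format(f=exact_field, q=exact, b=boost, t=text_field)

# build_query("Ear Phones")
# -> 'exact_match:"ear phones"^100 OR keywords:(ear phones)'
```

With this shape, a document whose exact_match field holds exactly "ear phones" gets the ^100 boost, while "sony ear phones" only matches the unboosted clause.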


Strange highlighting on search

2016-06-16 Thread Tom Evans
Hi all

I'm investigating a bug whereby every term in the highlighted field
gets marked for highlighting instead of just the words that match the
fulltext portion of the query. This is on Solr 5.5.0, but I didn't see
any bug fixes related to highlighting in 5.5.1 or 6.0 release notes.

The query that triggers it has a NOT clause on a specific field (not the
fulltext field) and also restricts to documents where that field has a
value:

q: cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *]
AND -ingredient_tag_id:(35223)

This returns the correct results, but the highlighting has matched
every word in the results (see below for debugQuery output). If I
change the query to put the exclusion in to an fq, the highlighting is
correct again (and the results are correct):

q: cosmetics_packaging_fulltext:(Mist)
fq: {!cache=false} ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)

Is there any way I can make the query and highlighting work as
expected as part of q?

Is there any downside to putting the exclusion part in the fq in terms
of performance? We don't use score at all for our results; we always
order by other parameters.

Cheers

Tom

Query with strange highlighting:

{
  "responseHeader":{
"status":0,
"QTime":314,
"params":{
  "q":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
  "hl":"true",
  "hl.simple.post":"</em>",
  "indent":"true",
  "fl":"id,product",
  "hl.fragsize":"0",
  "hl.fl":"product",
  "rows":"5",
  "wt":"json",
  "debugQuery":"true",
  "hl.simple.pre":"<em>"}},
  "response":{"numFound":10132,"start":0,"docs":[
  {
"id":"2403841-1498608",
"product":"Mist"},
  {
"id":"2410603-1502577",
"product":"Mist"},
  {
"id":"5988531-3882415",
"product":"Ao + Mist"},
  {
"id":"6020805-3904203",
"product":"UV Mist Cushion SPF 50+ PA+++"},
  {
"id":"2617977-1629335",
"product":"Ultra Radiance Facial Re-Hydrating Mist"}]
  },
  "highlighting":{
    "2403841-1498608":{
      "product":["<em>Mist</em>"]},
    "2410603-1502577":{
      "product":["<em>Mist</em>"]},
    "5988531-3882415":{
      "product":["<em>Ao</em> + <em>Mist</em>"]},
    "6020805-3904203":{
      "product":["<em>UV</em> <em>Mist</em> <em>Cushion</em> <em>SPF</em> <em>50+</em> <em>PA+++</em>"]},
    "2617977-1629335":{
      "product":["<em>Ultra</em> <em>Radiance</em> <em>Facial</em> <em>Re-Hydrating</em> <em>Mist</em>"]}},
  "debug":{
"rawquerystring":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
"querystring":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
"parsedquery":"+cosmetics_packaging_fulltext:mist
+ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
"parsedquery_toString":"+cosmetics_packaging_fulltext:mist
+ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
"explain":{
  "2403841-1498608":"\n40.082462 = sum of:\n  39.92971 =
weight(cosmetics_packaging_fulltext:mist in 13983)
[ClassicSimilarity], result of:\n39.92971 =
score(doc=13983,freq=39.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  40.40386 =
fieldWeight in 13983, product of:\n6.244998 = tf(freq=39.0),
with freq of:\n  39.0 = termFreq=39.0\n6.469795 =
idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=13983)\n  0.15275055 = ingredient_tag_id:[0 TO *],
product of:\n1.0 = boost\n0.15275055 = queryNorm\n",
  "2410603-1502577":"\n40.082462 = sum of:\n  39.92971 =
weight(cosmetics_packaging_fulltext:mist in 14023)
[ClassicSimilarity], result of:\n39.92971 =
score(doc=14023,freq=39.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  40.40386 =
fieldWeight in 14023, product of:\n6.244998 = tf(freq=39.0),
with freq of:\n  39.0 = termFreq=39.0\n6.469795 =
idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=14023)\n  0.15275055 = ingredient_tag_id:[0 TO *],
product of:\n1.0 = boost\n0.15275055 = queryNorm\n",
  "5988531-3882415":"\n37.435104 = sum of:\n  37.282352 =
weight(cosmetics_packaging_fulltext:mist in 1062788)
[ClassicSimilarity], result of:\n37.282352 =
score(doc=1062788,freq=34.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  37.725063 =
fieldWeight in 1062788, product of:\n5.8309517 =
tf(freq=34.0), with freq of:\n  34.0 = termFreq=34.0\n
6.469795 = idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=1062788)\n  0.15275055 = ingredient_tag_id:[0 TO *],
product of:\n1.0 = boost\n0.15275055 = queryNorm\n",
  "6020805-3904203":"\n30.816679 = sum of:\n  30.663929 =
weight(cosmeti
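Tom's workaround, moving the restriction clauses out of `q` into a filter query, can be sketched as plain request-parameter construction. Besides sidestepping the highlighting problem described here, `fq` clauses do not contribute to scoring and are cached independently, which suits a use case that never orders by score. The parameter values are taken from the thread:

```python
from urllib.parse import urlencode

def build_params(term):
    """Keep only the scored fulltext clause in q; push the field
    restrictions into fq so highlighting sees just the fulltext match."""
    return {
        "q": "cosmetics_packaging_fulltext:(%s)" % term,
        "fq": "ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
        "hl": "true",
        "hl.fl": "product",
    }

params = build_params("Mist")
query_string = urlencode(params)  # ready to append to /select?
```

A filter query is evaluated as a cached, unscored docset intersected with the main query results, so for repeated restrictions it is usually as fast or faster than folding the clauses into `q`.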

Re: [SolrCloud] shard hash ranges changed after restoring backup

2016-06-16 Thread Gary Yao
Hi Erick,

I should add that our Solr cluster is in production and new documents
are constantly indexed. The new cluster has been up for three weeks now.
The problem was discovered only now because in our use case Atomic
Updates and RealTime Gets are mostly performed on new documents. With
almost absolute certainty there are already documents in the index that
were distributed to the shards according to the new hash ranges. If we
just changed the hash ranges in ZooKeeper, the index would still be in
an inconsistent state.

Is there any way to recover from this without having to re-index all
documents?

Best,
Gary

2016-06-15 19:23 GMT+02:00 Erick Erickson :
> Simplest, though a bit risky is to manually edit the znode and
> correct the znode entry. There are various tools out there, including
> one that ships with Zookeeper (see the ZK documentation).
>
> Or you can use the zkcli scripts (the Zookeeper ones) to get the znode
> down to your local machine, edit it there and then push it back up to ZK.
>
> I'd do all this with my Solr nodes shut down, then ensure that my ZK
> ensemble was consistent after the update, etc.
>
> Best,
> Erick
>
> On Wed, Jun 15, 2016 at 8:36 AM, Gary Yao  wrote:
>> Hi all,
>>
>> My team at work maintains a SolrCloud 5.3.2 cluster with multiple
>> collections configured with sharding and replication.
>>
>> We recently backed up our Solr indexes using the built-in backup
>> functionality. After the cluster was restored from the backup, we
>> noticed that atomic updates of documents are failing occasionally with
>> the error message 'missing required field [...]'. The exceptions are
>> thrown on a host on which the document to be updated is not stored. From
>> this we are deducing that there is a problem with finding the right host
>> by the hash of the uniqueKey. Indeed, our investigations so far showed
>> that for at least one collection in the new cluster, the shards have
>> different hash ranges assigned now. We checked the hash ranges by
>> querying /admin/collections?action=CLUSTERSTATUS. Find below the shard
>> hash ranges of one collection that we debugged.
>>
>>   Old cluster:
>> shard1_0 8000 - aaa9
>> shard1_1  - d554
>> shard2_0 d555 - fffe
>> shard2_1  - 2aa9
>> shard3_0 2aaa - 5554
>> shard3_1  - 7fff
>>
>>   New cluster:
>> shard1 8000 - aaa9
>> shard2  - d554
>> shard3 d555 - 
>> shard4 0 - 2aa9
>> shard5 2aaa - 5554
>> shard6  - 7fff
>>
>>   Note that the shard names differ because the old cluster's shards were
>>   split.
>>
>> As you can see, the ranges of shard3 and shard4 differ from the old
>> cluster. This change of hash ranges matches with the symptoms we are
>> currently experiencing.
>>
>> We found this JIRA ticket https://issues.apache.org/jira/browse/SOLR-5750
>> in which David Smiley comments:
>>
>>   shard hash ranges aren't restored; this error could be disasterous
>>
>> It seems that this is what happened to us. We would like to hear some
>> suggestions on how we could recover from this problem.
>>
>> Best,
>> Gary
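Some background on why the changed ranges cause these "missing required field" errors: SolrCloud's compositeId router hashes each document's uniqueKey with MurmurHash3 (x86, 32-bit) and routes the document to the shard whose range contains that hash, so once the stored ranges disagree with the ranges used at index time, updates and lookups land on hosts that do not hold the document. A self-contained sketch of the range lookup; Solr's exact handling of the key bytes and of composite `a!b` keys differs in detail, so treat this as an illustration:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3, x86 32-bit variant (the hash family Solr's router uses)."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF
    h = seed
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotl 15
        k = (k * c2) & mask
        h ^= k
        h = ((h << 13) | (h >> 19)) & mask  # rotl 13
        h = (h * 5 + 0xE6546B64) & mask
    # tail: remaining 1-3 bytes
    k = 0
    tail = data[4 * n_blocks:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k
    # finalization mix
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h

def to_signed32(n: int) -> int:
    """Solr hash ranges are signed 32-bit ints (0x80000000 is the minimum)."""
    return n - (1 << 32) if n >= (1 << 31) else n

def find_shard(doc_id: str, ranges: dict) -> str:
    """ranges maps shard name -> (lo, hi), signed 32-bit inclusive bounds."""
    h = to_signed32(murmur3_32(doc_id.encode("utf-8")))
    for name, (lo, hi) in ranges.items():
        if lo <= h <= hi:
            return name
    raise ValueError("hash ranges do not cover %#x" % (h & 0xFFFFFFFF))
```

Because the document-to-hash mapping is fixed, the only consistent states are "ranges match where documents actually live" or a full re-route; this is why shifting the stored ranges in ZooKeeper after documents were indexed under the new layout leaves the index inconsistent either way.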


Re: solr spellcheck suggest correct word when FileBasedSpellChecker

2016-06-16 Thread Alessandro Benedetti
Taking a look into the code :
The spellcheck.alternativeTermCount Parameter

Specify the number of suggestions to return for each query term existing in
the index and/or dictionary. Presumably, users will want fewer suggestions
for words with docFrequency>0. Also setting this value turns "on"
context-sensitive spell suggestions.


This parameter, if set to 0, should basically avoid suggestions when the
term is spelled correctly.

The default value is 0 for that.

So it is weird that you get the suggestions; you could debug that bit and
see, but unfortunately I don't have time right now!


Cheers


On Sat, Jun 11, 2016 at 1:11 AM, khawar yunus  wrote:

> I am in the same boat as you. Did you figure out why it does that?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr-spellcheck-suggest-correct-word-when-FileBasedSpellChecker-tp4138769p4281821.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
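The behaviour Alessandro describes is driven entirely by request parameters; a minimal sketch of building a spellcheck request with `spellcheck.alternativeTermCount` left at its default of 0, so terms already in the index/dictionary should produce no suggestions (the query value is illustrative):

```python
from urllib.parse import urlencode

def spellcheck_params(query, alternative_term_count=0):
    """Build spellcheck request parameters. Setting alternativeTermCount > 0
    also turns on context-sensitive suggestions for terms with docFreq > 0."""
    return {
        "q": query,
        "spellcheck": "true",
        "spellcheck.alternativeTermCount": str(alternative_term_count),
    }

qs = urlencode(spellcheck_params("helo world"))
```

If correctly spelled terms still draw suggestions with the count at 0, that points at the dictionary implementation (here, FileBasedSpellChecker) rather than the request parameters.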


Re: ConcurrentMergeScheduler options not exposed

2016-06-16 Thread Michael McCandless
Hmm, merging can't read at 800 MB/sec and only write at 20 MB/sec for very
long ... unless there is a huge percentage of deletes.

Also, by default CMS doesn't throttle forced merges (see
CMS.get/setForceMergeMBPerSec).

Maybe capture IndexWriter.setInfoStream output?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jun 15, 2016 at 9:12 PM, Shawn Heisey  wrote:

> On the IRC channel, I ran into somebody who was having problems with
> optimizes on their Solr indexes taking a really long time.  When
> investigating, they found that during the optimize, *reads* were
> happening on their SSD disk at over 800MB/s, but *writes* were
> proceeding at only 20 MB/s.
>
> Looking into ConcurrentMergeScheduler, I discovered that it does indeed
> have a default write throttle of only 20 MB/s.  I saw code that would
> sometimes set the speed to unlimited, but had a hard time figuring out
> what circumstances will result in the different settings, so based on
> the user experience, I assume that the 20MB/s throttle must be applied
> for Solr optimizes.
>
> From what I can see in the code, there's currently no way in
> solrconfig.xml to configure scheduler options like the maximum write
> speed.  Before I open an issue to add additional configuration
> options for the merge scheduler, I thought it might be a good idea to
> just double-check with everyone here to see whether there's something I
> missed.
>
> This is likely even affecting people who are not using SSD storage.
> Most modern magnetic disks can easily exceed 20MB/s on both reads and
> writes.  Some RAID arrays can write REALLY fast.
>
> Thanks,
> Shawn
>
>
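The 20 MB/s figure is CMS's default write rate limit, and its effect on a large forced merge is easy to estimate with back-of-the-envelope arithmetic (no Lucene involved; 1 MB taken as 1024*1024 bytes):

```python
def merge_write_seconds(bytes_to_write, mb_per_sec):
    """Seconds spent writing a merged segment at a given MB/s cap,
    in the style of CMS's rate limit."""
    return bytes_to_write / (mb_per_sec * 1024 * 1024)

# A 40 GB optimized segment at the 20 MB/s default vs. an uncapped-ish 200 MB/s:
slow = merge_write_seconds(40 * 1024**3, 20)    # 2048 seconds, roughly 34 minutes
fast = merge_write_seconds(40 * 1024**3, 200)   # 204.8 seconds
```

That order-of-magnitude gap matches the user report above: an SSD reading at 800 MB/s but writing the merge at 20 MB/s spends almost all of its time waiting on the throttle.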


Re: Solr 6.1.x Release Date ??

2016-06-16 Thread Ramesh shankar
Hi,

Yes, I used the solr-6.1.0-79 nightly build, and the [subquery] transformer
is working fine in it. Any idea of the expected release date for 6.1?

Regards
Ramesh



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-6-1-x-Release-Date-tp4280945p4282562.html
Sent from the Solr - User mailing list archive at Nabble.com.


SOLR war for SOLR 6

2016-06-16 Thread Bharath Kumar
Hi,

I was trying to generate a Solr war out of the Solr 6 source, but even
after I created the war, I was not able to get it deployed correctly on
JBoss.

I wanted to know if anyone was able to successfully generate a Solr war and
deploy it on Tomcat or JBoss. I'd really appreciate your help on this.

-- 
Thanks & Regards,
Bharath MV Kumar

"Life is short, enjoy every moment of it"


Re: Limit Solr to search for only 500 records based on the search criteria

2016-06-16 Thread Thrinadh Kuppili
Thanks Erick,
Can you please elaborate on "I'd also suggest that your response times are
tunable"?

I set start to 0 and rows to 1000 initially, and entered a search
field as below:

CompanyName : private limited

When I clicked the search button in the UI, it was not responsive at all.

When I remove the start and rows values, it responds well, finding 65k
records, of which 10 are displayed in the Solr UI.







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Limit-Solr-to-search-for-only-500-records-based-on-the-search-criteria-tp4282519p4282559.html
Sent from the Solr - User mailing list archive at Nabble.com.
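For pulling large result sets like the 65k hits above, deep `start`/`rows` paging gets progressively more expensive; cursor-based paging keeps each page cheap. A sketch of the request construction, with illustrative field names; cursorMark requires a sort that is stable and includes the uniqueKey field:

```python
from urllib.parse import urlencode

def page_params(query, cursor_mark="*", rows=500):
    """cursorMark paging: start at '*', then pass back nextCursorMark from
    each response until it stops changing. The sort must include the
    uniqueKey field (id here) as a tie-breaker."""
    return {
        "q": query,
        "rows": str(rows),
        "sort": "score desc, id asc",
        "cursorMark": cursor_mark,
    }

first_page = urlencode(page_params("CompanyName:(private limited)"))
```

The fetch loop then repeats: issue the request, read `nextCursorMark` from the response, and stop when the returned mark equals the one you sent.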