RE: solrcloud 6.0.1 any suggestions for fixing a replica that stubbornly remains down

2016-08-25 Thread Jon Hawkesworth
Thanks for your suggestion.  Here's a chunk of info from the logging in the 
solr admin page below.  Is there somewhere else I should be looking too?

It looks to me like it's stuck in a never-ending loop of attempting recovery
that fails.

I don't know whether the warnings from IndexFetcher are relevant, and if they
are, what I can do about them.

Our system has been feeding 150k docs a day into this cluster for nearly two 
months now.  I have a backlog of approx 45 million more documents I need to get 
loaded, but until I have a healthy-looking cluster it would be foolish to start 
loading even more.

Jon


Time (Local) | Level | Core | Logger | Message

8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.nvm did not match. expected checksum is 1754812894 and actual is checksum 3450541029. expected length is 108 and actual length is 108
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fnm did not match. expected checksum is 2714900770 and actual is checksum 1393668596. expected length is 1265 and actual length is 1265
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.doc did not match. expected checksum is 1374818988 and actual is checksum 1039421217. expected length is 110 and actual length is 433
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.tim did not match. expected checksum is 1001343351 and actual is checksum 3395571641. expected length is 2025 and actual length is 7662
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.tip did not match. expected checksum is 814607015 and actual is checksum 1271109784. expected length is 301 and actual length is 421
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene54_0.dvd did not match. expected checksum is 875968405 and actual is checksum 4024097898. expected length is 96 and actual length is 144
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.si did not match. expected checksum is 2341973651 and actual is checksum 281320882. expected length is 535 and actual length is 535
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fdx did not match. expected checksum is 2874533507 and actual is checksum 3545673052. expected length is 84 and actual length is 84
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.nvd did not match. expected checksum is 663721296 and actual is checksum 1107475498. expected length is 59 and actual length is 68
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fdt did not match. expected checksum is 2953417110 and actual is checksum 471758721. expected length is 1109 and actual length is 7185
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File segments_h7g8 did not match. expected checksum is 2040860271 and actual is checksum 187396. expected length is 2056 and actual length is 1926
8/26/2016, 6:17:53 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.321 refcount=2} active=true starting pos=0
8/26/2016, 6:17:53 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
8/26/2016, 6:17:53 AM | ERROR | false | RecoveryStrategy | Could not publish as ACTIVE after succesful recovery
8/26/2016, 6:17:53 AM | ERROR | false | RecoveryStrategy | Recovery failed - trying again... (0)
8/26/2016, 6:18:13 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.322 refcount=2} active=true starting pos=0
8/26/2016, 6:18:13 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fdt did not match. expected checksum is 4059848174 and actual is checksum 4234063128. expected length is 3060 and actual length is 1772
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fdx did not match. expected checksum is 2421590578 and actual is checksum 1492609115. expected length is 84 and actual length is 84
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene54_0.dvd did not match. expected checksum is 2898024557 and actual is checksum 3762900089. expected length is 99 and actual length is 97
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.si did not match. expected checksum is 730964774 and actual is checksum 1292368805. expected length is 535 and actual length is 535
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.nvd did not match. expected checksum is 2920743481 and actual is checksum 2869652522. expected length is 59 and actual length is 59
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.nvm did not match. expected checksum is 328126313 and actual is checksum 1484623710. expected length is 108 and actual length is 108
8/26/2016, 6:22:12 AM | WARN | false |


High load, frequent updates, low latency requirement use case

2016-08-25 Thread Brent P
I'm trying to set up a Solr Cloud cluster to support a system with the
following characteristics:

- It will be writing documents at a rate of approximately 500 docs/second,
  and running search queries at about the same rate.
- The documents are fairly small, with about 10 fields, most of which range
  in size from a simple int to a string that holds a UUID. There's a date
  field, and then three text fields that typically hold in the range of 350
  to 500 chars.
- Documents should be available for searching within 30 seconds of being
  added.
- We need an average search latency of 50 ms or faster.

We've been using DataStax Enterprise (DSE) with decent results, but we're
trying to determine whether we can get more out of the latest version of
SolrCloud. We originally chose DSE ~4 years ago, *I believe*, because its
Cassandra-backed Solr provided redundancy/high-availability features that
weren't available with straight Solr at the time (I'm not even sure
SolrCloud existed then).

We have 24 fairly beefy servers (96 CPU cores, 256 GB RAM, SSDs) for the
task, and I'm trying to figure out the best way to distribute the documents
into collections, cores, and shards.

If I can categorize a document into one of 8 "types", should I create 8
collections? Is that going to provide better performance than putting them
all into one collection and then using a filter query with the type field
when doing a search?
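
For comparison, the one-collection variant would be queried roughly like
this (the collection and field names here are made up for illustration):

http://host:8983/solr/docs/select?q=some+text&fq=doc_type:8

fq results are cached in the filterCache, so repeated searches within one
type stay cheap.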

What are the options/things to consider when deciding on the number of
shards for each collection? As far as I know, I don't choose the number of
Solr cores; that is just determined based on the replication factor (and
shard count?).
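
For what it's worth, my understanding is that the total core count for a
collection is just numShards x replicationFactor -- e.g. a CREATE like the
following (names hypothetical) yields 8 x 3 = 24 cores across the cluster:

http://host:8983/solr/admin/collections?action=CREATE&name=docs&numShards=8&replicationFactor=3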

Some of the settings I'm using in my solrconfig that seem important:
<lockType>${solr.lock.type:native}</lockType>

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:3}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
</autoSoftCommit>

true
8

I've got the updateLog/transaction log enabled, as I think I read it's
required for Solr Cloud.
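
For reference, the stock sample solrconfig.xml defines it like this (and
yes, SolrCloud requires it):

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>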

Are there any settings I should look at that affect performance
significantly, especially outside of the solrconfig.xml for each collection
(like jetty configs, logging properties, etc)?

How much impact do the <lib> directives in the solrconfig have on
performance? Do they only take effect if I have something configured that
requires them, and therefore if I'm missing one that I need, I'd get an
error if it's not defined?
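
By way of example, this is the sort of <lib> directive the sample configs
ship with; as far as I understand it only matters when something in the
config actually loads classes from those jars -- a missing jar that is
needed shows up as a ClassNotFoundException rather than a performance cost:

<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />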

Any help will be greatly appreciated. Thanks!
-Brent


Re: Inventor-template vs Inventor template - issue with hyphen

2016-08-25 Thread shamik
Thanks Erick. I did look into the analyzer tool and debug query and posted
the results in my post. WDF is correctly stripping off the "-" from
Inventor-template; both terms are getting broken down to "inventor templat".
But I'm not sure why the query construct is different at query time. Here's
the parsed query:

*Inventor-template*


(+DisjunctionMaxQuery(((+CommandSrch:inventor +CommandSrch:templat) |
text:"inventor templat"^1.5 | Description:"inventor templat"^2.0 |
title:"inventor templat"^3.5 | keywords:"inventor templat"^1.2)~0.01)
Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147216960),date(PublishDate)))+1.0)))/no_coord



+((+CommandSrch:inventor +CommandSrch:templat) | text:"inventor templat"^1.5
| Description:"inventor templat"^2.0 | title:"inventor templat"^3.5 |
keywords:"inventor templat"^1.2)~0.01 Source2:sfdcarticles^9.0
Source2:downloads^5.0 
1.0/(3.16E-11*float(ms(const(147216960),date(PublishDate)))+1.0)


*Inventor template*


(+(+DisjunctionMaxQuery((CommandSrch:inventor | text:inventor^1.5 |
Description:inventor^2.0 | title:inventor^3.5 | keywords:inventor^1.2)~0.01)
+DisjunctionMaxQuery((CommandSrch:templat | text:templat^1.5 |
Description:templat^2.0 | title:templat^3.5 | keywords:templat^1.2)~0.01))
Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147216960),date(PublishDate)))+1.0)))/no_coord



+(+(CommandSrch:inventor | text:inventor^1.5 | Description:inventor^2.0 |
title:inventor^3.5 | keywords:inventor^1.2)~0.01 +(CommandSrch:templat |
text:templat^1.5 | Description:templat^2.0 | title:templat^3.5 |
keywords:templat^1.2)~0.01) Source2:sfdcarticles^9.0 Source2:downloads^5.0 
1.0/(3.16E-11*float(ms(const(147216960),date(PublishDate)))+1.0)


The part I'm confused about is why the two queries are being interpreted
differently?

Thanks,
Shamik



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Inventor-template-vs-Inventor-template-issue-with-hyphen-tp4293357p4293380.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Suggester no results

2016-08-25 Thread Scott Vanderbilt

Bradley:

You're a bloody genius!

That's exactly what I needed to make it work.

For the sake of the archives, after modifying the solrconfig.xml as 
indicated and rebuilding the suggester dictionary, the queries started 
to kick back results like crazy.


For what it's worth, I'm running Solr 6.1.0.

Thanks so much.


On 8/25/2016 6:48 PM, Bradley Belyeu wrote:

Scott, I’m fairly new to suggesters having just recently built my first one. 
But where my configuration differs from yours is on this line:

<str name="suggestAnalyzerFieldType">string</str>

I used the field type name that I had defined instead like:

<str name="suggestAnalyzerFieldType">textSuggest</str>

I’m not certain that would help, but I can’t see where your config is much 
different form mine elsewhere.
What version of Solr are you running?


On 8/25/16, 6:41 PM, "Scott Vanderbilt"  wrote:

I'm having the exact same problem the O.P. describes from his email back
in May, but my configuration does not have the same defect his had. So I
am at a loss to understand why my suggest queries are returning no results.

Here is my config:

Relevant bits from schema.xml:
--


...

...


...


   
   
   



Relevant bits from solrconfig.xml:
--

   
 mySuggester
 true
 10
   
   
 suggest
   

...

   
 mySuggester
 AnalyzingInfixLookupFactory
 suggester_infix_dir
 DocumentDictionaryFactory
 text_suggest
 string
 false
 false
   


I build the dictionary like this:

   http://example.com:8983/solr/rrib/suggest?suggest.build=true

and get back this response:


   
  0
  964
   
   build


I then attempt a query like this:

http://example.com:8983/solr/rrib/suggest?q=re

and get back this response:


   
  0
  37
   
   
  
 
0

 
  
   


In Solr's Admin interface, I see the following under OTHER->suggest on
Plugins/Stats page:

class:suggest
description:Suggester component
src:
version:6.1.0
stats:
mySuggester:   SolrSuggester [ name=mySuggester,
sourceLocation=null, storeDir=,
lookupImpl=AnalyzingInfixLookupFactory,
dictionaryImpl=DocumentDictionaryFactory, sizeInBytes=6795 ]
totalSizeInBytes:   6795

The value of 6,795 bytes seems pretty small to me for a repository of
403 XML files containing about 1.5 MB of mark-up. Perhaps that is a clue
that the dictionary has not been fully populated, which probably
explains the empty result sets, but I cannot figure out why.

Any assistance would be gratefully received.

Thank you.

On 5/6/2016 9:42 AM, Erick Erickson wrote:
> First off, kudos for providing the details, that really helps!
>
> The root of your problem is that your suggest field has stored="false".
> DocumentDictionaryFactory reads through all the
> docs in your corpus, extracts the stored data and puts it in the FST. 
Since
> you don't have any stored data your FST is...er...minimal.
>
> I'd also add
> suggester_fuzzy_dir
> to the searchComponent. You'll find the FST on disk in that directory 
where it
> can be read next time Solr starts up. It is also helpful for figuring out
> whether there are suggestions to be had.
>
> And a minor nit, you probably don't want to specify suggest.dictionary
> in your query,
> that's already specified in your config.
>
> And it looks like you're alive to the fact that with that setup
> capitalization matters
> as does the fact that these suggestions be matched from the beginning of 
the
> field...
>
> Best,
> Erick
>
> On Thu, May 5, 2016 at 1:05 AM, Grigoris Iliopoulos
>  wrote:
>> Hi there,
>>
>> I want to use the Solr suggester component for city names. I have the
>> following settings:
>> schema.xml
>>
>> Field definition
>>
>> 
>>   
>> 
>> 
>> 
>>   
>> 
>>
>> The field i want to apply the suggester on
>>
>> 
>>
>> The copy field
>>
>> 
>>
>> The field
>>
>> 
>>
>> solr-config.xml
>>
>> 
>>   
>> true
>> 10
>> mySuggester
>>   
>>   
>> suggest
>>   
>> 
>>
>>
>>
>> 
>>   
>> mySuggester
>> FuzzyLookupFactory
>> DocumentDictionaryFactory
>> citySuggest
>> string
>>   
>> 

Re: Solr Suggester no results

2016-08-25 Thread Bradley Belyeu
Scott, I’m fairly new to suggesters having just recently built my first one. 
But where my configuration differs from yours is on this line:

<str name="suggestAnalyzerFieldType">string</str>

I used the field type name that I had defined instead like:

<str name="suggestAnalyzerFieldType">textSuggest</str>

I’m not certain that would help, but I can’t see where your config is much 
different form mine elsewhere.
What version of Solr are you running?


On 8/25/16, 6:41 PM, "Scott Vanderbilt"  wrote:

I'm having the exact same problem the O.P. describes from his email back 
in May, but my configuration does not have the same defect his had. So I 
am at a loss to understand why my suggest queries are returning no results.

Here is my config:

Relevant bits from schema.xml:
--


...

...


...


   
   
   



Relevant bits from solrconfig.xml:
--

   
 mySuggester
 true
 10
   
   
 suggest
   

...

   
 mySuggester
 AnalyzingInfixLookupFactory
 suggester_infix_dir
 DocumentDictionaryFactory
 text_suggest
 string
 false
 false
   


I build the dictionary like this:

   http://example.com:8983/solr/rrib/suggest?suggest.build=true

and get back this response:


   
  0
  964
   
   build


I then attempt a query like this:

http://example.com:8983/solr/rrib/suggest?q=re

and get back this response:


   
  0
  37
   
   
  
 
0

 
  
   


In Solr's Admin interface, I see the following under OTHER->suggest on 
Plugins/Stats page:

class:suggest
description:Suggester component
src:
version:6.1.0
stats:
mySuggester:   SolrSuggester [ name=mySuggester,
sourceLocation=null, storeDir=,
lookupImpl=AnalyzingInfixLookupFactory,
dictionaryImpl=DocumentDictionaryFactory, sizeInBytes=6795 ]
totalSizeInBytes:   6795

The value of 6,795 bytes seems pretty small to me for a repository of 
403 XML files containing about 1.5 MB of mark-up. Perhaps that is a clue 
that the dictionary has not been fully populated, which probably 
explains the empty result sets, but I cannot figure out why.

Any assistance would be gratefully received.

Thank you.

On 5/6/2016 9:42 AM, Erick Erickson wrote:
> First off, kudos for providing the details, that really helps!
>
> The root of your problem is that your suggest field has stored="false".
> DocumentDictionaryFactory reads through all the
> docs in your corpus, extracts the stored data and puts it in the FST. 
Since
> you don't have any stored data your FST is...er...minimal.
>
> I'd also add
> suggester_fuzzy_dir
> to the searchComponent. You'll find the FST on disk in that directory 
where it
> can be read next time Solr starts up. It is also helpful for figuring out
> whether there are suggestions to be had.
>
> And a minor nit, you probably don't want to specify suggest.dictionary
> in your query,
> that's already specified in your config.
>
> And it looks like you're alive to the fact that with that setup
> capitalization matters
> as does the fact that these suggestions be matched from the beginning of 
the
> field...
>
> Best,
> Erick
>
> On Thu, May 5, 2016 at 1:05 AM, Grigoris Iliopoulos
>  wrote:
>> Hi there,
>>
>> I want to use the Solr suggester component for city names. I have the
>> following settings:
>> schema.xml
>>
>> Field definition
>>
>> 
>>   
>> 
>> 
>> 
>>   
>> 
>>
>> The field i want to apply the suggester on
>>
>> 
>>
>> The copy field
>>
>> 
>>
>> The field
>>
>> 
>>
>> solr-config.xml
>>
>> 
>>   
>> true
>> 10
>> mySuggester
>>   
>>   
>> suggest
>>   
>> 
>>
>>
>>
>> 
>>   
>> mySuggester
>> FuzzyLookupFactory
>> DocumentDictionaryFactory
>> citySuggest
>> string
>>   
>> 
>>
>> Then i run
>>
>> 
http://localhost:8983/solr/company/suggest?suggest=true&suggest.dictionary=mySuggester&wt=json&suggest.q=Ath&suggest.build=true
>>
>> to build the suggest component
>>
>> Finally i run
>>
>>
http://localhost:8983/solr/company/suggest?suggest=true&suggest.dictionary=mySuggester&wt=json&suggest.q=Ath
>>
 

Re: Inventor-template vs Inventor template - issue with hyphen

2016-08-25 Thread Erick Erickson
Look at your admin/analysis page. WordDelimiterFilterFactory breaks on
non-alphanumerics. Also, adding &debug=query will show you the parsed form of
the query, and that'll help.

On Aug 25, 2016 4:41 PM, "Shamik Bandopadhyay"  wrote:

Hi,

  I'm trying to figure out search behaviour related to two similar terms, one
with and one without a hyphen. The two searches generate different result
sets; the one without the hyphen brings back more results than the other.
Here's the fieldtype definition:






















If I run the search terms through the analyzer, the final indexed data for
both terms (with and without the hyphen) comes out as --> *inventor templat*

I was under the impression that, based on my analyzers, both search terms
would produce the same result.

Here's the output from debug and splainer.

*Inventor-template*
*-*

(+DisjunctionMaxQuery(((+CommandSrch:inventor
+CommandSrch:templat) | text:"inventor templat"^1.5 | Description:"inventor
templat"^2.0 | title:"inventor templat"^3.5 | keywords:"inventor
templat"^1.2)~0.01) Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147208320),
date(PublishDate)))+1.0)))/no_coord

+((+CommandSrch:inventor
+CommandSrch:templat) | text:"inventor templat"^1.5 | Description:"inventor
templat"^2.0 | title:"inventor templat"^3.5 | keywords:"inventor
templat"^1.2)~0.01
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)

From Splainer:

10.974786 Sum of the following:
 9.203462 Dismax (max plus:0.01 times others)
   9.198681 title:"inventor templat"

   0.4781131 text:"inventor templat"

 1.7644342 Source2:sfdcarticles

 0.006889837 1.0/(3.16E-11*float(ms(const(147208320),date(
PublishDate)))+1.0)


*Inventor template*
*--*

(+(+DisjunctionMaxQuery((CommandSrch:inventor |
text:inventor^1.5 | Description:inventor^2.0 | title:inventor^3.5 |
keywords:inventor^1.2)~0.01) +DisjunctionMaxQuery((CommandSrch:templat |
text:templat^1.5 | Description:templat^2.0 | title:templat^3.5 |
keywords:templat^1.2)~0.01)) Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147208320),
date(PublishDate)))+1.0)))/no_coord

+(+(CommandSrch:inventor |
text:inventor^1.5 | Description:inventor^2.0 | title:inventor^3.5 |
keywords:inventor^1.2)~0.01 +(CommandSrch:templat | text:templat^1.5 |
Description:templat^2.0 | title:templat^3.5 | keywords:templat^1.2)~0.01)
Source2:sfdcarticles^9.0 Source2:downloads^5.0
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)

From Splainer:

9.915069 Sum of the following:
 5.03947 Dismax (max plus:0.01 times others)
   5.038846 title:templat

   0.062400598 text:templat

 4.767776 Dismax (max plus:0.01 times others)
   4.7674117 title:inventor

   0.03642158 text:inventor

 0.098686054 Source2:CloudHelp

 0.009136423
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)


I'm using edismax.


Just wondering what I'm missing here. Any help will be appreciated.

Regards,
Shamik


Re: solr.NRTCachingDirectoryFactory

2016-08-25 Thread Rallavagu

Follow up update ...

Set autowarm count to zero for the caches for NRT and I could negotiate 
latency from 2 min to 5 min :)


However, I'm still seeing high QTimes and wondering where else I can look.
Should I debug the code or run some tools to isolate the bottleneck (disk
I/O, CPU, or the query itself)? Looking for some tuning advice. Thanks.



On 7/26/16 9:42 AM, Erick Erickson wrote:

And, I might add, you should look through your old logs
and see how long it takes to open a searcher. Let's
say Shawn's lower bound is what you see, i.e.
it takes a minute each to execute all the autowarming
in filterCache and queryResultCache... So your current
latency is _at least_ 2 minutes between the time something
is indexed and when it's available for search, just for autowarming.

Plus up to another 2 minutes for your soft commit interval
to expire.

So if your business people haven't noticed a 4 minute
latency yet, tell them they don't know what they're talking
about when they insist on the NRT interval being a few
seconds ;).

Best,
Erick

On Tue, Jul 26, 2016 at 7:20 AM, Rallavagu  wrote:



On 7/26/16 5:46 AM, Shawn Heisey wrote:


On 7/22/2016 10:15 AM, Rallavagu wrote:











As Erick indicated, these settings are incompatible with Near Real Time
updates.

With those settings, every time you commit and create a new searcher,
Solr will execute up to 1000 queries (potentially 500 for each of the
caches above) before that new searcher will begin returning new results.

I do not know how fast your filter queries execute when they aren't
cached... but even if they only take 100 milliseconds each, that could
take up to a minute for filterCache warming.  If each one takes two
seconds and there are 500 entries in the cache, then autowarming the
filterCache would take nearly 17 minutes. You would also need to wait
for the warming queries on queryResultCache.

The autowarmCount on my filterCache is 4, and warming that cache *still*
sometimes takes ten or more seconds to complete.

If you want true NRT, you need to set all your autowarmCount values to
zero.  The tradeoff with NRT is that your caches are ineffective
immediately after a new searcher is created.


Will look into this and make changes as suggested.



Looking at the "top" screenshot ... you have plenty of memory to cache
the entire index.  Unless your queries are extreme, this is usually
enough for good performance.

One possible problem is that cache warming is taking far longer than
your autoSoftCommit interval, and the server is constantly busy making
thousands of warming queries.  Reducing autowarmCount, possibly to zero,
*might* fix that. I would expect higher CPU load than what your
screenshot shows if this were happening, but it still might be the
problem.


Great point. Thanks for the help.



Thanks,
Shawn





Re: Solr Suggester no results

2016-08-25 Thread Scott Vanderbilt
I'm having the exact same problem the O.P. describes from his email back 
in May, but my configuration does not have the same defect his had. So I 
am at a loss to understand why my suggest queries are returning no results.


Here is my config:

Relevant bits from schema.xml:
--
multiValued="true"/>
stored="true" multiValued="true"/>

...
stored="true" multiValued="true" />

...


...
positionIncrementGap="100">

   
  
  
  
   


Relevant bits from solrconfig.xml:
--

  
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

...

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="indexPath">suggester_infix_dir</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">text_suggest</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>


I build the dictionary like this:

  http://example.com:8983/solr/rrib/suggest?suggest.build=true

and get back this response:

   
  
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">964</int>
  </lst>
  <str name="command">build</str>
</response>

I then attempt a query like this:

   http://example.com:8983/solr/rrib/suggest?q=re

and get back this response:

   
  
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">37</int>
  </lst>
  <lst name="suggest">
    <lst name="mySuggester">
      <lst name="re">
        <int name="numFound">0</int>
        <arr name="suggestions"/>
      </lst>
    </lst>
  </lst>
</response>

In Solr's Admin interface, I see the following under OTHER->suggest on 
Plugins/Stats page:


   class:suggest
   description:Suggester component
   src:
   version:6.1.0
   stats:
   mySuggester:   SolrSuggester [ name=mySuggester,
   sourceLocation=null, storeDir=,
   lookupImpl=AnalyzingInfixLookupFactory,
   dictionaryImpl=DocumentDictionaryFactory, sizeInBytes=6795 ]
   totalSizeInBytes:   6795

The value of 6,795 bytes seems pretty small to me for a repository of 
403 XML files containing about 1.5 MB of mark-up. Perhaps that is a clue 
that the dictionary has not been fully populated, which probably 
explains the empty result sets, but I cannot figure out why.


Any assistance would be gratefully received.

Thank you.

On 5/6/2016 9:42 AM, Erick Erickson wrote:

First off, kudos for providing the details, that really helps!

The root of your problem is that your suggest field has stored="false".
DocumentDictionaryFactory reads through all the
docs in your corpus, extracts the stored data and puts it in the FST. Since
you don't have any stored data your FST is...er...minimal.

I'd also add
suggester_fuzzy_dir
to the searchComponent. You'll find the FST on disk in that directory where it
can be read next time Solr starts up. It is also helpful for figuring out
whether there are suggestions to be had.

And a minor nit, you probably don't want to specify suggest.dictionary
in your query,
that's already specified in your config.

And it looks like you're alive to the fact that with that setup
capitalization matters
as does the fact that these suggestions be matched from the beginning of the
field...

Best,
Erick

On Thu, May 5, 2016 at 1:05 AM, Grigoris Iliopoulos
 wrote:

Hi there,

I want to use the Solr suggester component for city names. I have the
following settings:
schema.xml

Field definition


  



  


The field i want to apply the suggester on



The copy field



The field



solr-config.xml


  
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">citySuggest</str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
  


Then i run

http://localhost:8983/solr/company/suggest?suggest=true&suggest.dictionary=mySuggester&wt=json&suggest.q=Ath&suggest.build=true

to build the suggest component

Finally i run

   
http://localhost:8983/solr/company/suggest?suggest=true&suggest.dictionary=mySuggester&wt=json&suggest.q=Ath

but i get an empty result set

{"responseHeader":{"status":0,"QTime":0},"suggest":{"mySuggester":{"Ath":{"numFound":0,"suggestions":[]

Are there any obvious mistakes? Any thoughts?






Inventor-template vs Inventor template - issue with hyphen

2016-08-25 Thread Shamik Bandopadhyay
Hi,

  I'm trying to figure out search behaviour related to two similar terms, one
with and one without a hyphen. The two searches generate different result
sets; the one without the hyphen brings back more results than the other.
Here's the fieldtype definition:






















If I run the search terms through the analyzer, the final indexed data for
both terms (with and without the hyphen) comes out as --> *inventor templat*

I was under the impression that, based on my analyzers, both search terms
would produce the same result.

Here's the output from debug and splainer.

*Inventor-template*
*-*

(+DisjunctionMaxQuery(((+CommandSrch:inventor
+CommandSrch:templat) | text:"inventor templat"^1.5 | Description:"inventor
templat"^2.0 | title:"inventor templat"^3.5 | keywords:"inventor
templat"^1.2)~0.01) Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)))/no_coord

+((+CommandSrch:inventor
+CommandSrch:templat) | text:"inventor templat"^1.5 | Description:"inventor
templat"^2.0 | title:"inventor templat"^3.5 | keywords:"inventor
templat"^1.2)~0.01
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)

From Splainer:

10.974786 Sum of the following:
 9.203462 Dismax (max plus:0.01 times others)
   9.198681 title:"inventor templat"

   0.4781131 text:"inventor templat"

 1.7644342 Source2:sfdcarticles

 0.006889837 
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)


*Inventor template*
*--*

(+(+DisjunctionMaxQuery((CommandSrch:inventor |
text:inventor^1.5 | Description:inventor^2.0 | title:inventor^3.5 |
keywords:inventor^1.2)~0.01) +DisjunctionMaxQuery((CommandSrch:templat |
text:templat^1.5 | Description:templat^2.0 | title:templat^3.5 |
keywords:templat^1.2)~0.01)) Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)))/no_coord

+(+(CommandSrch:inventor |
text:inventor^1.5 | Description:inventor^2.0 | title:inventor^3.5 |
keywords:inventor^1.2)~0.01 +(CommandSrch:templat | text:templat^1.5 |
Description:templat^2.0 | title:templat^3.5 | keywords:templat^1.2)~0.01)
Source2:sfdcarticles^9.0 Source2:downloads^5.0
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)

From Splainer:

9.915069 Sum of the following:
 5.03947 Dismax (max plus:0.01 times others)
   5.038846 title:templat

   0.062400598 text:templat

 4.767776 Dismax (max plus:0.01 times others)
   4.7674117 title:inventor

   0.03642158 text:inventor

 0.098686054 Source2:CloudHelp

 0.009136423
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)


I'm using edismax.


Just wondering what I'm missing here. Any help will be appreciated.

Regards,
Shamik


Default stopword list

2016-08-25 Thread Steven White
Hi everyone,

I'm curious: how was the current "default" stopword list, for English and
other languages, determined?  And for English, why is "I" not in the
stopword list?

Thanks in advance.

Steve


Re: Question about indexing PDFs

2016-08-25 Thread Erick Erickson
That is always a dangerous assumption. Are you sure
you're searching on the proper field? Are you sure it's indexed? Are
you sure it's...

The schema browser I indicated above will give you some
idea what's actually in the field. You can not only see the
fields Solr (actually Lucene) see in your index, but you can
also see what some of the terms are.

Adding &debug=query and looking at the parsed query
will show you what fields are being searched against. The
most common causes of what you're describing are:

> not searching against the field you think you are. This
is very easy to do without knowing it.

> not actually having 'indexed="true" set in your schema

> not committing after inserting the doc
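
A quick way to rule the last one out -- assuming a collection named
"mycollection" -- is to issue an explicit commit and re-run the search:

curl 'http://localhost:8983/solr/mycollection/update?commit=true'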

Best,
Erick

On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <
betsey.ben...@stresearch.com> wrote:

> It looks like the metadata of the PDFs was indexed, but not the content
> (which is what I was interested in).  Searches on terms I know exist in
> the content come up empty.
>
> On 8/25/16, 2:16 PM, "Betsey Benagh"  wrote:
>
> >Right, that's where I looked.  No 'content'.  Which is what confused me.
> >
> >
> >On 8/25/16, 1:56 PM, "Erick Erickson"  wrote:
> >
> >>when you say "I don't see it in the schema for that collection" are you
> >>talking schema.xml? managed_schema? Or actual documents in the index?
> >>Often
> >>these are defined by dynamic fields and the like in the schema files.
> >>
> >>Take a look at the admin UI>>schema browser>>drop down and you'll see all
> >>the actual fields in your index...
> >>
> >>Best,
> >>Erick
> >>
> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
> >> >>> wrote:
> >>
> >>> Following the instructions in the quick start guide, I imported a bunch
> >>>of
> >>> PDF documents into my Solr 6.0 instance.  As far as I can tell from the
> >>> documentation, there should be a 'content' field indexing, well, the
> >>> content, but I don't see it in the schema for that collection.  Is
> >>>there
> >>> something obvious I might have missed?
> >>>
> >>> Thanks!
> >>>
> >>>
> >
>
>


Re: solrcloud 6.0.1 any suggestions for fixing a replica that stubbornly remains down

2016-08-25 Thread Erick Erickson
This is odd. The ADDREPLICA _should_ be immediately listed as "down", but
should shortly go to
"recovering" and then "active". The transition to "active" may take a while
as the index has to be
copied from the leader, but you shouldn't be stuck at "down" for very long.

Take a look at the Solr logs for both the leader of the shard and the
replica you're trying to add. They
often have more complete and helpful error messages...

Also note that you occasionally have to be patient. For instance, there's a
3 minute wait period for
leader election at times. It sounds, though, like things aren't getting
better for far longer than 3 minutes.
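
For reference, an ADDREPLICA call takes this general form (the collection
and shard names are from this thread; the host and node names are
placeholders):

http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=transcribedReports&shard=shard1&node=node4:8983_solr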

Best,
Erick

On Thu, Aug 25, 2016 at 2:00 PM, Jon Hawkesworth <
jon.hawkeswo...@medquist.onmicrosoft.com> wrote:

> Anyone got any suggestions for how I can fix my solrcloud 6.0.1 replica
> remains down issue?
>
>
>
> Today we stopped all the loading and querying, brought down all 4 solr
> nodes, went into zookeeper and deleted everything under /collections/
> transcribedReports/leader_initiated_recovery/shard1/ and brought the
> cluster back up (this seeming to be a reasonably similar situation to
> https://issues.apache.org/jira/browse/SOLR-7021, where this workaround is
> described, albeit for an older version of solr).
>
>
>
> After a while things looked ok but when we attempted to move the second
> replica back to the original node (by creating a third and then deleting
> the temp one which wasn't on the node we wanted it on), we immediately got
> a 'down' status on the node (and it's stayed that way ever since), with ' Could
> not publish as ACTIVE after succesful recovery ' messages appearing in
> the logs
>
>
>
> It's as if there is something specifically wrong with that node that stops
> us from ever having a functioning replica of shard1 on it.
>
>
>
> The weird thing is that shard2 on the same (problematic) node seems fine.
>
>
>
> Other stuff we have tried includes
>
>
>
> issuing a REQUESTRECOVERY
>
> moving from 2 to 4 nodes
>
> adding more replicas on other nodes (new replicas immediately go into down
> state and stay that way).
>
>
>
> System is solrcloud 6.0.1 running on 4 nodes.  There's 1 collection with 4
> shards and I'm trying to have 2 replicas on each of the 4 nodes.
>
> Currently each shard is managing approx 1.2 million docs (mostly just text
> 10-20k in size each usually).
>
>
>
> Any suggestions would be greatly appreciated.
>
>
>
> Many thanks,
>
>
>
> Jon
>
>
>
>
>
> *Jon Hawkesworth*
> Software Developer
>
>
>
>
>
> Hanley Road, Malvern, WR13 6NP. UK
>
> O: +44 (0) 1684 312313
>
> *jon.hawkeswo...@mmodal.com  www.mmodal.com
> *
>
>
>
> *This electronic mail transmission contains confidential information
> intended only for the person(s) named. Any use, distribution, copying or
> disclosure by another person is strictly prohibited. If you are not the
> intended recipient of this e-mail, promptly delete it and all attachments.*
>
>
>


solrcloud 6.0.1 any suggestions for fixing a replica that stubbornly remains down

2016-08-25 Thread Jon Hawkesworth
Anyone got any suggestions for how I can fix my solrcloud 6.0.1 replica
remains down issue?

Today we stopped all the loading and querying, brought down all 4 solr nodes, 
went into zookeeper and deleted everything under 
/collections/transcribedReports/leader_initiated_recovery/shard1/ and brought 
the cluster back up (this seeming to be a reasonably similar situation to 
https://issues.apache.org/jira/browse/SOLR-7021, where this workaround is 
described, albeit for an older version of solr).

After a while things looked ok but when we attempted to move the second replica 
back to the original node (by creating a third and then deleting the temp one 
which wasn't on the node we wanted it on), we immediately got a 'down' status 
on the node (and it's stayed that way ever since), with 'Could not publish as
ACTIVE after succesful recovery' messages appearing in the logs.

It's as if there is something specifically wrong with that node that stops us 
from ever having a functioning replica of shard1 on it.

The weird thing is that shard2 on the same (problematic) node seems fine.

Other stuff we have tried includes

issuing a REQUESTRECOVERY (see the example after this list)
moving from 2 to 4 nodes
adding more replicas on other nodes (new replicas immediately go into down 
state and stay that way).
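
For the record, that REQUESTRECOVERY is a core-admin call of this shape (the
core name is taken from the logs in this thread; the host is a placeholder):

http://host:8983/solr/admin/cores?action=REQUESTRECOVERY&core=transcribedReports_shard1_replica3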

System is solrcloud 6.0.1 running on 4 nodes.  There's 1 collection with 4 
shards and I'm trying to have 2 replicas on each of the 4 nodes.
Currently each shard is managing approx 1.2 million docs (mostly just text 
10-20k in size each usually).

Any suggestions would be greatly appreciated.

Many thanks,

Jon


Jon Hawkesworth
Software Developer



Hanley Road, Malvern, WR13 6NP. UK
O: +44 (0) 1684 312313
jon.hawkeswo...@mmodal.com
www.mmodal.com

This electronic mail transmission contains confidential information intended 
only for the person(s) named. Any use, distribution, copying or disclosure by 
another person is strictly prohibited. If you are not the intended recipient of 
this e-mail, promptly delete it and all attachments.



changing the /solr path, additional steps needed for 6.1

2016-08-25 Thread Chris Morley
This might help some people:
  
 To change the URL to server:port/ourspecialpath from server:port/solr is a 
bit inconvenient.  You have to change several files where the solr part of 
the request path is hardcoded:
  
 server/solr-webapp/webapp/WEB-INF/web.xml
 server/solr/solr.xml
 server/contexts/solr-jetty-context.xml
  
 Now, with the release of the New UI defaulted to on in 6.1, you also have 
to change:
 server/solr-webapp/webapp/js/angular/services.js
 (in a bunch of places)
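  
 For the jetty context piece, the relevant bit of 
server/contexts/solr-jetty-context.xml is the default contextPath -- roughly 
like this (the surrounding file contents here are from memory, so treat this 
as a sketch):
  
 <Configure class="org.eclipse.jetty.webapp.WebAppContext">
   <Set name="contextPath"><Property name="hostContext" default="/ourspecialpath"/></Set>
   ...
 </Configure>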
  
 -Chris.
  
  



Re: another log question about solr 5

2016-08-25 Thread elisabeth benoit
Thanks! This is very helpful!

Best regards,
Elisabeth

2016-08-25 17:07 GMT+02:00 Shawn Heisey :

> On 8/24/2016 6:01 AM, elisabeth benoit wrote:
> > I was wondering what is the right way to prevent solr 5 from creating a
> > new log file at every startup (and renaming the actual file: mv
> > "$SOLR_LOGS_DIR/solr_gc.log" "$SOLR_LOGS_DIR/solr_gc_log_$(date
> > +"%Y%m%d_%H%M")")
>
> I think if you find and comment/remove the command in the startup script
> that renames the logfile, that would do it.  The default log4j config
> will rotate the logfiles.  You can comment the first part of the
> bin/solr section labeled "backup the log files before starting".  I
> would recommend NOT commenting the next part, which rotates the garbage
> collection log.
>
> You should also modify server/resources/log4j.properties to remove all
> mention of the CONSOLE output.  The console logfile is created by shell
> redirection, which means it is never rotated and can fill up your disk.
> It's a duplicate of information that goes into solr.log, so you don't
> need it.  This means removing ", CONSOLE" from the log4j.rootLogger line
> and entirely removing the lines that start with log4j.appender.CONSOLE.
>
> You might also want to adjust the log4j.appender.file.MaxFileSize line
> in log4j.properties -- 4 megabytes is very small, which means that your
> logfile history might not cover enough time to be useful.
>
> Dev note: I think we really need to include gc logfile rotation in the
> startup script.  If the java heap is properly sized, this file won't
> grow super-quickly, but it WILL grow, and that might cause issues.  I
> also think that the MaxFileSize default in log4j.properties needs to be
> larger.
>
> Thanks,
> Shawn
>
>


Re: Question about indexing PDFs

2016-08-25 Thread Betsey Benagh
Right, that's where I looked.  No 'content'.  Which is what confused me.


On 8/25/16, 1:56 PM, "Erick Erickson"  wrote:

>when you say "I don't see it in the schema for that collection" are you
>talking schema.xml? managed_schema? Or actual documents in the index?
>Often
>these are defined by dynamic fields and the like in the schema files.
>
>Take a look at the admin UI>>schema browser>>drop down and you'll see all
>the actual fields in your index...
>
>Best,
>Erick
>
>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
>> wrote:
>
>> Following the instructions in the quick start guide, I imported a bunch
>>of
>> PDF documents into my Solr 6.0 instance.  As far as I can tell from the
>> documentation, there should be a 'content' field indexing, well, the
>> content, but I don't see it in the schema for that collection.  Is there
>> something obvious I might have missed?
>>
>> Thanks!
>>
>>



Re: Question about indexing PDFs

2016-08-25 Thread Betsey Benagh
It looks like the metadata of the PDFs was indexed, but not the content
(which is what I was interested in).  Searches on terms I know exist in
the content come up empty.

On 8/25/16, 2:16 PM, "Betsey Benagh"  wrote:

>Right, that's where I looked.  No 'content'.  Which is what confused me.
>
>
>On 8/25/16, 1:56 PM, "Erick Erickson"  wrote:
>
>>when you say "I don't see it in the schema for that collection" are you
>>talking schema.xml? managed_schema? Or actual documents in the index?
>>Often
>>these are defined by dynamic fields and the like in the schema files.
>>
>>Take a look at the admin UI>>schema browser>>drop down and you'll see all
>>the actual fields in your index...
>>
>>Best,
>>Erick
>>
>>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
>>>> wrote:
>>
>>> Following the instructions in the quick start guide, I imported a bunch
>>>of
>>> PDF documents into my Solr 6.0 instance.  As far as I can tell from the
>>> documentation, there should be a 'content' field indexing, well, the
>>> content, but I don't see it in the schema for that collection.  Is
>>>there
>>> something obvious I might have missed?
>>>
>>> Thanks!
>>>
>>>
>



Re: Sorting non-english text

2016-08-25 Thread Vasu Y
Thank you Ahmet.

I have a couple of questions about using CollationKeyAnalyzer:
1) Is it enough to specify this Analyzer in schema.xml as shown below, or do
I need to pass any parameters like language etc.?
2) Do we need to define one CollationKeyAnalyzer per language?
3) I also noticed that there is one more analyzer called
ICUCollationKeyAnalyzer; how does CollationKeyAnalyzer compare against
ICUCollationKeyAnalyzer in terms of memory usage & performance?
4) When looking at the javadoc for CollationKeyAnalyzer, I noticed there are
some WARNINGS saying the JVM vendor, version & patch, and collation strength
need to be the same between indexing & query time. Does that mean that if, for
example, I update the JVM patch version, then already-indexed documents whose
indexed fields used CollationKeyAnalyzer need to be re-indexed, or else we
cannot query them?


  


Thanks,
Vasu

On Thu, Aug 25, 2016 at 7:59 PM, Ahmet Arslan 
wrote:

> Hi Vasu,
>
> There is a field type or something like that (CollationKeyAnalyzer) for
> language specific sorting.
>
> Ahmet
>
>
>
> On Thursday, August 25, 2016 12:29 PM, Vasu Y  wrote:
> Hi,
> I have a text field which can contain values (multiple tokens) in English;
> to support sorting, I had a <copyField> in schema.xml to copy this to a new
> field of type "lowercase" (defined as below).
> I also have text fields of type text_de, text_es, text_fr, ja, cn etc. I
> intend to add a <copyField> to copy them to a new field of type "lowercase"
> to support sorting.
>
> Would this "lowercase" field type work well for sorting non-English fields
> that are non-tokenized (or are single-term) or do you suggest to use a
> different tokenizer & filter?
>
>  
>   positionIncrementGap="100">
>
>  
>  
>
> 
>
> Thanks,
> Vasu
>


Re: Sorting non-english text

2016-08-25 Thread Ahmet Arslan
Hi,

I think there is a dedicated fieldType for this:


https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-UnicodeCollation
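
A minimal example from that page -- a collated type plus a sort field (the
field names here are illustrative):

<fieldType name="collatedGERMAN" class="solr.ICUCollationField" locale="de" strength="primary"/>
<field name="city_sort" type="collatedGERMAN" indexed="true" stored="false"/>

You would define one such fieldType per language/locale and copyField into
it for sorting.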

Ahmet


On Thursday, August 25, 2016 9:08 PM, Vasu Y  wrote:
Thank you Ahmet.

I have a couple of questions about using CollationKeyAnalyzer:
1) Is it enough to specify this Analyzer in schema.xml as shown below, or do
I need to pass any parameters like language etc.?
2) Do we need to define one CollationKeyAnalyzer per language?
3) I also noticed that there is one more analyzer called
ICUCollationKeyAnalyzer; how does CollationKeyAnalyzer compare against
ICUCollationKeyAnalyzer in terms of memory usage & performance?
4) When looking at the javadoc for CollationKeyAnalyzer, I noticed there are
some WARNINGS saying the JVM vendor, version & patch, and collation strength
need to be the same between indexing & query time. Does that mean that if, for
example, I update the JVM patch version, then already-indexed documents whose
indexed fields used CollationKeyAnalyzer need to be re-indexed, or else we
cannot query them?


  


Thanks,
Vasu


On Thu, Aug 25, 2016 at 7:59 PM, Ahmet Arslan 
wrote:

> Hi Vasu,
>
> There is a field type or something like that (CollationKeyAnalyzer) for
> language specific sorting.
>
> Ahmet
>
>
>
> On Thursday, August 25, 2016 12:29 PM, Vasu Y  wrote:
> Hi,
> I have a text field which can contain values (multiple tokens) in English;
> to support sorting, I had a <copyField> in schema.xml to copy this to a new
> field of type "lowercase" (defined as below).
> I also have text fields of type text_de, text_es, text_fr, ja, cn etc. I
> intend to add a <copyField> to copy them to a new field of type "lowercase"
> to support sorting.
>
> Would this "lowercase" field type work well for sorting non-English fields
> that are non-tokenized (or are single-term) or do you suggest to use a
> different tokenizer & filter?
>
>  
>   positionIncrementGap="100">
>
>  
>  
>
> 
>
> Thanks,
> Vasu
>


Re: Question about indexing PDFs

2016-08-25 Thread Erick Erickson
when you say "I don't see it in the schema for that collection" are you
talking schema.xml? managed_schema? Or actual documents in the index? Often
these are defined by dynamic fields and the like in the schema files.

Take a look at the admin UI>>schema browser>>drop down and you'll see all
the actual fields in your index...

Best,
Erick

On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh  wrote:

> Following the instructions in the quick start guide, I imported a bunch of
> PDF documents into my Solr 6.0 instance.  As far as I can tell from the
> documentation, there should be a 'content' field indexing, well, the
> content, but I don't see it in the schema for that collection.  Is there
> something obvious I might have missed?
>
> Thanks!
>
>


Re: Is it safe to upgrade an existing field to docvalues?

2016-08-25 Thread Alessandro Benedetti
Of course I see your point, Ronald, and don't get me wrong, I don't think it
is a bad idea.
I simply think it can bring some complexity and confusion if we start to use
it as a common approach.
Anyway, let's see what the other Solr gurus think :)

Cheers

On Thu, Aug 25, 2016 at 2:21 PM, Ronald Wood  wrote:

> Alessandro, yes I can see how this could be conceived of as a more general
> problem; and yes useDocValues also strikes me as being unlike the other
> properties since it would only be used temporarily.
>
> We’ve actually had to migrate fields from one to another when changing
> types, along with awkward naming like ‘fieldName’ (int) to ‘fieldNameLong’.
> But I’m not sure how a change like that could actually be done in place.
>
> The point is stronger when it comes to term vectors etc. where data exists
> in separate files and switches in code control whether they are used or not.
>
> I guess where I would argue that docValues might be different is that so
> much new functionality depends on this that it might be worth treating it
> differently. Given that docValues now is on by default, I wonder if it will
> at some point be mandatory, in which case everyone would have to migrate to
> keep up with Solr version. (Of course, I don’t know what the general
> thinking is on this amongst the implementers.)
>
> Regardless, this change may be so important to us that we’d choose to
> branch the code on GitHub and apply the patch ourselves, use it while we
> transition, and then deploy an official build once we’re done. The
> difference in the level of effort between this approach and the
> alternatives would be too great. The risks of using a custom build for
> production would have to be weighed carefully, naturally.
>
> - Ronald S. Wood
>
>
> On 8/25/16, 06:49, "Alessandro Benedetti"  wrote:
>
> > switching is done in Solr on field.hasDocValues. The code would be
> amended
> > to (field.hasDocValues && field.useDocValues) throughout.
> >
>
> This is correct. Currently we use DocValues if they are available, and
> to
> check the availabilty we check the schema attribute.
> This can be problematic in the scenarios you described ( for example
> half
> the index has docValues for a field and the other half not yet ).
>
> Your proposal is interesting.
> Technically it should work and should allow transparent migration from
> not
> docValues to docValues.
> But it is a risky one, because we are decreasing the readability a bit
> (
> althought a user will specify the attribute only in special cases like
> yours) .
>
> The only problem I see is that the same discussion we had for docValues
> actually applies to all other invasive schema changes :
> 1) you change the field type
> 2) you enable or disable term vectors
> 3) you enable/disable term positions,offsets ect ect
>
> So basically this is actually a general problem, that probably would
> require a general re-think .
> So although  can be a quick fix that will work, I fear can open the
> road to
> messy configuration attributes.
>
> Cheers
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>
>
>
>


-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Is it safe to upgrade an existing field to docvalues?

2016-08-25 Thread Ronald Wood
Thanks, Toke. 

I’m still surveying the code; do you know of a place in the code that might be 
more problematic?

We’d be mainly concerned about searching, sorting and (simple, low-cardinality) 
faceting working for us.

Some features like grouping are not currently used by us, so in a pinch a 
custom build might be a partial patch. We’ll just have to see.

- Ronald S. Wood


On 8/25/16, 06:50, "Toke Eskildsen"  wrote:

Ronald Wood  wrote:
> Did you find you had to do a full conversion all at once because simply 
turning on
> docvalues in the schema caused issues?

Yes.

> I ask because my presupposition has been that we could turn it on without 
any
> harm as we incrementally converted our indexes.

If you don't use the field for any queries until all the values have been 
re-built, I guess that would work. The I-am-not-so-sure part is how the merger 
handles the case of a field having DocValues in one segment and not in another.

But mixing docValued & non-docValued segments with a schema that says 
DocValues will make DocValue-using queries fail, as you seem to have 
encountered:

> But this won’t work if enabling docvalues in the schema will lead to 
errors when
> fields don’t have docvalues actually populated. I.e.. the 
“IllegalStateException:
> unexpected docvalues type NONE for field 'id' (expected=SORTED)” error I 
see.

> I’m still trying to get to the bottom of whether that error means I 
cannot safely do
> an incremental conversion in-place.

When you enable docValues in the Solr schema, Solr also uses that 
information when reading the data from the segments, so when the code detects 
the missing docValues in the segments themselves, it is already too far down 
the execution path to change strategy. Basically the contract (schema) is 
broken, so all bets are off.

Your gradual enabling would work, at least for faceting, if it was possible 
to force the selection code to only use the indexed values, but the current 
code does not have an option for forcing the use of the indexed value. You 
could add it as a (small) hack, if you are comfortable with that. I don't know 
how easy or hard it would to hack grouping.

- Toke Eskildsen





Question about indexing PDFs

2016-08-25 Thread Betsey Benagh
Following the instructions in the quick start guide, I imported a bunch of PDF 
documents into my Solr 6.0 instance.  As far as I can tell from the 
documentation, there should be a 'content' field indexing, well, the content, 
but I don't see it in the schema for that collection.  Is there something 
obvious I might have missed?

Thanks!



Re: help with DIH transformer to add a suffix to column names

2016-08-25 Thread Wendy
Hi Alex,

Thank you for your response.
It worked. I am very happy with the results. I report the steps below. The
purpose is to create a dynamic field to simplify the field definitions in the
managed-schema file and the field ranking in the solrconfig.xml file.

STEPS:

1. file creation of db-data-config.xml
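
A rough sketch of such a file -- the driver, URL, query and field names are
placeholders -- showing how the custom transformer from step 4 is wired onto
the entity:

<dataConfig>
  <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://host/db"/>
  <document>
    <entity name="report" query="SELECT ..."
            transformer="my.solr.transformer.FieldTransformer">
      <!-- placeholder: column-to-field mappings, if any -->
    </entity>
  </document>
</dataConfig>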


 






   
  
 
  

  
  
 
  


2. Modification of solrconfig.xml file: notice of the ranking

 

 
 
 




db-data-config.xml



  
  
  true  
  explicit
  edismax
   pdb_id_stem^20.0
   title_stem^20.0
keywords_stem^10.0
*_stem^0.3 
rest_fields_stem ^0.3  
  
7
1000
text 
  
 

3. Modification of managed-schema file: Notice of change ,
creation of  a dynamic field "*_stem",,, 

 

  
  

 


  
 
  
  
  
  
  
  
  
  
  
  
   
  
   
  
  
  
  
 

 pdb_id_stem

4. creation of a custom transformer:

package my.solr.transformer;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataImporter;
import org.apache.solr.handler.dataimport.Transformer;

public class FieldTransformer extends Transformer {
    @Override
    public Map<String, Object> transformRow(Map<String, Object> row, Context context) {

        // Entity field definitions (unused here, but handy when debugging)
        List<Map<String, String>> fields = context.getAllEntityFields();

        int rowSize = row.size();

        // Snapshot the keys so the map can be modified while iterating
        Set<String> keySet = row.keySet();
        List<String> keyList = new ArrayList<>(keySet);

        for (int i = 0; i < rowSize; i++) {
            String columnName = keyList.get(i);
            Object value = row.get(columnName);
            if (value != null && !value.toString().trim().equals("")) {
                // Re-add the value under a "_stem"-suffixed key so it lands in
                // the "*_stem" dynamic field defined in managed-schema
                row.put(columnName + "_stem", value.toString().trim());
            }
            // Drop the original column; only the suffixed copies are indexed
            row.remove(columnName);
        }
        System.out.println("row size ended = " + row.size());

        return row;
    }
}

5. NOTE: when using a custom transformer, you need to copy the following two
jar files to this destination:
cp  solr-dataimporthandler-6.1.0.jar  
  /opt/solr-6.1.0/server/solr-webapp/webapp/WEB-INF/lib

cp  solr-dataimporthandler-extras-6.1.0.jar 
  /opt/solr-6.1.0/server/solr-webapp/webapp/WEB-INF/lib

6. Screenshot (image not included).




--
View this message in context: 
http://lucene.472066.n3.nabble.com/help-with-DIH-transformer-to-add-a-suffix-to-column-names-tp4292448p4293261.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: another log question about solr 5

2016-08-25 Thread Shawn Heisey
On 8/24/2016 6:01 AM, elisabeth benoit wrote:
> I was wondering what is the right way to prevent solr 5 from creating a new
> log file at every startup (and renaming the actual file: mv
> "$SOLR_LOGS_DIR/solr_gc.log" "$SOLR_LOGS_DIR/solr_gc_log_$(date
> +"%Y%m%d_%H%M")")

I think if you find and comment/remove the command in the startup script
that renames the logfile, that would do it.  The default log4j config
will rotate the logfiles.  You can comment the first part of the
bin/solr section labeled "backup the log files before starting".  I
would recommend NOT commenting the next part, which rotates the garbage
collection log.

You should also modify server/resources/log4j.properties to remove all
mention of the CONSOLE output.  The console logfile is created by shell
redirection, which means it is never rotated and can fill up your disk. 
It's a duplicate of information that goes into solr.log, so you don't
need it.  This means removing ", CONSOLE" from the log4j.rootLogger line
and entirely removing the lines that start with log4j.appender.CONSOLE.

You might also want to adjust the log4j.appender.file.MaxFileSize line
in log4j.properties -- 4 megabytes is very small, which means that your
logfile history might not cover enough time to be useful.
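
Concretely, that means editing lines like these in
server/resources/log4j.properties (the 100MB figure is just a suggestion):

log4j.rootLogger=INFO, file, CONSOLE  -->  log4j.rootLogger=INFO, file
log4j.appender.file.MaxFileSize=4MB   -->  log4j.appender.file.MaxFileSize=100MB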

Dev note: I think we really need to include gc logfile rotation in the
startup script.  If the java heap is properly sized, this file won't
grow super-quickly, but it WILL grow, and that might cause issues.  I
also think that the MaxFileSize default in log4j.properties needs to be
larger.

Thanks,
Shawn



Re: Use function in condition

2016-08-25 Thread Emir Arnautovic

Hi Nabil,

You have a limited set of functions, but there are logical functions (or, 
and, not) and the query function, so you can build more complex queries:


fq={!frange l=1}and(query($sub1),termfreq(field3, 300))&sub1={!frange 
l=100}sum(field1,field2)

The and() will return 1 (true) for docs matching both function terms.
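
Since or() and and() nest, a filter shaped like (cond1 OR cond2 AND cond3) can
be built the same way; a sketch, with $c1/$c2/$c3 as illustrative parameter
names:

fq={!frange l=1}or(query($c1),and(query($c2),query($c3)))&c1=...&c2=...&c3=...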

It would be much simpler if Solr supported relational functions: gt, lt, eq.

Hope this gives you ideas how to proceed.

Emir

On 25.08.2016 12:06, nabil Kouici wrote:

Hi Emir,

Thank you for your reply. I've tested the function range query and it solves 
50% of my need. The problem is I'm not able to use it with other conditions. 
For example:
fq={!frange l=100}sum(field1,field2)  and field3:200

or
fq=({!frange l=100}sum(field1,field2))  and (field3:200)

This gives me an exception: org.apache.solr.search.SyntaxError: Unexpected 
text after function: AND Field3:200
I know that I can use multiple fq, but the problem is that I can have complex 
filters like (cond1 OR cond2 AND cond3).
Could you please help.
Regards,
Nabil.

   From: Emir Arnautovic 
  To: solr-user@lucene.apache.org
  Sent: Wednesday, 17 August 2016 at 17:08
  Subject: Re: Use function in condition

Hi Nabil,


You can use frange queries, e.g. you can use fq={!frange
l=100}sum(field1,field2) to filter docs with a sum greater than 100.

Regards,
Emir


On 17.08.2016 16:26, nabil Kouici wrote:

Hi,
Is it possible to use functions (function query 
https://cwiki.apache.org/confluence/display/solr/Function+Queries) in q or fq 
parameters to build a complex search expression?
For example, take only documents where sum(field1,field2) > 100. Another example: 
if(test,value1,value2):value3
Regards,
Nabil.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Sorting non-english text

2016-08-25 Thread Ahmet Arslan
Hi Vasu,

There is a field type (backed by CollationKeyAnalyzer) for language-specific 
sorting.
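
In the Solr schema this is exposed as a collation field type; a minimal sketch
for German (the field names are hypothetical, and solr.ICUCollationField
requires the ICU contrib jars on the classpath):

<fieldType name="collated_de" class="solr.ICUCollationField"
           locale="de" strength="primary"/>
<field name="title_sort_de" type="collated_de" indexed="true" stored="false"/>
<copyField source="title_de" dest="title_sort_de"/>

Sorting then becomes sort=title_sort_de asc.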

Ahmet



On Thursday, August 25, 2016 12:29 PM, Vasu Y  wrote:
Hi,
I have a text field which can contain values (multiple tokens) in English;
to support sorting, I had a <copyField> in schema.xml to copy this to a new
field of type "lowercase" (defined as below).
I also have text fields of type text_de, text_es, text_fr, ja, cn etc. I
intend to add a <copyField> for each, to copy them to a new field of type
"lowercase" to support sorting.

Would this "lowercase" field type work well for sorting non-English fields
that are non-tokenized (or are single-term) or do you suggest to use a
different tokenizer & filter?

<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


Thanks,
Vasu


Re: Most popular fields under a list of documents

2016-08-25 Thread Mikhail Khludnev
Did you consider field facet?
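
For instance, assuming the new multi-valued int field is named "ints", a
sketch of the request:

q=freetext:little&rows=0&facet=true&facet.field=ints&facet.limit=10

Facet counts come back ordered by how many matching documents contain each
value, so the first entry is the most popular int among the results.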

On Thu, Aug 25, 2016 at 3:35 PM, Algirdas Jokubauskas  wrote:

> Hi,
>
> So I've been trying to figure out how to accomplish this one, but couldn't
> find anything that would not kill performance.
>
> I have a document type with a bunch of info that I use for various tasks,
> but I want to add a new field which is a list of ints.
>
> Then I want to do a free text search of that document and get a list of top
> 10 most popular ints among the results.
>
> So if say I had these documents:
>
> DocA(ints(1,5,7), freetext: "Marry had a little lamb")
> DocB(ints(4,3,5), freetext: "Marry had a little wolf")
> DocC(ints(5,1,8), freetext: "Marry had a big goat")
>
> and if I search for "little", and ask for the most popular int I would get
> 5
>
> In a normal case I would ask for 10 most common and there would be a few
> hundred thousand docs and a few hundred ints in each doc.
>
> I'm stumped. Any tips? Thanks.
>
> - AJ
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Is it safe to upgrade an existing field to docvalues?

2016-08-25 Thread Ronald Wood
Alessandro, yes I can see how this could be conceived of as a more general 
problem; and yes useDocValues also strikes me as being unlike the other 
properties since it would only be used temporarily.

We’ve actually had to migrate fields from one to another when changing types, 
along with awkward naming like ‘fieldName’ (int) to ‘fieldNameLong’. But I’m 
not sure how a change like that could actually be done in place.

The point is stronger when it comes to term vectors etc. where data exists in 
separate files and switches in code control whether they are used or not.

I guess where I would argue that docValues might be different is that so much 
new functionality depends on it that it might be worth treating it 
differently. Given that docValues is now on by default, I wonder if it will at 
some point be mandatory, in which case everyone would have to migrate to keep 
up with the Solr version. (Of course, I don’t know what the general thinking is on 
this amongst the implementers.)

Regardless, this change may be so important to us that we’d choose to branch 
the code on GitHub and apply the patch ourselves, use it while we transition, 
and then deploy an official build once we’re done. The difference in the level 
of effort between this approach and the alternatives would be too great. The 
risks of using a custom build for production would have to be weighed 
carefully, naturally.

- Ronald S. Wood 


On 8/25/16, 06:49, "Alessandro Benedetti"  wrote:

> switching is done in Solr on field.hasDocValues. The code would be amended
> to (field.hasDocValues && field.useDocValues) throughout.
>

This is correct. Currently we use DocValues if they are available, and to
check the availability we check the schema attribute.
This can be problematic in the scenarios you described ( for example half
the index has docValues for a field and the other half not yet ).

Your proposal is interesting.
Technically it should work and should allow transparent migration from not
docValues to docValues.
But it is a risky one, because we are decreasing the readability a bit
(although a user will specify the attribute only in special cases like
yours).

The only problem I see is that the same discussion we had for docValues
actually applies to all other invasive schema changes :
1) you change the field type
2) you enable or disable term vectors
3) you enable/disable term positions, offsets, etc.

So basically this is actually a general problem that would probably
require a general re-think.
So although it can be a quick fix that will work, I fear it can open the road to
messy configuration attributes.

Cheers
-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England





Most popular fields under a list of documents

2016-08-25 Thread Algirdas Jokubauskas
Hi,

So I've been trying to figure out how to accomplish this one, but couldn't
find anything that would not kill performance.

I have a document type with a bunch of info that I use for various tasks,
but I want to add a new field which is a list of ints.

Then I want to do a free text search of that document and get a list of top
10 most popular ints among the results.

So if say I had these documents:

DocA(ints(1,5,7), freetext: "Marry had a little lamb")
DocB(ints(4,3,5), freetext: "Marry had a little wolf")
DocC(ints(5,1,8), freetext: "Marry had a big goat")

and if I search for "little", and ask for the most popular int I would get 5

In a normal case I would ask for 10 most common and there would be a few
hundred thousand docs and a few hundred ints in each doc.

I'm stumped. Any tips? Thanks.

- AJ


Re: Range Filter for Multi-Valued Date Fields

2016-08-25 Thread Iana Bondarska
Thanks for the explanation; it seems that "between" isn't equivalent to two
range filters for multivalued fields.

2016-08-24 8:19 GMT+03:00 Mikhail Khludnev :

> It executes both half-closed ranges first, and this is where the undesired
> first doc comes in: its value 1986-05-16 matches [1975-10-31T00:00:00.000Z TO *]
> and its value 1875-04-29 matches [* TO 1975-10-31T23:59:59.999Z], so the doc
> satisfies both clauses even though no single value falls inside the range.
> Then it intersects these document sets, and here again, the undesired first
> doc comes through.
>
> On Tue, Aug 23, 2016 at 5:15 PM, Iana Bondarska 
> wrote:
>
> > Hello Mikhail,
> > I convert filters that come from other part of application and in general
> > cannot combine many filters into one , since conditions can be quite
> > complex.
> > Could you please provide more details why is this expected behavior -
> > (p_happyDates:[1975-10-31T00:00:00.000Z+TO+*]+AND+p_
> > happyDates:[*+TO+1975-10-31T23:59:59.999Z]) is  AND filter with 2
> > conditions date>="1975-10-31T00:00:00.000Z" and  date<="1975-10-
> > 31T23:59:59.999Z" , seems that it should return same results that
> > =p_happyDates:[1975-10-31T00:00:00.000Z+TO+1975-10-31T23:59:59.999Z]
> >
> >
> >
> > 2016-08-23 15:00 GMT+03:00 Mikhail Khludnev :
> >
> > > Hello Iana,
> > >
> > > I consider is as expected behavior, perhaps usually it's done as
> > > =p_happyDates:[1975-10-31T00:00:00.000Z+TO+1975-10-
> 31T23:59:59.999Z],
> > > which is not equivalent to combining half closed ranges with boolean
> > query.
> > > I wonder why did you do like that?
> > >
> > > On Tue, Aug 23, 2016 at 2:33 PM, Iana Bondarska 
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > could you help me please with multiple range filters on multi valued
> > > > fields:
> > > > I have following dataset:
> > > > {
> > > > "p_happyDates":[
> > > > "1986-05-16T20:00:00Z",
> > > > "1875-04-29T21:57:56Z",
> > > > "1906-07-04T21:57:56Z"]
> > > > },
> > > > {
> > > > "p_happyDates":[
> > > > "1986-05-16T20:00:00Z",
> > > > "1975-10-31T21:57:56Z",
> > > > "1966-12-28T21:00:00Z"]
> > > > }
> > > > I apply filters:
> > > > =(p_happyDates:[1975-10-31T00:00:00.000Z+TO+*]+AND+p_
> > > > happyDates:[*+TO+1975-10-31T23:59:59.999Z])
> > > > I expect to see only second record.
> > > > Actually I see both records. Even if I add parameter q.op=AND -
> result
> > is
> > > > the same.
> > > > Is this expected behavior or known issue for multivalued fields?
> > > >
> > > > Best Regards,
> > > > Iana Bondarska
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Is it safe to upgrade an existing field to docvalues?

2016-08-25 Thread Toke Eskildsen
Ronald Wood  wrote:
> Did you find you had to do a full conversion all at once because simply 
> turning on
> docvalues in the schema caused issues?

Yes.

> I ask because my presupposition has been that we could turn it on without any
> harm as we incrementally converted our indexes.

If you don't use the field for any queries until all the values have been 
rebuilt, I guess that would work. The I-am-not-so-sure part is how the merger 
handles the case of a field having DocValues in one segment and not in another.

But mixing docValued & non-docValued segments with a schema that says DocValues 
will make DocValue-using queries fail, as you seem to have encountered:

> But this won’t work if enabling docvalues in the schema will lead to errors 
> when
> fields don’t have docvalues actually populated. I.e.. the 
> “IllegalStateException:
> unexpected docvalues type NONE for field 'id' (expected=SORTED)” error I see.

> I’m still trying to get to the bottom of whether that error means I cannot 
> safely do
> an incremental conversion in-place.

When you enable docValues in the Solr schema, Solr also uses that information 
when reading the data from the segments, so when the code detects the missing 
docValues in the segments themselves, it is already too far down the execution 
path to change strategy. Basically the contract (schema) is broken, so all bets 
are off.
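
For concreteness, the change that triggers all of this is just the schema
attribute, e.g. going from

<field name="id" type="string" indexed="true" stored="true"/>

to

<field name="id" type="string" indexed="true" stored="true" docValues="true"/>

and then re-indexing. Until every segment has been rewritten, any query path
that reads docValues on that field (sorting, faceting, grouping) can hit the
exception quoted above.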

Your gradual enabling would work, at least for faceting, if it was possible to 
force the selection code to only use the indexed values, but the current code 
does not have an option for forcing the use of the indexed values. You could add 
it as a (small) hack, if you are comfortable with that. I don't know how easy 
or hard it would be to hack grouping.

- Toke Eskildsen


Re: Is it safe to upgrade an existing field to docvalues?

2016-08-25 Thread Alessandro Benedetti
> switching is done in Solr on field.hasDocValues. The code would be amended
> to (field.hasDocValues && field.useDocValues) throughout.
>

This is correct. Currently we use DocValues if they are available, and to
check the availability we check the schema attribute.
This can be problematic in the scenarios you described ( for example half
the index has docValues for a field and the other half not yet ).

Your proposal is interesting.
Technically it should work and should allow transparent migration from not
docValues to docValues.
But it is a risky one, because we are decreasing the readability a bit
(although a user will specify the attribute only in special cases like
yours).

The only problem I see is that the same discussion we had for docValues
actually applies to all other invasive schema changes :
1) you change the field type
2) you enable or disable term vectors
3) you enable/disable term positions, offsets, etc.

So basically this is actually a general problem that would probably
require a general re-think.
So although it can be a quick fix that will work, I fear it can open the road to
messy configuration attributes.

Cheers
-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Use function in condition

2016-08-25 Thread nabil Kouici
Hi Emir,

Thank you for your reply. I've tested the function range query and it solves 
50% of my need. The problem is I'm not able to use it with other conditions. 
For example:
fq={!frange l=100}sum(field1,field2)  and field3:200

or
fq=({!frange l=100}sum(field1,field2))  and (field3:200)

This gives me an exception: org.apache.solr.search.SyntaxError: Unexpected 
text after function: AND Field3:200
I know that I can use multiple fq, but the problem is that I can have complex 
filters like (cond1 OR cond2 AND cond3).
Could you please help.
Regards,
Nabil.

  From: Emir Arnautovic 
 To: solr-user@lucene.apache.org 
 Sent: Wednesday, 17 August 2016 at 17:08
 Subject: Re: Use function in condition
   
Hi Nabil,

You can use frange queries, e.g. you can use fq={!frange 
l=100}sum(field1,field2) to filter docs with a sum greater than 100.

Regards,
Emir


On 17.08.2016 16:26, nabil Kouici wrote:
> Hi,
> Is it possible to use functions (function query 
> https://cwiki.apache.org/confluence/display/solr/Function+Queries) in q or fq 
> parameters to build a complex search expression?
> For example, take only documents where sum(field1,field2) > 100. Another 
> example: if(test,value1,value2):value3
> Regards,
> Nabil.

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



  

Sorting non-english text

2016-08-25 Thread Vasu Y
Hi,
I have a text field which can contain values (multiple tokens) in English;
to support sorting, I had a <copyField> in schema.xml to copy this to a new
field of type "lowercase" (defined as below).
I also have text fields of type text_de, text_es, text_fr, ja, cn etc. I
intend to add a <copyField> for each, to copy them to a new field of type
"lowercase" to support sorting.

Would this "lowercase" field type work well for sorting non-English fields
that are non-tokenized (or are single-term) or do you suggest to use a
different tokenizer & filter?

<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


Thanks,
Vasu


Search Configurations Merchandising tool

2016-08-25 Thread Srinivasa Meenavalli
Hi,

Is there any search merchandising tool available for Solr, similar to Endeca 
Experience Manager, for managing synonyms, protwords, keyword redirects, 
template management etc.? Are there any plans to develop one within Solr?


Regards
Srinivas Meenavalli
Disclaimer: The contents of this e-mail and attachment(s) thereto are 
confidential and intended for the named recipient(s) only. It shall not attach 
any liability on the originator or Zensar Technologies Limited or its 
affiliates. Any views or opinions presented in this email are solely those of 
the author and may not necessarily reflect the opinions of Zensar Technologies 
Limited or its affiliates. Any form of reproduction, dissemination, copying, 
disclosure, modification, distribution and / or publication of this message 
without the prior written consent of the author of this e-mail is strictly 
prohibited. If you have received this email in error please delete it and 
notify the sender immediately. Before opening any mail and attachments please 
check them for viruses and defect. Zensar Technologies Ltd or its affiliate do 
not accept any liability for virus infected mails.


Re: Is it safe to upgrade an existing field to docvalues?

2016-08-25 Thread Toke Eskildsen
Alessandro Benedetti  wrote:
> So basically, using your tool you build a copy of the index (similar to
> what optimize does) without affecting the main index, right?

Yes. Your step-by-step is spot on.

In the end we re-indexed everything, because there were other issues with the 
index we wanted to fix, so DVEnabler is very much a limited implementation. I 
am sure that the conversion code can be made a lot faster. 

I could envision a better tool that just used a source index and a destination 
schema and did the conversion with no setup-fuss. One could also add other 
usable conversions besides DV-switching: multi-valued fields that always hold 
one value per doc could be changed to true single-value fields, and vice versa; 
string fields holding numerics could be converted to true numerics, etc. Most of the data is 
already in the index, so it is "just" a question of re-packing it.

> This is actually a useful tool when re-indexing could be extremely long.

Thank you.


Important note: I stated that I created that tool, which I apologize for. 
Thomas Egense and I wrote it jointly.

- Toke Eskildsen