RE: Speculation on Memory needed to efficiently run a Solr Instance.

2016-01-15 Thread Gian Maria Ricci - aka Alkampfer
Thanks a lot, I'll have a look at Sematext SPM.

Actually the index is not static, but the number of new documents will be
small and they will probably be indexed during the night, so I'm not
expecting too many problems from the merge factor. We can index new documents
during the night and then optimize the index (during the night there are no
searches).

--
Gian Maria Ricci
Cell: +39 320 0136949



-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com] 
Sent: venerdì 15 gennaio 2016 11:06
To: solr-user@lucene.apache.org
Subject: Re: Speculation on Memory needed to efficiently run a Solr Instance.

Hi,
OS does not care much about search vs. retrieval, so the amount of RAM needed for
file caches depends on your index usage patterns. If you are not
retrieving stored fields much and most/all results are only
id+score, then it can be assumed that you can go with less RAM than the
actual index size. In such a case you can question whether you need stored fields
in the index at all. Also, if your index/usage pattern is such that only a small subset of
documents is retrieved with stored fields, then it can also be assumed the OS
will never need to cache the entire .fdt file.
One thing that you forgot (unless your index is static) is segment merging -
in the worst case the system will have two "copies" of the index, and having extra memory
can help in such cases.
The best approach is to use some tool and monitor IO and memory metrics. 
One such tool is Sematext's SPM (http://sematext.com/spm) where you can see
metrics for both system and SOLR.

Thanks,
Emir

On 15.01.2016 10:43, Gian Maria Ricci - aka Alkampfer wrote:
>
> Hi,
>
> When it is time to calculate how much RAM a solr instance needs to run 
> with good performance, I know that it is some form of art, but I’m 
> looking at a general “formula” to have at least one good starting point.
>
> Apart the RAM devoted to Java HEAP, that is strongly dependant on how 
> I configure caches, and the distribution of queries in my system, I’m 
> particularly interested in the amount of RAM to leave to operating 
> system to use File Cache.
>
> Suppose I have an index of 51 Gb of dimension, clearly having that 
> amount of ram devoted to the OS is the best approach, so all index 
> files can be cached into memory by the OS, thus I can achieve maximum 
> speed.
>
> But if I look at the detail of the index, in this particular example I 
> see that the bigger file has .fdt extension, it is the stored field 
> for the documents, so it affects retrieval of document data, not the 
> real search process. Since this file is 24 GB of size, it is almost
> half of the space of the index.
>
> My question is: it could be safe to assume that a good starting point 
> for the amount of RAM to leave to the OS is the dimension of the index 
> less the dimension of the .fdt file because it has less importance in 
> the search process?
>
> Are there any particular setting at OS level (CentOS linux) to have 
> maximum benefit from OS file cache? (documentation at 
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Produc
> tion#TakingSolrtoProduction-MemoryandGCSettingsdoes
> not have any information related to OS configuration). Elasticsearch
> (https://www.elastic.co/guide/en/elasticsearch/reference/1.4/setup-con
> figuration.html) generally have some suggestions such as using 
> mlockall, disable swap etc etc, I wonder if there are similar 
> suggestions for solr.
>
> Many thanks for all the great help you are giving me in this mailing 
> list.
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr
& Elasticsearch Support * http://sematext.com/
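
For the CentOS-level tuning asked about in this thread, a minimal sketch of the
checks and settings commonly recommended for search workloads (values are
illustrative, not official Solr guidance):

    # See how much RAM the kernel already uses as page cache ("cached" / "buff/cache")
    free -h

    # Discourage the kernel from swapping out the Solr JVM
    sudo sysctl -w vm.swappiness=1
    echo 'vm.swappiness=1' | sudo tee -a /etc/sysctl.conf

    # Or disable swap entirely, the same advice the Elasticsearch docs give
    sudo swapoff -a

As far as I know there is no mlockall switch in Solr itself; keeping the heap
modest and leaving the remaining RAM to the OS page cache achieves much the
same effect.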



Re: Issue in custom filter

2016-01-15 Thread Ahmet Arslan
Hi Simitha,

Please try below :

  final String term = charTermAttr.toString();
  final String s = convertedTerm = Converter.convert(term);

  // If not changed, don't waste the time adjusting the token.
  if ((s != null) && !s.equals(term))
      charTermAttr.setEmpty().append(s);



Ahmet

On Friday, January 15, 2016 11:59 AM, Smitha Rajiv  
wrote:



Hi

I have a requirement such that, while indexing, if a token contains numbers,
it needs to be converted into the corresponding words.

e.g.: term1 part 2 assignments -> termone part two assignments.

I have created a custom filter with following code:

@Override
public boolean incrementToken() throws IOException {
    if (!input.incrementToken())
        return false;
    char[] buffer = charTermAttr.buffer();
    String newTerm = new String(buffer);
    convertedTerm = Converter.convert(newTerm);
    charTermAttr.setEmpty();
    charTermAttr.copyBuffer(convertedTerm.toCharArray(), 0, convertedTerm.length());
    return true;
}
But it gives weird results when I analyze.

After applying the custom filter I am getting the result as
termone partone twoartone assignments.

It looks like the buffer length which I am setting for the first token is
not getting reset while picking up the next token. I have a feeling that
somewhere I am messing up with the offsets.

Could you please help me with this.

Thanks & Regards,
Smitha


Re: Position increment in WordDelimiterFilter.

2016-01-15 Thread Emir Arnautovic
Can you please send us tokens you get (and positions) when you analyze 
*WiFi device*


On 15.01.2016 13:15, Modassar Ather wrote:

Are you saying that WiFi Wi-Fi and Wi Fi should not match each other?
I am using WhiteSpaceTokenizer in my analysis chain so wi fi becomes two
different token. Please refer to my examples given in previous mail about
the issues faced.
Wi Fi are two term which will match but what happens if for a content
having *WiFi device* is searched with *"WiFi device"*. It will not match as
there is a position increment by WordDelimiterFilter for WiFi.
"WiFi device"~1 will match which is confusing that there is no gap in the
content why a slop is required.

Why do you use WordDelimiterFilter? Can you give us few examples where it
is useful?
It is useful when a word like* lucene-search documentation *is indexed with
WordDelimiterFilter and it is broken in two terms like lucene and search
then it will be helpful to get the documents containing it for queries like
lucene documentation or search documentation.

Best,
Modassar

On Fri, Jan 15, 2016 at 2:14 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Modassar,
Are you saying that WiFi Wi-Fi and Wi Fi should not match each other? Why
do you use WordDelimiterFilter? Can you give us few examples where it is
useful?

Thanks,
Emir


On 15.01.2016 05:13, Modassar Ather wrote:


Thanks for your responses.

It seems to me that you don't want to split on numbers.
It is not with number only. Even if you try to analyze WiFi it will create
4 token one of which will be at position 2. So basically the issue is with
position increment which causes few of the queries behave unexpectedly.

Which release of Solr are you using?
I am using Lucene/Solr-5.4.0.

Best,
Modassar

On Thu, Jan 14, 2016 at 9:44 PM, Jack Krupansky 
wrote:

Hi,

I have following definition for WordDelimiterFilter.



The analysis of 3d shows following four tokens and their positions.

token position
3d 1
3   1
3d 1
d   2

Please help me understand why d is at 2? Should not it also be at


position


1.
Is it a bug and if not is there any attribute which I can use to
restrict
the position increment?

Thanks,
Modassar



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Speculation on Memory needed to efficiently run a Solr Instance.

2016-01-15 Thread Emir Arnautovic

Hi,
OS does not care much about search vs. retrieval, so the amount of RAM needed
for file caches depends on your index usage patterns. If you are
not retrieving stored fields much and most/all results are only
id+score, then it can be assumed that you can go with less RAM than the
actual index size. In such a case you can question whether you need stored
fields in the index at all. Also, if your index/usage pattern is such that only a
small subset of documents is retrieved with stored fields, then it can
also be assumed the OS will never need to cache the entire .fdt file.
One thing that you forgot (unless your index is static) is segment
merging - in the worst case the system will have two "copies" of the index, and
having extra memory can help in such cases.
The best approach is to use some tool and monitor IO and memory metrics. 
One such tool is Sematext's SPM (http://sematext.com/spm) where you can 
see metrics for both system and SOLR.


Thanks,
Emir

On 15.01.2016 10:43, Gian Maria Ricci - aka Alkampfer wrote:


Hi,

When it is time to calculate how much RAM a solr instance needs to run 
with good performance, I know that it is some form of art, but I’m 
looking at a general “formula” to have at least one good starting point.


Apart from the RAM devoted to the Java heap, which strongly depends on how 
I configure caches and on the distribution of queries in my system, I’m 
particularly interested in the amount of RAM to leave to the operating 
system to use as file cache.


Suppose I have an index of 51 GB; clearly, having that 
amount of RAM available to the OS is the best approach, so all index 
files can be cached in memory by the OS and I can achieve maximum 
speed.


But if I look at the detail of the index, in this particular example I 
see that the bigger file has .fdt extension, it is the stored field 
for the documents, so it affects retrieval of document data, not the 
real search process. Since this file is 24 GB of size, it is almost 
half of the space of the index.


My question is: could it be safe to assume that a good starting point 
for the amount of RAM to leave to the OS is the size of the index 
minus the size of the .fdt file, because the latter has less importance in 
the search process?


Are there any particular settings at OS level (CentOS Linux) to get the 
maximum benefit from the OS file cache? (The documentation at 
https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-MemoryandGCSettings 
does not have any information related to OS configuration.) Elasticsearch 
(https://www.elastic.co/guide/en/elasticsearch/reference/1.4/setup-configuration.html) 
generally has some suggestions such as using mlockall, disabling swap, 
etc.; I wonder if there are similar suggestions for Solr.


Many thanks for all the great help you are giving me in this mailing 
list.


--
Gian Maria Ricci
Cell: +39 320 0136949





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Can we create multiple cluster in single Zookeeper instance

2016-01-15 Thread Mugeesh Husain
Thanks Shawn B. for the opinion.

Actually I have a question: if I use a single ZooKeeper,

suppose I have 3 clusters and each cluster uses a ZooKeeper instance (only
one ZK).

How will we manage ZK in such a way that the clusters do not communicate with
each other?

If you still need any clarification I can draw a diagram for this.

 
Shawn Heisey-2 wrote
> On 1/14/2016 10:22 AM, Mugeesh Husain wrote:
>> I have  a question i want to create 2-3 cluster using solrlcoud using
>> single
>> zookeeper instance, it is possible ?
> 
> Yes, if you use a chroot on the zkHost parameter for each collection.
> 
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-ZooKeeperchroot
> 
> I would recommend *always* using a chroot, even if you don't have
> multiple clusters.
> 
> When you talk about a single zookeeper instance, be aware that if you
> only have one zookeeper host, there is no redundancy.  You must have at
> least three zookeeper hosts.
> 
> Thanks,
> Shawn







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-we-create-multiple-cluster-in-single-Zookeeper-instance-tp4250791p4250982.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing rich data with solr 5.3

2016-01-15 Thread kostali hassan
thank you Erik for your precious advice.

2016-01-14 17:24 GMT+00:00 Erik Hatcher :

> And also, bin/post can be your friend when it comes to troubleshooting or
> introspecting Tika parsing via /update/extract.  Like this:
>
> $ bin/post -c test -params "extractOnly=true&wt=ruby&indent=yes" -out yes
> docs/SYSTEM_REQUIREMENTS.html
> java -classpath /Users/erikhatcher/solr-5.3.0/dist/solr-core-5.3.0.jar
> -Dauto=yes -Dparams=extractOnly=true&wt=ruby&indent=yes -Dout=yes -Dc=test
> -Ddata=files org.apache.solr.util.SimplePostTool
> /Users/erikhatcher/solr-5.3.0/docs/SYSTEM_REQUIREMENTS.html
> SimplePostTool version 5.0.0
> Posting files to [base] url
> http://localhost:8983/solr/test/update?extractOnly=true&wt=ruby&indent=yes.
> ..
> Entering auto mode. File endings considered are
> xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file SYSTEM_REQUIREMENTS.html (text/html) to [base]/extract
> {
>   'responseHeader'=>{
> 'status'=>0,
> 'QTime'=>3},
>   ''=>'
> http://www.w3.org/1999/xhtml;>
> 
> 
>- from
> https://lucidworks.com/blog/2015/08/04/solr-5-new-binpost-utility/
>
> But I also recommend having the Tika desktop app handy, in which you can
> drag and drop a file and see the gory details of how it parses the file.
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
>
>
> > On Jan 14, 2016, at 10:55 AM, Erick Erickson 
> wrote:
> >
> > No good way except to try them. For getting details on Tika parsing
> > failures, I much prefer the SolrJ process that the link I sent you
> > outlines.
> >
> > Best,
> > Erick
> >
> > On Thu, Jan 14, 2016 at 7:52 AM, kostali hassan
> >  wrote:
> >> thank you Eric I have prb with this files; last question how to define
> or
> >> get the list of files cant be indexing or bad files.
> >>
> >>
> >>>
> >>>
> >>>
> >>>
>
>
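
For troubleshooting a single problem file, a roughly equivalent extract-only
request can also be sent straight to the /update/extract handler with curl
(the collection name test and the file path are just examples):

    curl "http://localhost:8983/solr/test/update/extract?extractOnly=true&wt=json&indent=true" \
      -F "myfile=@docs/SYSTEM_REQUIREMENTS.html"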


Re: SolR 5.3.1 deletes index files

2016-01-15 Thread Moll, Dr. Andreas
Hi,

we still have the problem that SolR deletes index files on closing the
application if the index was changed in the meantime from the production
application (which has an embedded SolR-Server).
The problem also occurs if we use a local file system instead of a NFS.
I have changed the loglevel to DEBUG and got some interesting lines, especially:

1140211 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class
org.apache.lucene.index.MergeTrigger from WebAppClassLoader=174573182@a67c67e

Why does SolR try a merge action on closing the core, if this SolR instance
did not write any changes to the index?

I have also tried to add the following lines to the solrconfig, but it
didn't change anything:


  1000
  1000
  1000DAY
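
(For reference, a sketch of the full block with those values, assuming the
three settings above are maxCommitsToKeep, maxOptimizedCommitsToKeep and
maxCommitAge of a SolrDeletionPolicy:)

    <deletionPolicy class="solr.SolrDeletionPolicy">
      <str name="maxCommitsToKeep">1000</str>
      <str name="maxOptimizedCommitsToKeep">1000</str>
      <str name="maxCommitAge">1000DAY</str>
    </deletionPolicy>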


Below you find the log and an ls output before and after shutting down SolR.
Maybe someone can help us along?

Thanks and best regards

Andreas Moll

1140203 INFO  (Thread-0) [   x:recht] o.a.s.c.SolrCore [recht]  CLOSING 
SolrCore org.apache.solr.core.SolrCore@7cb0d12c
1140206 INFO  (Thread-0) [   x:recht] o.a.s.u.UpdateHandler closing 
DirectUpdateHandler2{commits=0,autocommit maxTime=15000ms,autocommits=0,soft 
autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
1140207 INFO  (Thread-0) [   x:recht] o.a.s.u.SolrCoreState Closing 
SolrCoreState
1140207 INFO  (Thread-0) [   x:recht] o.a.s.u.DefaultSolrCoreState 
SolrCoreState ref count has reached 0 - closing IndexWriter
1140207 INFO  (Thread-0) [   x:recht] o.a.s.u.DefaultSolrCoreState closing 
IndexWriter with IndexWriterCloser
1140208 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class 
org.apache.solr.request.LocalSolrQueryRequest from 
WebAppClassLoader=174573182@a67c67e
1140209 DEBUG (Thread-0) [   x:recht] o.a.s.u.SolrIndexWriter Closing Writer 
DirectUpdateHandler2
1140211 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class 
org.apache.lucene.index.MergeTrigger from WebAppClassLoader=174573182@a67c67e
1140213 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class 
org.apache.lucene.store.FileSwitchDirectory from 
WebAppClassLoader=174573182@a67c67e
1140236 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class 
java.nio.file.FileSystem
1140236 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class 
java.nio.file.FileSystem from null
1140238 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class 
java.nio.file.FileStore
1140238 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class 
java.nio.file.FileStore from null
1140259 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class 
java.nio.file.attribute.FileTime
1140259 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded class 
java.nio.file.attribute.FileTime from null
1140518 DEBUG (Thread-0) [   x:recht] o.a.s.c.CachingDirectoryFactory Releasing 
directory: 
/mnt/solr/jpprodt1/abzug/prodman_solrhome/solr_recht/core_recht/data/index 1 
false
1140520 INFO  (Thread-0) [   x:recht] o.a.s.c.SolrCore [recht] Closing main 
searcher on request.
1140521 DEBUG (Thread-0) [   x:recht] o.a.s.s.SolrIndexSearcher Closing 
Searcher@664499[recht] main

fieldValueCache{lookups=0,hits=0,hitratio=0.0,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.0,cumulative_inserts=0,cumulative_evictions=0}

filterCache{lookups=0,hits=0,hitratio=0.0,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.0,cumulative_inserts=0,cumulative_evictions=0}

queryResultCache{lookups=0,hits=0,hitratio=0.0,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.0,cumulative_inserts=0,cumulative_evictions=0}

documentCache{lookups=0,hits=0,hitratio=0.0,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.0,cumulative_inserts=0,cumulative_evictions=0}

perSegFilter{lookups=0,hits=0,hitratio=0.0,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.0,cumulative_inserts=0,cumulative_evictions=0}
1140534 DEBUG (Thread-0) [   x:recht] o.a.s.c.CachingDirectoryFactory Releasing 
directory: 
/mnt/solr/jpprodt1/abzug/prodman_solrhome/solr_recht/core_recht/data/index 0 
false
1140535 INFO  (Thread-0) [   x:recht] o.a.s.c.CachingDirectoryFactory Closing 
NRTCachingDirectoryFactory - 2 directories currently being tracked
1140536 DEBUG (Thread-0) [   x:recht] o.a.s.c.CachingDirectoryFactory Closing 
NRTCachingDirectoryFactory - currently tracking: 

Re: Position increment in WordDelimiterFilter.

2016-01-15 Thread Modassar Ather
Are you saying that WiFi Wi-Fi and Wi Fi should not match each other?
I am using WhitespaceTokenizer in my analysis chain, so wi fi becomes two
different tokens. Please refer to my examples given in the previous mail about
the issues faced.
Wi Fi are two terms which will match, but what happens if content
containing *WiFi device* is searched with *"WiFi device"*? It will not match, as
there is a position increment by WordDelimiterFilter for WiFi.
"WiFi device"~1 will match, which is confusing: there is no gap in the
content, so why is a slop required?

Why do you use WordDelimiterFilter? Can you give us few examples where it
is useful?
It is useful when a word like *lucene-search documentation* is indexed with
WordDelimiterFilter and is broken into two terms like lucene and search;
then it helps to get the documents containing it for queries like
lucene documentation or search documentation.

Best,
Modassar

On Fri, Jan 15, 2016 at 2:14 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Modassar,
> Are you saying that WiFi Wi-Fi and Wi Fi should not match each other? Why
> do you use WordDelimiterFilter? Can you give us few examples where it is
> useful?
>
> Thanks,
> Emir
>
>
> On 15.01.2016 05:13, Modassar Ather wrote:
>
>> Thanks for your responses.
>>
>> It seems to me that you don't want to split on numbers.
>> It is not with number only. Even if you try to analyze WiFi it will create
>> 4 token one of which will be at position 2. So basically the issue is with
>> position increment which causes few of the queries behave unexpectedly.
>>
>> Which release of Solr are you using?
>> I am using Lucene/Solr-5.4.0.
>>
>> Best,
>> Modassar
>>
>> On Thu, Jan 14, 2016 at 9:44 PM, Jack Krupansky > >
>> wrote:
>>
>> Which release of Solr are you using? Last year (or so) there was a Lucene
>>> change that had the effect of keeping all terms for WDF at the same
>>> position. There was also some discussion about whether this was either a
>>> bug or a bug fix, but I don't recall any resolution.
>>>
>>> -- Jack Krupansky
>>>
>>> On Thu, Jan 14, 2016 at 4:15 AM, Modassar Ather 
>>> wrote:
>>>
>>> Hi,

 I have following definition for WordDelimiterFilter.

 >>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
 catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>

 The analysis of 3d shows following four tokens and their positions.

 token position
 3d 1
 3   1
 3d 1
 d   2

 Please help me understand why d is at 2? Should not it also be at

>>> position
>>>
 1.
 Is it a bug and if not is there any attribute which I can use to
 restrict
 the position increment?

 Thanks,
 Modassar


> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
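
A filter definition consistent with the attributes quoted in this thread and
with the 3d analysis shown above would look roughly like the following
(generateWordParts="1" is an assumption, implied by the word part d in the
output):

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="1" preserveOriginal="1"/>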


Re: Classes in solr_home /lib cannot import from solr/dist

2016-01-15 Thread Callum Lamb
Good to know Solr already loads them, that removed a bunch of lines from my
solrconfig.xml.

Having to copy the required jars from dist/ to lib/ isn't ideal but if
that's the only solution then at least I can stop searching for a solution
and figure out how best to deal with this limitation.

I assume the reason for this is that the libs in solr.home/lib are
loaded at runtime? I don't know much about how this works in Java, but I'm
guessing Solr can access the classes in the jars but not the other way
around?

Thanks for your help guys.

On Thu, Jan 14, 2016 at 5:03 PM, Shawn Heisey  wrote:

> On 1/14/2016 5:36 AM, Callum Lamb wrote:
> > I've got an extension jar that contains a class which extends from
> >
> > org.apache.solr.handler.dataimport.DataSource
> >
> > But it only works if it's within the solr/dist folder. However when
> stored
> > in the lib/ folder within Solr home. When it tries to load the class it
> > cannot find it's parent:
> >
> > Exception in thread "Thread-69" java.lang.NoClassDefFoundError:
> > org/apache/solr/handler/dataimport/DataSource
> > at
> >
> org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:374)
> > at
> >
> org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:102)
> > Caused by: java.lang.ClassNotFoundException:
> > org.apache.solr.handler.dataimport.DataSource
> >
> > The classes in the lib folder don't have access to the class within the
> > dist folder in their classpath when they are loaded.
> >
> > I'd like the keep my solr install separate from my
> configs/plugins/indexes
> > so I want to avoid putting it into the dist folder unless I absolutely
> have
> > to.
>
> If you're going to put jars in $SOLR_HOME/lib, then you should *only*
> put jars in that directory, and NOT load jars explicitly.  The <lib>
> directives should not be used in solrconfig.xml when jars are loaded
> from this directory, because Solr will automatically load jars from this
> location and make them available to all cores.
>
> If moving all your extra jars (including things like the dataimport jar)
> to $SOLR_HOME/lib and taking out jar loading in solrconfig.xml doesn't
> help, then depending on the Solr version, you *might* be running into
> SOLR-6188.
>
> https://issues.apache.org/jira/browse/SOLR-6188
>
> You'll want to be sure that you don't load the same jar more than once.  This
> is the root of the specific problem that SOLR-6188 solves.  Loading the
> same jar more than once can also happen if the jar is in the lib
> directory AND mentioned on a <lib> config element.
>
> Thanks,
> Shawn
>
>




Re: Issue in custom filter

2016-01-15 Thread Smitha Rajiv
Thanks Ahmet. It worked.

As per your suggestion I have changed the code as below.

final String term=charTermAttr.toString();
final String convertedTerm = Converter.convert(term);
charTermAttr.setEmpty().append(convertedTerm);
return true;

Now the input stream "term1 part 2 assessment" gives the result
"termone part two assessment".

Thanks for your support.

Regards,
Smitha


On Fri, Jan 15, 2016 at 3:40 PM, Ahmet Arslan 
wrote:

> Hi Simitha,
>
> Please try below :
>
>   final String term = charTermAttr.toString();
>   final String s = convertedTerm = Converter.convert(term);
>
>   // If not changed, don't waste the time adjusting the token.
>   if ((s != null) && !s.equals(term))
>       charTermAttr.setEmpty().append(s);
>
>
>
> Ahmet
>
> On Friday, January 15, 2016 11:59 AM, Smitha Rajiv <
> smitharaji...@gmail.com> wrote:
>
>
>
> Hi
>
> I have a requirement such that while indexing if tokens contains numbers,
> it needs to be converted into corresponding words.
>
> e.g : term1 part 2 assignments -> termone part two assignments.
>
> I have created a custom filter with following code:
>
> @Override
> public boolean incrementToken() throws IOException {
> if (!input.incrementToken())
> return false;
> char[] buffer = charTermAttr.buffer();
> String newTerm = new String(buffer);
> convertedTerm = Converter.convert(newTerm);
> charTermAttr.setEmpty();
> charTermAttr.copyBuffer(convertedTerm.toCharArray(), 0,
>
> convertedTerm.length());
> return true;
>
> }
> But its given weird results when i analyze.
>
> After applying the custom filter i am getting the result as
> termone partone twoartone assignments.
>
> It looks like the buffer length which i am setting for the first token is
> not getting reset while picking up the next token.I have a feeling that
> somewhere i am messing up with the offsets.
>
> Could you please help me in this.
>
> Thanks & Regards,
> Smitha
>
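
Putting the fix from this thread together, a self-contained sketch of such a
filter might look like the code below; Converter.convert is the helper assumed
in the thread, and the class name is only illustrative.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class NumberToWordsFilter extends TokenFilter {
        private final CharTermAttribute charTermAttr = addAttribute(CharTermAttribute.class);

        public NumberToWordsFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // toString() honours the current term length; buffer() may still hold
            // trailing characters from a longer, previously processed token.
            final String term = charTermAttr.toString();
            final String converted = Converter.convert(term); // assumed helper from the thread
            if (converted != null && !converted.equals(term)) {
                charTermAttr.setEmpty().append(converted);
            }
            return true;
        }
    }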


Leader Election Time

2016-01-15 Thread Robert Brown

Hi,

I have 2 shards, 1 leader and 1 replica in each.

I've just removed a leader from one of the shards but the replica hasn't 
become a leader yet.


How quickly should this normally happen?

tickTime=2000
dataDir=/home/rob/zoodata
clientPort=2181
initLimit=5
syncLimit=2

Thanks,
Rob
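
One way to watch whether the remaining replica has taken over leadership is to
poll the Collections API and check the leader flag on each replica (host and
collection name are placeholders), and to confirm ZooKeeper itself is healthy:

    curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true"

    # ZooKeeper four-letter health check
    echo ruok | nc localhost 2181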



Query results change

2016-01-15 Thread Brian Narsi
We have an index of 25 fields. Currently number of records in index is
about 120,000. We are using

parser: edismax

qf: contains 8 fields

fq: 1 field

mm = 1

qs = 6

pf: containing 3 fields

bf: containing 1 field

We have noticed that sometimes results change between two searches even if
everything is constant.

What we have identified is that if we reindex the data and optimize, it remedies
the situation.

Is that expected behavior? Or should we also look into other factors?

Thanks


Re: Can we create multiple cluster in single Zookeeper instance

2016-01-15 Thread Shawn Heisey
On 1/15/2016 4:14 AM, Mugeesh Husain wrote:
> Actually i have a question , if i will use single zookeeper, 
>
> suppose I have a  3 cluster and each of cluster used zookeeper instance(only
> one zk).
>
> how we will manage zk in a way all of cluster will not communicate each
> other?

This is not the proper way to set things up.  Each of those three
clusters will have a single point of failure - the zookeeper server.  If
zookeeper goes down, you will not be able to change your index, and if
the clients are using CloudSolrClient, new clients will not be able to
connect to the cluster while zookeeper is down.  You haven't described
each cluster, so I do not know if there are more single points of failure.

The first thing you need when setting up SolrCloud is a fully redundant
three-node zookeeper ensemble.  Then you configure your SolrCloud
machines for your first cluster with a chroot on the zkHost parameter. 
You need a minimum of two Solr machines for redundancy.  The second
cluster gets configured with a different chroot on its zkHost, and the
third cluster gets another chroot.

Depending on how busy the servers are, you might be able to run both
Solr and Zookeeper on some of the machines.  If you have the extra
hardware, it's recommended to separate them.

Thanks,
Shawn
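
As a concrete sketch of the chroot approach (ZooKeeper host names and chroot
paths are placeholders):

    # Create one chroot per cluster (a one-time step)
    server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd makepath /cluster1
    server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd makepath /cluster2

    # Start each cluster's Solr nodes against its own chroot
    bin/solr start -cloud -z "zk1:2181,zk2:2181,zk3:2181/cluster1"
    bin/solr start -cloud -z "zk1:2181,zk2:2181,zk3:2181/cluster2"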



Re: Query results change

2016-01-15 Thread Binoy Dalal
You should try debugging such queries to see how exactly they're being
executed.
That will give you an idea as to why you're seeing the results you see.

On Fri, 15 Jan 2016, 19:05 Brian Narsi  wrote:

> We have an index of 25 fields. Currently number of records in index is
> about 120,000. We are using
>
> parser: edismax
>
> qf: contains 8 fields
>
> fq: 1 field
>
> mm = 1
>
> qs = 6
>
> pf: containing g 3 fields
>
> bf: containing 1 field
>
> We have noticed that sometimes results change between two searches even if
> everything is constant.
>
> What we have identified is if we reindex data and optimize it remedies the
> situation.
>
> Is that expected behavior? Or should we also look into other factors?
>
> Thanks
>
-- 
Regards,
Binoy Dalal


Re: Classes in solr_home /lib cannot import from solr/dist

2016-01-15 Thread Shawn Heisey
On 1/15/2016 5:36 AM, Callum Lamb wrote:
> Good to know Solr already loads them, that removed a bunch of lines from my
> solrconfig.xml.
>
> Having to copy the required jars from dist/ to lib/ isn't ideal but if
> that's the only solution then at least I can stop searching for a solution
> and figure out how best to deal with this limitation.
>
> I assume the reason for this is that the libs in solr.home.home/lib are
> loaded at runtime? I don't know much about how this works in Java but i'm
> guessing Solr can access the classes in the Jars but not the other way
> around?

Classloaders in Java are a complex topic that I do not fully understand.

The contents of the $SOLR_HOME/lib directory are loaded by the main Solr
classloader before any cores are started, and all of the cores that get
started afterwards are able to use those classes.  If the core-level
classloader chooses to load one of the jars a second time, then there
can be problems.

Rather than try and understand all the complexities of class loading, I
find it better to simply place all the jars in the one location that
Solr loads automatically and take the decision away from the individual
cores.  It makes everything easier.

Thanks,
Shawn
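
In practice that boils down to something like this (paths assume a stock 5.x
install and are only illustrative):

    # One shared lib directory under the Solr home, visible to every core
    mkdir -p /var/solr/data/lib
    cp /opt/solr/dist/solr-dataimporthandler-5.3.1.jar /var/solr/data/lib/
    cp /path/to/my-custom-datasource.jar /var/solr/data/lib/
    # ...then remove the matching <lib dir="..."/> directives from solrconfig.xml
    # and restart Solr so each jar is loaded exactly once.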



Re: SolR 5.3.1 deletes index files

2016-01-15 Thread Daniel Collins
Can I just clarify something.  The title of this thread implies Solr is
losing data when it shuts down which would be really bad(!)  The core isn't
deleting any data, it is performing a merge, so the data exists, just in
fewer larger segments instead of all the smaller segments you had before.

So the issue you have is why does the core do a merge on shutdown?  Now
that is a valid question, but in terms of risk, its a much lesser problem.
There is no data loss here, right??

I think what you need to do is investigate your MergePolicy configuration.
Sounds like you want NoMergePolicy, which I assume from the name means it
never merges. :)

On 15 January 2016 at 10:58, Moll, Dr. Andreas  wrote:

> Hi,
>
> we still have the problem that SolR deletes index files on closing the
> application if the index was changed in the meantime from the production
> application (which has an embedded SolR-Server).
> The problem also occurs if we use a local file system instead of a NFS.
> I have changed the loglevel to DEBUG and got some interesting lines,
> especially:
>
> 1140211 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class
> org.apache.lucene.index.MergeTrigger from
> WebAppClassLoader=174573182@a67c67e
>
> Why does SolR try a merge action on closing the core, if this SolR instance
> did not write any changes on the index?
>
> I have also tried to add the following lines to the solrconfig, but it
> didn't change anything:
>
> 
>   1000
>   1000
>   1000DAY
> 
>
> Below you find the log and an ls-Output before and after shuting down SolR.
> Maybe someone can help us along?
>
> Thanks and best regards
>
> Andreas Moll
>
> 1140203 INFO  (Thread-0) [   x:recht] o.a.s.c.SolrCore [recht]  CLOSING
> SolrCore org.apache.solr.core.SolrCore@7cb0d12c
> 1140206 INFO  (Thread-0) [   x:recht] o.a.s.u.UpdateHandler closing
> DirectUpdateHandler2{commits=0,autocommit
> maxTime=15000ms,autocommits=0,soft
> autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
> 1140207 INFO  (Thread-0) [   x:recht] o.a.s.u.SolrCoreState Closing
> SolrCoreState
> 1140207 INFO  (Thread-0) [   x:recht] o.a.s.u.DefaultSolrCoreState
> SolrCoreState ref count has reached 0 - closing IndexWriter
> 1140207 INFO  (Thread-0) [   x:recht] o.a.s.u.DefaultSolrCoreState closing
> IndexWriter with IndexWriterCloser
> 1140208 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class org.apache.solr.request.LocalSolrQueryRequest from
> WebAppClassLoader=174573182@a67c67e
> 1140209 DEBUG (Thread-0) [   x:recht] o.a.s.u.SolrIndexWriter Closing
> Writer DirectUpdateHandler2
> 1140211 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class org.apache.lucene.index.MergeTrigger from
> WebAppClassLoader=174573182@a67c67e
> 1140213 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class org.apache.lucene.store.FileSwitchDirectory from
> WebAppClassLoader=174573182@a67c67e
> 1140236 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class java.nio.file.FileSystem
> 1140236 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class java.nio.file.FileSystem from null
> 1140238 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class java.nio.file.FileStore
> 1140238 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class java.nio.file.FileStore from null
> 1140259 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class java.nio.file.attribute.FileTime
> 1140259 DEBUG (Thread-0) [   x:recht] o.e.j.w.WebAppClassLoader loaded
> class java.nio.file.attribute.FileTime from null
> 1140518 DEBUG (Thread-0) [   x:recht] o.a.s.c.CachingDirectoryFactory
> Releasing directory:
> /mnt/solr/jpprodt1/abzug/prodman_solrhome/solr_recht/core_recht/data/index
> 1 false
> 1140520 INFO  (Thread-0) [   x:recht] o.a.s.c.SolrCore [recht] Closing
> main searcher on request.
> 1140521 DEBUG (Thread-0) [   x:recht] o.a.s.s.SolrIndexSearcher Closing
> Searcher@664499[recht] main
>
> fieldValueCache{lookups=0,hits=0,hitratio=0.0,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.0,cumulative_inserts=0,cumulative_evictions=0}
>
> filterCache{lookups=0,hits=0,hitratio=0.0,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.0,cumulative_inserts=0,cumulative_evictions=0}
>
> queryResultCache{lookups=0,hits=0,hitratio=0.0,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.0,cumulative_inserts=0,cumulative_evictions=0}
>
> 

Re: SolR 5.3.1 deletes index files

2016-01-15 Thread Daniel Collins
I know Solr used to have issues with indexes on NFS, there was a
segments.gen file specifically for issues around that, though that was
removed in 5.0. But you say this happens on local disks too, so that would
rule NFS out of it.

I still think you should look at ensuring your merge policy is turned off
in solrconfig.xml (if I understand your scenario, you have 1 instance which
is read-only for searching, and another writing to the same index
location), and did you turn infoStream on as Erick suggested?
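
(For reference, infoStream is typically enabled like this in the solrconfig.xml
of a 5.x core; the low-level IndexWriter and merge activity then shows up in
the Solr log:)

    <indexConfig>
      <infoStream>true</infoStream>
    </indexConfig>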

2016-01-15 16:01 GMT+00:00 Moll, Dr. Andreas :

> Hi,
>
> If you look at the files at the ls-Output in my last post you will see
> that SolR has deleted the
> segments_f -file. Thus the index can no longer be loaded.
>
> I also had other cases in which the data directory of SolR was empty after
> the SolR shutdown.
>
> And yes, it ist bad.
>
> Best regards
>
> Andreas Moll
>
> >Can I just clarify something.  The title of this thread implies Solr is
> >losing data when it shuts down which would be really bad(!)
> >The core isn't
> >deleting any data, it is performing a merge, so the data exists, just in
> >fewer larger segments instead of all the smaller segments you had before.
>
>
>


RE: Pro and cons of using Solr Cloud vs standard Master Slave Replica

2016-01-15 Thread Davis, Daniel (NIH/NLM) [C]
In the multi-tenant model, SolrCloud shines because the configuration
directories need not include any details about the cluster. SolrCloud also
shines if the number of documents and/or the indexing rate requires sharding.

But master-slave with replica configuration is OK if you have just a couple of
related cores and their configuration isn't too dynamic. I know that in my
very old-school systems environment, getting all the ports/firewalls configured
right for SolrCloud and maintaining security is a bit hairy.

Hoping this helps,

-Original Message-
From: outlook_288fbf38c031d...@outlook.com 
[mailto:outlook_288fbf38c031d...@outlook.com] On Behalf Of Gian Maria Ricci - 
aka Alkampfer
Sent: Friday, January 15, 2016 3:26 AM
To: solr-user@lucene.apache.org
Subject: RE: Pro and cons of using Solr Cloud vs standard Master Slave Replica

Yes, I've checked that jira some weeks ago and it is the reason why I was 
saying that there is still no clear procedure to back up SolrCloud in the current 
latest version.  I'm glad that the priority is Major, but until it is 
closed in an official version, I have to tell customers that there is no 
easy and supported backup procedure for SolrCloud configuration :(.

--
Gian Maria Ricci
Cell: +39 320 0136949



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: giovedì 14 gennaio 2016 16:46
To: solr-user 
Subject: Re: Pro and cons of using Solr Cloud vs standard Master Slave Replica

re: SolrCloud backup/restore: https://issues.apache.org/jira/browse/SOLR-5750

not committed yet, but getting attention.



On Thu, Jan 14, 2016 at 6:19 AM, Gian Maria Ricci - aka Alkampfer 
 wrote:
> Actually there are situation where a restore is needed, suppose that someone 
> does some error and deletes all documents from a collection, or maybe deletes 
> a series of document, etc. I know that this is not likely to happen, but in 
> mission critical enterprise system, we always need a detailed procedure for 
> disaster recovering.
>
> For such scenario we need to plan the worst case, where everything is lost.
>
> With Master Slave is just a matter of recreating machines, reconfigure the 
> core, and restore a backup, and the game is done, with SolrCloud is not 
> really clear for me how can I backup / restore data. From what I've found in 
> the internet I need to backup every shard of the collection, and, if we need 
> to restore everything from a backup, we can recreate the collection and then 
> restore all the individual shards. I do not know if this is a supported 
> scenario / procedure, but theoretically it could work.
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
> -Original Message-
> From: Alessandro Benedetti [mailto:abenede...@apache.org]
> Sent: giovedì 14 gennaio 2016 10:46
> To: solr-user@lucene.apache.org
> Subject: Re: Pro and cons of using Solr Cloud vs standard Master Slave 
> Replica
>
> It's true that SolrCloud is adding some complexity.
> But few observations :
>
> SolrCloud has some disadvantages and can't beat the easiness and 
> simpleness
>> of
>> Master Slave Replica. So I can only encourage to keep Master Slave 
>> Replica in future versions.
>
>
> I agree, it can happen situations when you have really simple and not 
> critical systems.
> Anyway old style replication is still used in SolrCloud, so I think it is 
> going to stay for a while ( until is replaced with something else) .
>
> To answer to Gian :
>
> One of the problem I've found is that I've not found a simple way to 
> backup
>> the content of a collection to restore in situation of disaster recovery.
>> With simple master / slave scenario we can use the replication 
>> handler to generate backups that can be easily used to restore 
>> content of a core, while with SolrCloud is not clear how can we 
>> obtain a full backup
>
>
> To be fair, Disaster recovery is when SolrCloud shines.
> If you lose random nodes across your collection, you simply need to fix them 
> and spin up again .
> The system will automatically restore the content to the last version available 
> (the tlog first and the leader (if the tlog is not enough) will help 
> the dead node to catch up .
> If you lose all the replicas for a shard and you lose the content in disk of 
> all this replicas ( index and tlog), SolrCloud can't help you.
> For this unlikely scenarios a backup is suggested.
> You could restore anyway the backup only to one node, and the replicas are 
> going to catch up .
>
> Probably is just a matter of backupping every shard with standard
>> replication handler and then restore each shard after recreating the 
>> collection
>
>
> Definitely not, SolrCloud is there to avoid this manual stuff.
>
> Cheers
>
>
> On 14 January 2016 at 08:58, Gian Maria Ricci - aka Alkampfer < 
> alkamp...@nablasoft.com> wrote:
>
>> I agree that SolrCloud has not only advantages, I really understand 

Re: Pro and cons of using Solr Cloud vs standard Master Slave Replica

2016-01-15 Thread Jack Krupansky
Yeah, and to the original question, there is no master list of features and
how SolrCloud vs. legacy distributed mode compare feature by feature.

And until SolrCloud actually does subsume every single (important) feature
of legacy distributed mode, Solr probably still needs to continue to
support legacy distributed mode, including backup.

The doc does need better coverage of backup and restore at the cluster
level, including configuration files. What's there now is basically the old
single-node replication backup. What exactly is the recommended best
practice for backing up a single shard, let alone all shards. Should
backups be collection-based as well?


-- Jack Krupansky

On Fri, Jan 15, 2016 at 3:26 AM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> Yes, I've checked that jira some weeks ago and it is the reason why I was
> telling that there is still no clear procedure to backup SolrCloud in
> current latest version.  I'm glad that the priority is Major, but until it
> is not closed in an official version, I have to tell to customers that
> there is not easy and supported backup procedure for SolrCloud
> configuration :(.
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: giovedì 14 gennaio 2016 16:46
> To: solr-user 
> Subject: Re: Pro and cons of using Solr Cloud vs standard Master Slave
> Replica
>
> re: SolrCloud backup/restore:
> https://issues.apache.org/jira/browse/SOLR-5750
>
> not committed yet, but getting attention.
>
>
>
> On Thu, Jan 14, 2016 at 6:19 AM, Gian Maria Ricci - aka Alkampfer <
> alkamp...@nablasoft.com> wrote:
> > Actually there are situation where a restore is needed, suppose that
> someone does some error and deletes all documents from a collection, or
> maybe deletes a series of document, etc. I know that this is not likely to
> happen, but in mission critical enterprise system, we always need a
> detailed procedure for disaster recovering.
> >
> > For such scenario we need to plan the worst case, where everything is
> lost.
> >
> > With Master Slave is just a matter of recreating machines, reconfigure
> the core, and restore a backup, and the game is done, with SolrCloud is not
> really clear for me how can I backup / restore data. From what I've found
> in the internet I need to backup every shard of the collection, and, if we
> need to restore everything from a backup, we can recreate the collection
> and then restore all the individual shards. I do not know if this is a
> supported scenario / procedure, but theoretically it could work.
> >
> > --
> > Gian Maria Ricci
> > Cell: +39 320 0136949
> >
> >
> >
> > -Original Message-
> > From: Alessandro Benedetti [mailto:abenede...@apache.org]
> > Sent: giovedì 14 gennaio 2016 10:46
> > To: solr-user@lucene.apache.org
> > Subject: Re: Pro and cons of using Solr Cloud vs standard Master Slave
> > Replica
> >
> > It's true that SolrCloud is adding some complexity.
> > But few observations :
> >
> > SolrCloud has some disadvantages and can't beat the easiness and
> > simpleness
> >> of
> >> Master Slave Replica. So I can only encourage to keep Master Slave
> >> Replica in future versions.
> >
> >
> > I agree, it can happen situations when you have really simple and not
> critical systems.
> > Anyway old style replication is still used in SolrCloud, so I think it
> is going to stay for a while ( until is replaced with something else) .
> >
> > To answer to Gian :
> >
> > One of the problem I've found is that I've not found a simple way to
> > backup
> >> the content of a collection to restore in situation of disaster
> recovery.
> >> With simple master / slave scenario we can use the replication
> >> handler to generate backups that can be easily used to restore
> >> content of a core, while with SolrCloud is not clear how can we
> >> obtain a full backup
> >
> >
> > To be fair, Disaster recovery is when SolrCloud shines.
> > If you lose random nodes across your collection, you simply need to fix
> them and spin up again .
> > The system will automatically restore the content to the last version
> available (the tlog first and the leader (if the tlog is not enough)
> will help the dead node to catch up .
> > If you lose all the replicas for a shard and you lose the content in
> disk of all this replicas ( index and tlog), SolrCloud can't help you.
> > For this unlikely scenarios a backup is suggested.
> > You could restore anyway the backup only to one node, and the replicas
> are going to catch up .
> >
> > Probably is just a matter of backupping every shard with standard
> >> replication handler and then restore each shard after recreating the
> >> collection
> >
> >
> > Definitely not, SolrCloud is there to avoid this manual stuff.
> >
> > Cheers
> >
> >
> > On 14 January 2016 at 08:58, Gian Maria Ricci - aka Alkampfer <
> alkamp...@nablasoft.com> wrote:
> >
> >> 
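
Until SOLR-5750 lands, the workaround usually described is the per-core
replication-handler backup, run once against each shard leader; a rough sketch
(core names and paths are examples only):

    # Trigger a snapshot of one shard's index
    curl "http://host1:8983/solr/collection1_shard1_replica1/replication?command=backup&location=/backups/shard1&name=20160115"

    # Poll until the snapshot has completed
    curl "http://host1:8983/solr/collection1_shard1_replica1/replication?command=details&wt=json"

The configuration stored in ZooKeeper would still need to be saved separately,
for example with zkcli.sh downconfig.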

Re: Speculation on Memory needed to efficiently run a Solr Instance.

2016-01-15 Thread Erick Erickson
And to make matters worse, much worse (actually, better)...

See: https://issues.apache.org/jira/browse/SOLR-8220

That ticket (and there will be related ones) is about returning
data from DocValues fields rather than from the stored data
in some situations. Which means it will soon (I hope) be
entirely possible to not have an .fdt file at all. There are some
caveats to that approach, but it can completely bypass the
read-from-disk, decompress, return the data process.

Do note, however, that you can't have analyzed text be docValues
so this will be suitable only for string, numerics and the like fields.

Best,
Erick
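
A sketch of the kind of schema field SOLR-8220 is aimed at, where values could
be returned from docValues instead of the .fdt file (the field name is
illustrative and, as noted, this only applies to non-analyzed types):

    <field name="category" type="string" indexed="true" stored="false" docValues="true"/>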

On Fri, Jan 15, 2016 at 2:56 AM, Gian Maria Ricci - aka Alkampfer
 wrote:
> THanks a lot I'll have a look to Sematext SPM.
>
> Actually the index is not static, but the number of new documents will be
> small and probably they will be indexed during the night, so I'm not
> expecting too much problem from merge factor. We can index new document
> during the night and then optimize the index. (during night there are no
> searches).
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
> -Original Message-
> From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
> Sent: venerdì 15 gennaio 2016 11:06
> To: solr-user@lucene.apache.org
> Subject: Re: Speculation on Memory needed to efficiently run a Solr Instance.
>
> Hi,
> OS does not care much about search v.s. retrieve so amount of RAM needed for
> file caches would depend on your index usage patterns. If you are not
> retrieving stored fields much and most/all results are only
> id+score, than it can be assumed that you can go with less RAM than
> actual index size. In such case you can question if you need stored fields
> in index. Also if your index/usage pattern is such that only small subset of
> documents is retrieved with stored fields, than it can also be assumed it
> will never need to cache entire fdt file.
> One thing that you forgot (unless you index is static) is segments merging -
> in worst case system will have two "copies" of index and having extra memory
> can help in such cases.
> The best approach is to use some tool and monitor IO and memory metrics.
> One such tool is Sematext's SPM (http://sematext.com/spm) where you can see
> metrics for both system and SOLR.
>
> Thanks,
> Emir
>
> On 15.01.2016 10:43, Gian Maria Ricci - aka Alkampfer wrote:
>>
>> Hi,
>>
>> When it is time to calculate how much RAM a solr instance needs to run
>> with good performance, I know that it is some form of art, but I’m
>> looking at a general “formula” to have at least one good starting point.
>>
>> Apart the RAM devoted to Java HEAP, that is strongly dependant on how
>> I configure caches, and the distribution of queries in my system, I’m
>> particularly interested in the amount of RAM to leave to operating
>> system to use File Cache.
>>
>> Suppose I have an index of 51 Gb of dimension, clearly having that
>> amount of ram devoted to the OS is the best approach, so all index
>> files can be cached into memory by the OS, thus I can achieve maximum
>> speed.
>>
>> But if I look at the detail of the index, in this particular example I
>> see that the bigger file has .fdt extension, it is the stored field
>> for the documents, so it affects retrieval of document data, not the
>> real search process. Since this file is 24 GB of size, it is almost
>> half of the space of the index.
>>
>> My question is: it could be safe to assume that a good starting point
>> for the amount of RAM to leave to the OS is the dimension of the index
>> less the dimension of the .fdt file because it has less importance in
>> the search process?
>>
>> Are there any particular setting at OS level (CentOS linux) to have
>> maximum benefit from OS file cache? (documentation at
>> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Produc
>> tion#TakingSolrtoProduction-MemoryandGCSettingsdoes
>> not have any information related to OS configuration). Elasticsearch
>> (https://www.elastic.co/guide/en/elasticsearch/reference/1.4/setup-con
>> figuration.html) generally have some suggestions such as using
>> mlockall, disable swap etc etc, I wonder if there are similar
>> suggestions for solr.
>>
>> Many thanks for all the great help you are giving me in this mailing
>> list.
>>
>> --
>> Gian Maria Ricci
>> Cell: +39 320 0136949
>>

Re: SolR 5.3.1 deletes index files

2016-01-15 Thread Moll, Dr. Andreas
Hi,

If you look at the files in the ls output in my last post you will see that 
SolR has deleted the
segments_f file. Thus the index can no longer be loaded.

I also had other cases in which the data directory of SolR was empty after the 
SolR shutdown.

And yes, it is bad.

Best regards

Andreas Moll

>Can I just clarify something.  The title of this thread implies Solr is
>losing data when it shuts down which would be really bad(!)
>The core isn't
>deleting any data, it is performing a merge, so the data exists, just in
>fewer larger segments instead of all the smaller segments you had before.




Re: Query results change

2016-01-15 Thread Erick Erickson
Probably the fact that information from deleted/updated
documents is still hanging around in the corpus until
merged away.

The nub of the issue is that terms in deleted documents
(or the replaced doc if you update) still influence tf/idf
calculations. If you optimize as Binoy suggests, all of
the information relating to deleted docs is removed.
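
For reference (this is Lucene's classic TF-IDF formula, stated here as general
background rather than anything specific to your setup):
idf(t) = 1 + log(numDocs / (docFreq(t) + 1)). Both numDocs and docFreq still
count deleted documents until the segments holding them are merged away, which
is why scores can drift after a merge or an optimize.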

If this is a SolrCloud setup, you can be getting
scores from different replicas of the same shard. Due to
the fact that merging (which purges deleted information)
can occur at different times on different replicas, the scores
calculated for a particular doc might be different depending
on which replica calculated it.

In either setup (SolrCloud or not), background merging can
change the result order by removing information associated
with deleted docs.

All that said, does this have _practical_ consequences or
is this mostly a curiosity question?

Best,
Erick

On Fri, Jan 15, 2016 at 5:40 AM, Binoy Dalal  wrote:
> You should try debugging such queries to see how exactly they're being
> executed.
> That will give you an idea as to why you're seeing the results you see.
>
> On Fri, 15 Jan 2016, 19:05 Brian Narsi  wrote:
>
>> We have an index of 25 fields. Currently number of records in index is
>> about 120,000. We are using
>>
>> parser: edismax
>>
>> qf: contains 8 fields
>>
>> fq: 1 field
>>
>> mm = 1
>>
>> qs = 6
>>
>> pf: containing 3 fields
>>
>> bf: containing 1 field
>>
>> We have noticed that sometimes results change between two searches even if
>> everything is constant.
>>
>> What we have identified is if we reindex data and optimize it remedies the
>> situation.
>>
>> Is that expected behavior? Or should we also look into other factors?
>>
>> Thanks
>>
> --
> Regards,
> Binoy Dalal


Re: Speculation on Memory needed to efficently run a Solr Instance.

2016-01-15 Thread Jack Krupansky
Personally, I'll continue to recommend that the ideal goal is to fully
cache the entire Lucene index in system memory, as well as doing a proof of
concept implementation to validate actual performance for your actual data.
You can do a POC with a small fraction of your full data, like 15% or even
10%, and then it's fairly safe to simply multiply those numbers to get the
RAM needed for the full 100% of your data (or even 120% to allow for modest
growth).

Be careful about distinguishing search and query - sure, only a subset of
the data is needed to find the matching documents, but then the stored data
must be fetched to return the query results (search/lookup vs. query
results.) If the stored values are not also cached, you will increase the
latency of your overall query (returning results) even if the
search/match/lookup was reasonably fast.

So, the model is to prototype with a measured subset of your data, see how
the latency and system memory usage work out, and then scale that number up
for total memory requirement.
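
As a rough worked example (hypothetical numbers, just to illustrate the
scaling): if a POC loaded with 10% of the documents produces a 5 GB index and
performs well once those 5 GB are resident in the OS page cache, then plan on
roughly 10 x 5 GB = 50 GB of free RAM for the full data set, or about 60 GB if
you size for 20% growth, plus whatever heap Solr itself needs.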

Again to be clear, if you really do need the best/minimal overall query
latency, your best bet is to have sufficient system memory to fully cache
the entire index. If you actually don't need minimal latency, then of
course you can feel free to trade away some latency in exchange for less RAM.



-- Jack Krupansky

On Fri, Jan 15, 2016 at 4:43 AM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> Hi,
>
>
>
> When it is time to calculate how much RAM a solr instance needs to run
> with good performance, I know that it is some form of art, but I’m looking
> at a general “formula” to have at least one good starting point.
>
>
>
> Apart the RAM devoted to Java HEAP, that is strongly dependant on how I
> configure caches, and the distribution of queries in my system, I’m
> particularly interested in the amount of RAM to leave to operating system
> to use File Cache.
>
>
>
> Suppose I have an index of 51 Gb of dimension, clearly having that amount
> of ram devoted to the OS is the best approach, so all index files can be
> cached into memory by the OS, thus I can achieve maximum speed.
>
>
>
> But if I look at the detail of the index, in this particular example I see
> that the bigger file has .fdt extension, it is the stored field for the
> documents, so it affects retrieval of document data, not the real search
> process. Since this file is 24 GB of size, it is almost half of the space
> of the index.
>
>
>
> My question is: it could be safe to assume that a good starting point for
> the amount of RAM to leave to the OS is the dimension of the index less the
> dimension of the .fdt file because it has less importance in the search
> process?
>
>
>
> Are there any particular setting at OS level (CentOS linux) to have
> maximum benefit from OS file cache? (documentation at
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-MemoryandGCSettings
> does not have any information related to OS configuration). Elasticsearch (
> https://www.elastic.co/guide/en/elasticsearch/reference/1.4/setup-configuration.html)
> generally have some suggestions such as using mlockall, disable swap etc
> etc, I wonder if there are similar suggestions for solr.
>
>
>
> Many thanks for all the great help you are giving me in this mailing list.
>
>
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
>


Re: Speculation on Memory needed to efficently run a Solr Instance.

2016-01-15 Thread Toke Eskildsen
Jack Krupansky  wrote:

> Again to be clear, if you really do need the best/minimal overall query
> latency, your best bet is to have sufficient system memory to fully cache
> the entire index. If you actually don't need minimal latency, then of
> course you can feel free to trade away some latency in exchange for less RAM.

This bears repeating and I wish it would be added each time someone presents 
the "free cache = index size" rule of thumb. Thank you for stating it so 
clearly, Jack.

- Toke Eskildsen


Solr relevancy scoring issue

2016-01-15 Thread sara hajili
Hi all,
I have an issue with Solr scoring.
How does Solr scoring behave?
I mean, is it linear?


Re: Issue with stemming and lemmatizing

2016-01-15 Thread Jack Krupansky
Yes, you can do all of that, but... Solr is more of a toolkit rather than a
packaged solution, so you will have to plug together all the pieces yourself.
There are a variety of stemmers in Solr and any number of techniques for
how to index and query using the stemmed and unstemmed variants of words.

Plenty of doc for you to start reading. Once you get the basics, then you
can move on to more specific and advanced details:
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters
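
As a minimal sketch of one common way to keep both forms in the index (the
field type name here is made up; adapt it to your schema): a
KeywordRepeatFilter in front of the stemmer emits every token twice, the
stemmer leaves the copy marked as a keyword untouched, and
RemoveDuplicatesTokenFilter drops the pairs where stemming changed nothing:

  <fieldType name="text_en_stem" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordRepeatFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

With that in place the original token and its stemmed form sit at the same
position, so a query for the exact original word can still match it.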




-- Jack Krupansky

On Fri, Jan 15, 2016 at 2:58 PM, sara hajili  wrote:

> I want to write my own text tokenizer.
> My question is about how Solr treats stemming or lemmatizing.
> Does Solr store both the lemmatized token and the original token together?
> I mean, if at index time Solr lemmatizes "creation" to "create",
> and at query time the user wants to search for exactly "creation", not "create",
> how does Solr do that?
> If I lemmatize the query string "creation" to "create",
> then Solr finds all documents with "create", not "creation".
> How does Solr behave with a stemmer or lemmatizer? Does it index both the
> original and the lemmatized word?
>


Boost query vs function query in edismax query

2016-01-15 Thread sara hajili
Hi all, as I understood it,
both of them affect relevance scoring, but the boost query has a more
dominant effect on relevance scoring. Is that true?
I am willing to understand more about the difference between the two,
and to know what the best situation is for using each of them.
Tnx.


state.json base_url has internal IP of ec2 instance set instead of 'public DNS' entry in some cases

2016-01-15 Thread Brendan Grainger
Hi,

I am creating a new collection using the following get request:

http://ec2_host:8983/solr/admin/collections?action=CREATE=collection_name_1=oem/conf=1

What I’m finding is that now and then base_url for the replica in state.json is 
set to the internal IP of the AWS node. i.e.:

"base_url":"http://10.29.XXX.XX:8983/solr”,

On other attempts it’s set to the public DNS name of the node:

"base_url":"http://ec2_host:8983/solr”,

In my /etc/defaults/solr.in.sh I have:

SOLR_HOST=“ec2_host”

which I thought is what I needed to get the public DNS name set in base_url. 

Am I doing this incorrectly or is there something I’m missing here? The issue 
this causes is zookeeper gives back an internal IP to my indexing processes 
when the internal IP is set on base_url and they then can’t find the server. 

Thanks!






Re: collapse filter query

2016-01-15 Thread sara hajili
Tnx Joel.
I wanted to get distinct results from Solr, so I found two approaches: the
collapse filter and facets.
More Like This doesn't support facets, and as you said Solr 5.3 has a bug in
the collapse filter.
If I don't want to migrate to Solr 5.4,
is there any other approach to get distinct values that I can use in Solr 5.3.1?
On Jan 12, 2016 1:39 AM, "Joel Bernstein"  wrote:

> I went to go work on the issue and found it was already fixed 7 weeks ago.
> The bug fix is available in Solr 5.4.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Jan 11, 2016 at 3:12 PM, Susheel Kumar 
> wrote:
>
> > You can go to https://issues.apache.org/jira/browse/SOLR/ and create
> Jira
> > ticket after signing in.
> >
> > Thanks,
> > Susheel
> >
> > On Mon, Jan 11, 2016 at 2:15 PM, sara hajili 
> > wrote:
> >
> > > Tnx.How I can create a jira ticket?
> > > On Jan 11, 2016 10:42 PM, "Joel Bernstein"  wrote:
> > >
> > > > I believe this is a bug. I think the reason this is occurring is that
> > you
> > > > have an index segment with no values at all in the collapse field. If
> > you
> > > > could create a jira ticket for this I will look at resolving the
> issue.
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Mon, Jan 11, 2016 at 2:03 PM, sara hajili 
> > > > wrote:
> > > >
> > > > > I am using solr 5.3.1
> > > > > On Jan 11, 2016 10:30 PM, "Joel Bernstein" 
> > wrote:
> > > > >
> > > > > > Which version of Solr are you using?
> > > > > >
> > > > > > Joel Bernstein
> > > > > > http://joelsolr.blogspot.com/
> > > > > >
> > > > > > On Mon, Jan 11, 2016 at 6:39 AM, sara hajili <
> > hajili.s...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > hi all
> > > > > > > i have a MLT query and i wanna to use collapse filter query.
> > > > > > > and i wanna to use collapse expand nullPolicy.
> > > > > > > in this way when i used it :
> > > > > > > {!collapse field=original_post_id nullPolicy=expand}
> > > > > > > i got my appropriate result .
> > > > > > > (in solr web UI)
> > > > > > >
> > > > > > > but in regular search handler "/select",when i used
> > > > > > > {!collapse field=original_post_id nullPolicy=expand}
> > > > > > > i got error:
> > > > > > >
> > > > > > > {
> > > > > > >   "responseHeader":{
> > > > > > > "status":500,
> > > > > > > "QTime":2,
> > > > > > > "params":{
> > > > > > >   "q":"*:*",
> > > > > > >   "indent":"true",
> > > > > > >   "fq":"{!collapse field=original_post_id
> > nullPolicy=expand}",
> > > > > > >   "wt":"json"}},
> > > > > > >   "error":{
> > > > > > > "trace":"java.lang.NullPointerException\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.CollapsingQParserPlugin$IntScoreCollector.finish(CollapsingQParserPlugin.java:763)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:211)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1678)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1497)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:555)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:522)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:277)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat
> > > > > > > org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)\n\tat
> > > > > > >
> > > > >
> > >
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)\n\tat
> > > > > > >
> > > >
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)\n\tat
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)\n\tat
> > > > > > >
> > > > > > >
> 

Issue with stemming and lemmatizing

2016-01-15 Thread sara hajili
I want to write my own text tokenizer.
My question is about how Solr treats stemming or lemmatizing.
Does Solr store both the lemmatized token and the original token together?
I mean, if at index time Solr lemmatizes "creation" to "create",
and at query time the user wants to search for exactly "creation", not "create",
how does Solr do that?
If I lemmatize the query string "creation" to "create",
then Solr finds all documents with "create", not "creation".
How does Solr behave with a stemmer or lemmatizer? Does it index both the
original and the lemmatized word?


Re: collapse filter query

2016-01-15 Thread Joel Bernstein
The bug only occurs if you collapse on a numeric field. If you can re-index
the field into a String field, it should work fine.

You can also use grouping with facets. Depending on your use case, this might
be your best choice:

https://cwiki.apache.org/confluence/display/solr/Result+Grouping
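
As a rough sketch of both options on 5.3.1 (original_post_id_str below is a
hypothetical string copy of your numeric field, e.g. populated via copyField):

  fq={!collapse field=original_post_id_str nullPolicy=expand}

or, with result grouping:

  q=*:*&group=true&group.field=original_post_id_str&group.limit=1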

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jan 15, 2016 at 3:04 PM, sara hajili  wrote:

> Tnx Joel.
> I wanted to get distinct results from Solr, so I found two approaches: the
> collapse filter and facets.
> More Like This doesn't support facets, and as you said Solr 5.3 has a bug in
> the collapse filter.
> If I don't want to migrate to Solr 5.4,
> is there any other approach to get distinct values that I can use in Solr 5.3.1?
> On Jan 12, 2016 1:39 AM, "Joel Bernstein"  wrote:
>
> > I went to go work on the issue and found it was already fixed 7 weeks
> ago.
> > The bug fix is available in Solr 5.4.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Mon, Jan 11, 2016 at 3:12 PM, Susheel Kumar 
> > wrote:
> >
> > > You can go to https://issues.apache.org/jira/browse/SOLR/ and create
> > Jira
> > > ticket after signing in.
> > >
> > > Thanks,
> > > Susheel
> > >
> > > On Mon, Jan 11, 2016 at 2:15 PM, sara hajili 
> > > wrote:
> > >
> > > > Tnx.How I can create a jira ticket?
> > > > On Jan 11, 2016 10:42 PM, "Joel Bernstein" 
> wrote:
> > > >
> > > > > I believe this is a bug. I think the reason this is occurring is
> that
> > > you
> > > > > have an index segment with no values at all in the collapse field.
> If
> > > you
> > > > > could create a jira ticket for this I will look at resolving the
> > issue.
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Mon, Jan 11, 2016 at 2:03 PM, sara hajili <
> hajili.s...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I am using solr 5.3.1
> > > > > > On Jan 11, 2016 10:30 PM, "Joel Bernstein" 
> > > wrote:
> > > > > >
> > > > > > > Which version of Solr are you using?
> > > > > > >
> > > > > > > Joel Bernstein
> > > > > > > http://joelsolr.blogspot.com/
> > > > > > >
> > > > > > > On Mon, Jan 11, 2016 at 6:39 AM, sara hajili <
> > > hajili.s...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > hi all
> > > > > > > > i have a MLT query and i wanna to use collapse filter query.
> > > > > > > > and i wanna to use collapse expand nullPolicy.
> > > > > > > > in this way when i used it :
> > > > > > > > {!collapse field=original_post_id nullPolicy=expand}
> > > > > > > > i got my appropriate result .
> > > > > > > > (in solr web UI)
> > > > > > > >
> > > > > > > > but in regular search handler "/select",when i used
> > > > > > > > {!collapse field=original_post_id nullPolicy=expand}
> > > > > > > > i got error:
> > > > > > > >
> > > > > > > > {
> > > > > > > >   "responseHeader":{
> > > > > > > > "status":500,
> > > > > > > > "QTime":2,
> > > > > > > > "params":{
> > > > > > > >   "q":"*:*",
> > > > > > > >   "indent":"true",
> > > > > > > >   "fq":"{!collapse field=original_post_id
> > > nullPolicy=expand}",
> > > > > > > >   "wt":"json"}},
> > > > > > > >   "error":{
> > > > > > > > "trace":"java.lang.NullPointerException\n\tat
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.CollapsingQParserPlugin$IntScoreCollector.finish(CollapsingQParserPlugin.java:763)\n\tat
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:211)\n\tat
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1678)\n\tat
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1497)\n\tat
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:555)\n\tat
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:522)\n\tat
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:277)\n\tat
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat
> > > > > > > >
> org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)\n\tat
> > > > > > > >
> > > > > >
> > > >
> > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)\n\tat
> > > > > > 

Re: Solr Block join not working after parent update

2016-01-15 Thread Jack Krupansky
Read the note at the bottom of the doc page:
"One limitation of indexing nested documents is that the whole block of
parent-children documents must be updated together whenever any changes are
required. In other words, even if a single child document or the parent
document is changed, the whole block of parent-child documents must be
indexed together."

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments

As Mikhail indicated, "*the whole block of parent-child documents must be
indexed together.*" They must also be updated together.
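
A minimal SolrJ sketch of what that means in practice (the collection name,
field names and ids below are made up for illustration): to change anything on
the parent, re-send the parent together with every one of its children as a
single block:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ReindexBlock {
    public static void main(String[] args) throws Exception {
        try (SolrClient client =
                 new HttpSolrClient("http://localhost:8983/solr/mycollection")) {
            SolrInputDocument parent = new SolrInputDocument();
            parent.addField("id", "parent-1");
            parent.addField("content_type", "parent");
            parent.addField("title", "updated parent title"); // the changed value

            SolrInputDocument child = new SolrInputDocument();
            child.addField("id", "child-1");
            child.addField("content_type", "child");
            parent.addChildDocument(child); // repeat for every child of this parent

            client.add(parent);  // parent + children are indexed together as one block
            client.commit();
        }
    }
}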

-- Jack Krupansky

On Fri, Jan 15, 2016 at 3:31 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> On Thu, Jan 14, 2016 at 10:01 PM, sairamkumar <
> sairam.subraman...@gmail.com>
> wrote:
>
> > This is a show stopper. Kindly suggest solution/alternative.
>
>
> update whole block.
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


Re: Query results change

2016-01-15 Thread Brian Narsi
Data is indexed using Data Import Handler with clean=true, commit=true and
optimize=true. After that there are no updates or delete.

The setup is SolrCloud with 2 shards and 2 replicas each.

If the data and query have not changed, one expects to see the same results
on repeated searches; so it is a matter of users' confidence in search
results.

Thanks

On Fri, Jan 15, 2016 at 10:12 AM, Erick Erickson 
wrote:

> Probably the fact that information from deleted/updated
> documents is still hanging around in the corpus until
> merged away.
>
> The nub of the issue is that terms in deleted documents
> (or the replaced doc if you update) still influence tf/idf
> calculations. If you optimize as Binoy suggests, all of
> the information relating to deleted docs is removed.
>
> If this is a SolrCloud setup, you can be getting
> scores from different replicas of the same shard. Due to
> the fact that merging (which purges deleted information)
> can occur at different times on different replicas, the scores
> calculated for a particular doc might be different depending
> on which replica calculated it.
>
> In either setup (SolrCloud or not), background merging can
> change the result order by removing information associated
> with deleted docs.
>
> All that said, does this have _practical_ consequences or
> is this mostly a curiosity question?
>
> Best,
> Erick
>
> On Fri, Jan 15, 2016 at 5:40 AM, Binoy Dalal 
> wrote:
> > You should try debugging such queries to see how exactly they're being
> > executed.
> > That will give you an idea as to why you're seeing the results you see.
> >
> > On Fri, 15 Jan 2016, 19:05 Brian Narsi  wrote:
> >
> >> We have an index of 25 fields. Currently number of records in index is
> >> about 120,000. We are using
> >>
> >> parser: edismax
> >>
> >> qf: contains 8 fields
> >>
> >> fq: 1 field
> >>
> >> mm = 1
> >>
> >> qs = 6
> >>
> >> pf: containing 3 fields
> >>
> >> bf: containing 1 field
> >>
> >> We have noticed that sometimes results change between two searches even
> if
> >> everything is constant.
> >>
> >> What we have identified is if we reindex data and optimize it remedies
> the
> >> situation.
> >>
> >> Is that expected behavior? Or should we also look into other factors?
> >>
> >> Thanks
> >>
> > --
> > Regards,
> > Binoy Dalal
>


Re: fq degrades qtime in a 20million doc collection

2016-01-15 Thread Anria B.
Thanks Toke for this.  It gave us a ton to think about, and it really helps
support the notion of several smaller indexes over one very large one,
where we can distribute a few JVM processes with a smaller heap each, rather than
have one massive one that is, according to this, less efficient.



Toke Eskildsen wrote
> I would guess the 100 ms improvement was due to a factor not related to
> heap size. With the exception of a situation where the heap is nearly
> full, increasing Xmx will not improve Solr performance significantly.
> 
> Quick note: Never set Xmx in the range 32GB-40GB (40GB is approximate):
> At the 32GB point, the JVM switches to larger pointers, which means that
> effective heap space is _smaller_ for Xmx=33GB than it is for Xmx=31GB:
> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
> 
> - Toke Eskildsen, State and University Library, Denmark





--
View this message in context: 
http://lucene.472066.n3.nabble.com/fq-degrades-qtime-in-a-20million-doc-collection-tp4250567p4251176.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: state.json base_url has internal IP of ec2 instance set instead of 'public DNS' entry in some cases

2016-01-15 Thread Chris Hostetter

: What I’m finding is that now and then base_url for the replica in 
: state.json is set to the internal IP of the AWS node. i.e.:
: 
: "base_url":"http://10.29.XXX.XX:8983/solr”,
: 
: On other attempts it’s set to the public DNS name of the node:
: 
: "base_url":"http://ec2_host:8983/solr”,
: 
: In my /etc/defaults/solr.in.sh I have:
: 
: SOLR_HOST=“ec2_host”
: 
: which I thought is what I needed to get the public DNS name set in base_url. 

i believe you are correct.  the "now and then" part of your question is 
weird -- it seems to indicate that sometimes the "correct" thing is 
happening, and other times it is not.  

/etc/defaults/solr.in.sh isn't the canonical path for solr.in.sh 
according to the docs/install script for running a production solr 
instance...

https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-ServiceInstallationScript

...how *exactly* are you running Solr on all of your nodes?

because my guess is that you've got some kind of inconsistent setup where 
sometimes when you start up (or restart) a new node it does refer to your 
solr.in.sh file, and other times it does not -- so sometimes Solr never 
sees your SOLR_HOST option.  In those cases, when it registers itself with 
ZooKeeper it uses the current IP as a fallback, and then that info gets 
baked into the metadata for the replicas that get created on that node 
at that point in time.

FWIW, you should be able to spot check that the SOLR_HOST is being applied 
correctly by looking at the java process command line args (using ps, or 
loading the Solr UI in your browser) and checking for the "-Dhost=..." 
option -- if it's not there, then your solr.in.sh probably wasn't read in 
correctly.



-Hoss
http://www.lucidworks.com/

Re: fq degrades qtime in a 20million doc collection

2016-01-15 Thread Yonik Seeley
On Wed, Jan 13, 2016 at 7:01 PM, Shawn Heisey  wrote:
[...]
>> 2.   q=*&fq=someField:SomeVal   ---> takes 2.5 seconds
>> 3.   q=someField:SomeVal   -->  300ms
[...]
>>
>> have any of you encountered such a thing?
>> that FQ degrades query time by so much?
> A value of * for your query will be slow.  This is a wildcard query.

Some of the responses in this thread led me to believe that the
important part of Shawn's original answer was overlooked.
Most likely "fq" was not slowing down the request, the additional
wildcard query "q=*" was.
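
As a concrete illustration (using the field name from the original post),
compare:

  q=*:*&fq=someField:SomeVal   (match-all-docs query; the fq does the filtering)
  q=*&fq=someField:SomeVal     (q=* is a wildcard query over every term, which
                                is where the extra ~2 seconds go)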

-Yonik


Re: fq degrades qtime in a 20million doc collection

2016-01-15 Thread Anria B.
hi Yonik

We definitely didn't overlook the fact that q=* is a wildcard scan; we just had so
many systemic problems to focus on that I neglected to thank Shawn for that
particular piece of useful information. 

I must admit, I seriously never knew this. Ever since q=* was allowed I was
so happy that it never occurred to me to investigate its details.   Now I
know :)

Combining all the information from everybody here really brought home where
our shortcomings were

1. yes, the q=* was quickly replaced by q=*:* everywhere - quick win
2. caching strategies are being reformed 
3. We're looking into making smaller shards / cores since we do require
super frequent commits, so on smaller bitsets the commit times should be way
less, and we can use the smaller heap sizes to stay optimized in that realm

One last question though, please:

Schema investigations: the fq clauses are frequently on multivalued string
fields, and we believe that may also be slowing down the fq even more,
but we were wondering why.   When we run an fq on single-valued fields it is
faster than on the multi-valued fields, even when the multi-valued fields
frequently have only a single value in them.

Thanks again for everybody's help and pointers and hints, you kept us busy
with changing our mindset on a lot of things here.

Regards
Anria



--
View this message in context: 
http://lucene.472066.n3.nabble.com/fq-degrades-qtime-in-a-20million-doc-collection-tp4250567p4251212.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: fq degrades qtime in a 20million doc collection

2016-01-15 Thread Toke Eskildsen
Anria B.  wrote:
> Thanks Toke for this.  It gave us a ton to think about, and it really helps
> supporting the notion of several smaller indexes over one very large one,>
> where we can rather distribute a few JVM processes with less size each, than
> have one massive one that is according to this, less efficient.

There are not many clear-cut answers in Solr land...

There is a fixed overhead to running a Solr instance and you need to have some 
wriggle room in the heap for temporary peaks, such as index updates. This calls 
for a few or only a single Solr instance handling multiple collections.

On the other hand, large Java heaps are prone to long stop-the-world garbage 
collections and there is the memory overhead when exceeding 32GB.


Locally we run 50 Solr instances with 8GB heap each, each holding a single 
shard. At some point I would like to try changing this to 25 instances with 
15GB and 2 shards or maybe 12 instances with 28GB and 4 shards. I will not 
exceed 31GB in a single JVM unless forced to.
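
(If you want to double-check this on your own JVM -- a generic JVM tip, not
anything Solr-specific -- running
"java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops"
shows whether compressed pointers are still in effect at a given heap size.)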

- Toke Eskildsen


Re: state.json base_url has internal IP of ec2 instance set instead of 'public DNS' entry in some cases

2016-01-15 Thread Brendan Grainger
Hi Hoss,

Thanks for the reply. I installed the service using the install script. I 
double checked it and it looks like it installs solr.in.sh as 
/etc/defaults/solr.in.sh. It actually looks like if it is in /var the install 
script moves it into /etc/defaults (unless I’m reading this wrong):

https://github.com/apache/lucene-solr/blob/trunk/solr/bin/install_solr_service.sh#L281
 


I checked the process and even on restarts it looks like this:

ps aux | grep solr
  my_solr_user  9522  0.2  1.5 3010216 272656 ?  Sl   20:06   0:26 
/usr/lib/jvm/java-8-oracle/bin/java -server -Xms512m -Xmx512m -XX:NewRatio=3 
-XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ConcGCThreads=4 
-XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark 
-XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly 
-XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 
-XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -verbose:gc 
-XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution 
-XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log 
-DzkClientTimeout=15000 -DzkHost=ec2_host:2181/solr -Djetty.port=8983 
-DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dhost=ec2_host -Duser.timezone=UTC 
-Djetty.home=/opt/solr/server -Dsolr.solr.home=/var/solr/data 
-Dsolr.install.dir=/opt/solr 
-Dlog4j.configuration=file:/var/solr/log4j.properties -Xss256k -jar start.jar 
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs 
--module=http

Note I replaced the user I’m running it as with my_solr_user and the actual ec2 
public DNS with e2_host in the output above.

I am new to SolrCloud so it’s more than likely I’ve screwed up some 
configuration setting somewhere.

Thank you for your help,
Brendan

> On Jan 15, 2016, at 6:07 PM, Chris Hostetter  wrote:
> 
> 
> : What I’m finding is that now and then base_url for the replica in 
> : state.json is set to the internal IP of the AWS node. i.e.:
> : 
> : "base_url":"http://10.29.XXX.XX:8983/solr”,
> : 
> : On other attempts it’s set to the public DNS name of the node:
> : 
> : "base_url":"http://ec2_host:8983/solr”,
> : 
> : In my /etc/defaults/solr.in.sh I have:
> : 
> : SOLR_HOST=“ec2_host”
> : 
> : which I thought is what I needed to get the public DNS name set in 
> base_url. 
> 
> i believe you are correct.  the "now and then" part of your question is 
> weird -- it seems to indicate that sometimes the "correct" thing is 
> happening, and other times it is not.  
> 
> /etc/defaults/solr.in.sh isn't the canonical path for solr.in.sh 
> according to the docs/install script for running a production solr 
> instance...
> 
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-ServiceInstallationScript
> 
> ...how *exactly* are you running Solr on all of your nodes?
> 
> because my guess is that you've got some kind of inconsistent setup where 
> sometimes when you start up (or restart) a new node it does refer to your 
> solr.in.sh file, and other times it does not -- so sometimes Solr never 
> sees your SOLR_HOST option.  In those cases, when it registers itself with 
> ZooKeeper it uses the current IP as a fallback, and then that info gets 
> baked into the metadata for the replicas that get created on that node 
> at that point in time.
> 
> FWIW, you should be able to spot check that the SOLR_HOST is being applied 
> correctly by looking at the java process command line args (using ps, or 
> loading the Solr UI in your browser) and checking for the "-Dhost=..." 
> option -- if it's not there, then your solr.in.sh probably wasn't read in 
> correctly.
> 
> 
> 
> -Hoss
> http://www.lucidworks.com/



Re: state.json base_url has internal IP of ec2 instance set instead of 'public DNS' entry in some cases

2016-01-15 Thread Brendan Grainger
Hi Hoss,

Thanks for your help. Going over the install page again I realized I had 
originally not adjusted the value of SOLR_HOST and it had started up using the 
default internal IP. I changed that to the public DNS and restarted Solr. 
However, in /live_nodes I then had 2 values: one for the public DNS and one for 
the internal IP. It looks like the old entry didn't get removed. I removed it 
using the ZooKeeper CLI and all is working fine now.
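
In case it helps anyone else, the commands I used were roughly the following
(ZooKeeper's own zkCli.sh; the /solr prefix is the chroot from my zkHost, and
the node name is just an example):

  bin/zkCli.sh -server ec2_host:2181
  ls /solr/live_nodes
  delete /solr/live_nodes/10.29.XXX.XX:8983_solr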

I’m unsure, but wondering if the behavior I saw is somehow related to this: 
http://www.gossamer-threads.com/lists/lucene/java-dev/297790
However, as I said I’m pretty new to this so I could be completely wrong.

Thanks again
Brendan


> On Jan 15, 2016, at 6:07 PM, Chris Hostetter  wrote:
> 
> 
> : What I’m finding is that now and then base_url for the replica in 
> : state.json is set to the internal IP of the AWS node. i.e.:
> : 
> : "base_url":"http://10.29.XXX.XX:8983/solr”,
> : 
> : On other attempts it’s set to the public DNS name of the node:
> : 
> : "base_url":"http://ec2_host:8983/solr”,
> : 
> : In my /etc/defaults/solr.in.sh I have:
> : 
> : SOLR_HOST=“ec2_host”
> : 
> : which I thought is what I needed to get the public DNS name set in 
> base_url. 
> 
> i believe you are correct.  the "now and then" part of your question is 
> weird -- it seems to indicate that sometimes the "correct" thing is 
> happening, and other times it is not.  
> 
> /etc/defaults/solr.in.sh isn't the canonical path for solr.in.sh 
> according to the docs/install script for running a production solr 
> instance...
> 
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-ServiceInstallationScript
> 
> ...how *exactly* are you running Solr on all of your nodes?
> 
> because my guess is that you've got some kind of inconsistent setup where 
> sometimes when you start up (or restart) a new node it does refer to your 
> solr.in.sh file, and other times it does not -- so sometimes Solr never 
> sees your SOLR_HOST option.  In those cases, when it registers itself with 
> ZooKeeper it uses the current IP as a fallback, and then that info gets 
> baked into the metadata for the replicas that get created on that node 
> at that point in time.
> 
> FWIW, you should be able to spot check that the SOLR_HOST is being applied 
> correctly by looking at the java process command line args (using ps, or 
> loading the Solr UI in your browser) and checking for the "-Dhost=..." 
> option -- if it's not there, then your solr.in.sh probably wasn't read in 
> correctly.
> 
> 
> 
> -Hoss
> http://www.lucidworks.com/



Re: Position increment in WordDelimiterFilter.

2016-01-15 Thread Emir Arnautovic

Modassar,
Are you saying that WiFi, Wi-Fi and Wi Fi should not match each other? 
Why do you use WordDelimiterFilter? Can you give us a few examples where 
it is useful?


Thanks,
Emir

On 15.01.2016 05:13, Modassar Ather wrote:

Thanks for your responses.

It seems to me that you don't want to split on numbers.
It is not with numbers only. Even if you try to analyze WiFi it will create
4 tokens, one of which will be at position 2. So basically the issue is with
the position increment, which causes a few of the queries to behave unexpectedly.

Which release of Solr are you using?
I am using Lucene/Solr-5.4.0.

Best,
Modassar

On Thu, Jan 14, 2016 at 9:44 PM, Jack Krupansky 
wrote:


Which release of Solr are you using? Last year (or so) there was a Lucene
change that had the effect of keeping all terms for WDF at the same
position. There was also some discussion about whether this was either a
bug or a bug fix, but I don't recall any resolution.

-- Jack Krupansky

On Thu, Jan 14, 2016 at 4:15 AM, Modassar Ather 
wrote:


Hi,

I have the following definition for WordDelimiterFilter:



The analysis of 3d shows the following four tokens and their positions:

token   position
3d      1
3       1
3d      1
d       2

Please help me understand why d is at 2? Should it not also be at
position 1?
Is it a bug, and if not, is there any attribute which I can use to restrict
the position increment?

Thanks,
Modassar



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Solr Block join not working after parent update

2016-01-15 Thread sairamkumar
Hi,
Solr search with child field(s) is not working after an update in the parent
field(s). Parent entity has 20 million and child has 30 million records.

This is a show stopper. Kindly suggest solution/alternative.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Block-join-not-working-after-parent-update-tp4250936.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Monitor backup progress when location parameter is used.

2016-01-15 Thread Gian Maria Ricci - aka Alkampfer
Ok thanks, I also think that it's worth a Jira, because for the restore operation 
we have a convenient restorestatus command that tells exactly the status of the 
restore operation, so I think that a backupstatus command could be useful.

--
Gian Maria Ricci
Cell: +39 320 0136949


-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: giovedì 14 gennaio 2016 17:00
To: solr-user@lucene.apache.org
Subject: Re: Monitor backup progress when location parameter is used.

I think the doc is wrong or at least misleading:
https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups+of+SolrCores

"The backup operation can be monitored to see if it has completed by sending 
the details command to the /replication handler..."

From reading the code, it looks like the snapshot details are only stored and 
returned after the snapshot completes, whether it succeeds or fails, but there 
is nothing set or reported while a snapshot is in progress. So, if you don't see 
a "backup" section in the response, that means the snapshot is still in progress.
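
So a simple way to poll (hypothetical core name, matching the check described
above) is to repeat:

  http://localhost:8983/solr/mycore/replication?command=details&wt=json

and treat the absence of a "backup" section in the response as "still in
progress"; once that section shows up, it tells you whether the snapshot
succeeded or failed.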

I think it's worth a Jira - either to improve the doc or improve the code to 
report backup as "inProgress... StartedAt...".

You can also look at the log... "Creating backup snapshot" indicates the backup 
has started and "Done creating backup snapshot" indicates success or "Exception 
while creating snapshot" indicates failure. If only that first message appears, 
it means the backup is still in progress.


-- Jack Krupansky

On Thu, Jan 14, 2016 at 9:23 AM, Gian Maria Ricci - aka Alkampfer < 
alkamp...@nablasoft.com> wrote:

> If I start a backup operation using the location parameter
>
>
>
> *http://localhost:8983/solr/mycore/replication?command=backup=myc
> ore&
>  ore&>location=z:\temp\backupmycore*
>
>
>
> How can I monitor when the backup operation is finished? Issuing a 
> standard *details* operation
>
>
>
> *http://localhost:8983/solr/  mycore
> /replication?command=details*
>
>
>
> does not gives me useful information, because there are no information 
> on backup on returning data.
>
>
>
>
>
> 
>
>
>
> 
>
> 0
>
> 1
>
> 
>
> 
>
> 57.62 GB
>
>  name="indexPath">X:\NoSql\Solr\solr-5.3.1\server\solr\mycore\data\inde
> x/
>
> 
>
> 
>
> 1452534703494
>
> 1509
>
> 
>
> _2cw.fdt
>
> _2cw.fdx
>
> _2cw.fnm
>
> _2cw.nvd
>
> _2cw.nvm
>
> _2cw.si
>
> _2cw_Lucene50_0.doc
>
> _2cw_Lucene50_0.dvd
>
> _2cw_Lucene50_0.dvm
>
> _2cw_Lucene50_0.pos
>
> _2cw_Lucene50_0.tim
>
> _2cw_Lucene50_0.tip
>
> segments_15x
>
> 
>
> 
>
> 
>
> true
>
> false
>
> 1452534703494
>
> 1509
>
> 
>
>  name="confFiles">schema.xml,stopwords.txt,elevate.xml
>
> 
>
> optimize
>
> 
>
> true
>
> 
>
> 
>
>
>
> 
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
>


RE: Pro and cons of using Solr Cloud vs standard Master Slave Replica

2016-01-15 Thread Gian Maria Ricci - aka Alkampfer
Yes, I checked that Jira some weeks ago and it is the reason why I was 
saying that there is still no clear procedure to back up SolrCloud in the current 
latest version.  I'm glad that the priority is Major, but until it is 
closed in an official version, I have to tell customers that there is no 
easy and supported backup procedure for a SolrCloud configuration :(.

--
Gian Maria Ricci
Cell: +39 320 0136949



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: giovedì 14 gennaio 2016 16:46
To: solr-user 
Subject: Re: Pro and cons of using Solr Cloud vs standard Master Slave Replica

re: SolrCloud backup/restore: https://issues.apache.org/jira/browse/SOLR-5750

not committed yet, but getting attention.



On Thu, Jan 14, 2016 at 6:19 AM, Gian Maria Ricci - aka Alkampfer 
 wrote:
> Actually there are situation where a restore is needed, suppose that someone 
> does some error and deletes all documents from a collection, or maybe deletes 
> a series of document, etc. I know that this is not likely to happen, but in 
> mission critical enterprise system, we always need a detailed procedure for 
> disaster recovering.
>
> For such scenario we need to plan the worst case, where everything is lost.
>
> With Master Slave is just a matter of recreating machines, reconfigure the 
> core, and restore a backup, and the game is done, with SolrCloud is not 
> really clear for me how can I backup / restore data. From what I've found in 
> the internet I need to backup every shard of the collection, and, if we need 
> to restore everything from a backup, we can recreate the collection and then 
> restore all the individual shards. I do not know if this is a supported 
> scenario / procedure, but theoretically it could work.
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
> -Original Message-
> From: Alessandro Benedetti [mailto:abenede...@apache.org]
> Sent: giovedì 14 gennaio 2016 10:46
> To: solr-user@lucene.apache.org
> Subject: Re: Pro and cons of using Solr Cloud vs standard Master Slave 
> Replica
>
> It's true that SolrCloud is adding some complexity.
> But few observations :
>
> SolrCloud has some disadvantages and can't beat the easiness and 
> simpleness
>> of
>> Master Slave Replica. So I can only encourage to keep Master Slave 
>> Replica in future versions.
>
>
> I agree, it can happen situations when you have really simple and not 
> critical systems.
> Anyway old style replication is still used in SolrCloud, so I think it is 
> going to stay for a while ( until is replaced with something else) .
>
> To answer to Gian :
>
> One of the problem I've found is that I've not found a simple way to 
> backup
>> the content of a collection to restore in situation of disaster recovery.
>> With simple master / slave scenario we can use the replication 
>> handler to generate backups that can be easily used to restore 
>> content of a core, while with SolrCloud is not clear how can we 
>> obtain a full backup
>
>
> To be fair, Disaster recovery is when SolrCloud shines.
> If you lose random nodes across your collection, you simply need to fix them 
> and spin up again .
> The system will automatically restore the content to the last version 
> available ( the tlog first and the  leader ( if the tlog is not enough) will help 
> the dead node to catch up .
> If you lose all the replicas for a shard and you lose the content in disk of 
> all this replicas ( index and tlog), SolrCloud can't help you.
> For this unlikely scenarios a backup is suggested.
> You could restore anyway the backup only to one node, and the replicas are 
> going to catch up .
>
> Probably is just a matter of backupping every shard with standard
>> replication handler and then restore each shard after recreating the 
>> collection
>
>
> Definitely not, SolrCloud is there to avoid this manual stuff.
>
> Cheers
>
>
> On 14 January 2016 at 08:58, Gian Maria Ricci - aka Alkampfer < 
> alkamp...@nablasoft.com> wrote:
>
>> I agree that SolrCloud has not only advantages, I really understand 
>> that it offers many more features, but it introduces some complexity.
>>
>> One of the problem I've found is that I've not found a simple way to 
>> backup the content of a collection to restore in situation of disaster
>> recovery. With simple master / slave scenario we can use the 
>> replication handler to generate backups that can be easily used to 
>> restore content of a core, while with SolrCloud is not clear how can we 
>> obtain a full backup.
>> Probably is just a matter of backupping every shard with standard 
>> replication handler and then restore each shard after recreating the 
>> collection, but I've not found (probably I need to search better) 
>> official documentation on backup / restore procedures for SolrCloud.
>>
>> Thanks.
>>
>> --
>> Gian Maria Ricci
>> Cell: +39 320 0136949
>>
>>
>> -Original Message-
>> From: 

Re: Solr Block join not working after parent update

2016-01-15 Thread Mikhail Khludnev
On Thu, Jan 14, 2016 at 10:01 PM, sairamkumar 
wrote:

> This is a show stopper. Kindly suggest solution/alternative.


update whole block.


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Speculation on Memory needed to efficently run a Solr Instance.

2016-01-15 Thread Gian Maria Ricci - aka Alkampfer
Hi, 

 

When it is time to calculate how much RAM a solr instance needs to run with
good performance, I know that it is some form of art, but I'm looking at a
general "formula" to have at least one good starting point.

 

Apart the RAM devoted to Java HEAP, that is strongly dependant on how I
configure caches, and the distribution of queries in my system, I'm
particularly interested in the amount of RAM to leave to operating system to
use File Cache.

 

Suppose I have an index of 51 Gb of dimension, clearly having that amount of
ram devoted to the OS is the best approach, so all index files can be cached
into memory by the OS, thus I can achieve maximum speed.

 

But if I look at the detail of the index, in this particular example I see
that the bigger file has .fdt extension, it is the stored field for the
documents, so it affects retrieval of document data, not the real search
process. Since this file is 24 GB of size, it is almost half of the space of
the index.

 

My question is: it could be safe to assume that a good starting point for
the amount of RAM to leave to the OS is the dimension of the index less the
dimension of the .fdt file because it has less importance in the search
process?

 

Are there any particular setting at OS level (CentOS linux) to have maximum
benefit from OS file cache? (documentation at
https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-MemoryandGCSettings
does not have any information related to OS configuration). Elasticsearch
(https://www.elastic.co/guide/en/elasticsearch/reference/1.4/setup-configuration.html)
generally have some suggestions such as using mlockall, disable
swap etc etc, I wonder if there are similar suggestions for solr.

 

Many thanks for all the great help you are giving me in this mailing list. 

 

--
Gian Maria Ricci
Cell: +39 320 0136949

 

   


 



Issue in custom filter

2016-01-15 Thread Smitha Rajiv
Hi

I have a requirement such that, while indexing, if a token contains numbers,
they need to be converted into the corresponding words.

e.g.: term1 part 2 assignments -> termone part two assignments.

I have created a custom filter with following code:

@Override
public boolean incrementToken() throws IOException {
if (!input.incrementToken())
return false;
char[] buffer = charTermAttr.buffer();
String newTerm = new String(buffer);
convertedTerm = Converter.convert(newTerm);
charTermAttr.setEmpty();
charTermAttr.copyBuffer(convertedTerm.toCharArray(), 0, convertedTerm.length());
return true;

}
But it gives weird results when I analyze.

After applying the custom filter I am getting the result as
termone partone twoartone assignments.

It looks like the buffer length which I am setting for the first token is
not getting reset while picking up the next token. I have a feeling that
somewhere I am messing up the offsets.

Could you please help me with this?

Thanks & Regards,
Smitha